The Heptalogia, or the Seven against Sense. A Cap with Seven Bells (1880)
Tristram of Lyonesse (1882)
A Century of Roundels (1883)
A Midsummer Holiday and Other Poems (1884)
Poems and Ballads, Third Series (1889)
Astrophel and Other Poems (1894)
The Tale of Balen (1896)
A Channel Passage and Other Poems (1904)
Data: Michael Field
Long Ago (1889)
Sight and Song (1892)
Underneath the Bough (1893)
Wild Honey from Various Thyme (1908)
Poems of Adoration (1912)
Mystic Trees (1913)
Whym Chow: Flame of Love (1914)
Katherine Bradley & Edith Cooper, aka Michael Field
Data: Thomas Hardy
Wessex Poems and Other Verses (1898)
Poems of the Past and the Present (1901)
Time’s Laughingstocks and Other Verses (1909)
Satires of Circumstance (1914)
Moments of Vision (1917)
Late Lyrics and Earlier with Many Other Verses (1922)
Human Shows, Far Phantasies, Songs and Trifles (1925)
Who Uses “Re-” Words More / Less?
Keyword Frequency Analysis
What percent of each poet’s corpus is composed of words with the prefix “re-”?
Compare “Re-” Word Percentages
Code
# Import software libraries
import nltk
import os
import matplotlib.pyplot as plt

# Download NLTK tokenizer data
nltk.download('punkt')

# Define a function that tokenizes a text and then counts how many of its words start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Specify the directory paths for the poets' corpora
corpus_directories = {
    'swinburne': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    'hardy': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    'michael field': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    'dg rossetti': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
}

# Initialize a dictionary to store the results
percentage_re_words = {}

# Create an empty list for each poet/directory
for poet, corpus_directory in corpus_directories.items():
    corpus = []
    # Read the text files in the poet's corpus and add them to the list above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)
    # Apply the function above to tokenize and count the "re-" words in each text of the poet's corpus
    re_word_count = sum(count_re_words(text) for text in corpus)
    # Calculate the corpus word count: tokenize each text, count its tokens, and sum the counts
    total_words = sum(len(nltk.word_tokenize(text)) for text in corpus)
    # Calculate the percentage of "re-" words in the poet's corpus: divide the "re-" word count by the total word count and multiply by 100
    percentage_re_words[poet] = (re_word_count / total_words) * 100

# Sort the results from largest to smallest
sorted_results = sorted(percentage_re_words.items(), key=lambda x: x[1], reverse=True)

# Extract poets and percentages for plotting
poets, percentages = zip(*sorted_results)

# Visualize the results in a bar chart
plt.figure(figsize=(8, 6))
plt.bar(poets, percentages, color=['blue', 'orange'])
plt.ylabel('Percentage (%)')
plt.title('Whose Poetry is More Composed of Words that Start with "Re-"?')

# Set the y-axis limit based on the largest percentage
ylim_percentage = max(percentages) * 2  # Adjusted for better visualization
plt.ylim(0, ylim_percentage)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the bar chart with the poets' names angled to avoid overlap
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
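To sanity-check the counting step in isolation, here is a minimal sketch that runs count_re_words on an illustrative line (invented for demonstration, not drawn from the corpora). It relies on the fact that NLTK's default tokenizer keeps hyphenated words such as "re-illume" together as single tokens:

# Minimal sanity check for count_re_words (illustrative input, not a corpus text)
sample = "To revive and re-illume, in dumb re-enactment and spectral review."
print(count_re_words(sample))  # prints 2: only "re-illume" and "re-enactment" begin with "re-"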
Who Uses “Re-” Words More / Less?
Which “Re-” Words Are Most Frequent?
Keyword Frequency Analysis
Which “re-” words are used and how often?
Code
# Import software libraries / dependencies
import nltk
import os
import re
import matplotlib.pyplot as plt
from collections import Counter
from nltk.stem import SnowballStemmer

# Download NLTK tokenizer data
nltk.download('punkt')

# Initialize the stemmer for English
stemmer = SnowballStemmer("english")

# Create a function to preprocess and process the files of each directory; it begins by creating an empty list for each directory/poet
def process_directory(corpus_directory):
    corpus = []
    # Get the directory (poet's) name to use as the label
    label = os.path.basename(corpus_directory)
    # Step 1: Read the text of the corpus from its files and add it to the poet's list, created above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)
    # Step 2: Tokenize each text into individual words
    tokenized_corpus = [nltk.word_tokenize(text) for text in corpus]
    # Step 3: Find words that, when lowercased, start with "re-"; standardize (lower) their case; stem them; retain them
    stemmed_corpus = []
    for tokens in tokenized_corpus:
        stemmed_tokens = [stemmer.stem(word.lower()) for word in tokens if re.match(r'\b(re-)\w+', word.lower())]
        stemmed_corpus.append(stemmed_tokens)
    # Step 4: Count the frequency of each "re-" word
    word_counts = Counter(word for tokens in stemmed_corpus for word in tokens)
    # Step 5: Display the most frequent words
    # most_common_re_words = word_counts.most_common(20)  # Set the desired number of top words
    # for word, count in most_common_re_words:
    #     print(f'The poetry of {label[:-5]} uses {word}: {count} times')
    # Step 6: Plot the top 20 most frequent words and their counts in a bar chart; use the label to title the plot
    plt.figure(figsize=(10, 5))
    top_words, top_counts = zip(*word_counts.most_common(20))
    plt.bar(top_words, top_counts)
    plt.title(f'Frequency of Stemmed Re- Words in {label[:-5].capitalize()}\'s Poetry')
    plt.xticks(rotation=65)
    plt.show()

# Directories to process
corpus_directories = [
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
    # Add more directories here
]

# Process each directory
for directory in corpus_directories:
    process_directory(directory)
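Because the stemmer collapses inflected forms into a single stem before counting, the bar chart labels show stems rather than surface forms. A minimal sketch with illustrative "re-" tokens makes the stemming step inspectable on its own (the exact stems are whatever the Snowball algorithm produces):

# Minimal sketch: inspect the stems that the counting step groups by (illustrative tokens)
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
for word in ["re-enactment", "re-greeting", "re-illumed"]:
    print(word, "->", stemmer.stem(word.lower()))  # the chart counts these stems, not the raw forms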
“Two Rosalinds”: Time’s Laughingstocks and Other Verses (1909)
“For Life I had never cared greatly”: Moments of Vision and Miscellaneous Verses (1917)
Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?
Key (Re-) Word in Context (KWIC)
Have Python return Hardy’s sentences that contain both a word starting with “re-” (e.g., “re-illume”) and a word starting with “re” (e.g., “revive”)
Code
# Import libraries
import os  # open directories on the local machine
import nltk  # NLP
import re  # regular expressions
from nltk.tokenize import sent_tokenize, word_tokenize  # break a text blob into individual sentences and words

# Define regular expressions
re_pattern = r'\b(?:[Rr]e|[Rr]E)\w+\b'  # Matches "re", "Re", "rE", or "RE" followed by one or more letters (not a hyphen)
re_hyphen_pattern = r'\b(?:[Rr]e-|[Rr]E-)\w+\b'  # Matches "re-", "Re-", "rE-", or "RE-" followed by one or more letters
# \b matches a word boundary.
# (?:[Rr]e|[Rr]E) is a non-capturing group that matches "re", "Re", "rE", or "RE"
# \w+ matches one or more word characters following "re", "Re", "rE", or "RE"

def matches_pattern(word, pattern):
    return re.search(pattern, word, re.IGNORECASE) is not None

# Define a function to surround matched words with asterisks
def bold_matched_words(match, pattern):
    return f'**{match.group()}**'

# Directory containing the text files
text_directory = "/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP"

# Iterate through the text files in the directory: open each file, read in its text, break the text into sentences
# and the sentences into words, identify the sentences with both "re" and "re-" words, and, for each such sentence,
# add asterisks around its "re"/"re-" words and print it
for filename in os.listdir(text_directory):
    if filename.endswith(".txt"):
        with open(os.path.join(text_directory, filename), "r", encoding="utf-8") as file:
            text = file.read()
            sentences = sent_tokenize(text)
            for sentence in sentences:
                words = word_tokenize(sentence)
                has_re = any(matches_pattern(word, re_pattern) for word in words)
                has_re_hyphen = any(matches_pattern(word, re_hyphen_pattern) for word in words)
                if has_re and has_re_hyphen:
                    sentence_with_bold = re.sub(re_pattern, lambda x: bold_matched_words(x, re_pattern), sentence)
                    sentence_with_bold = re.sub(re_hyphen_pattern, lambda x: bold_matched_words(x, re_hyphen_pattern), sentence_with_bold)
                    print(sentence_with_bold)
                    print("\n" * 2)
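Because \w+ excludes the hyphen, the two patterns never match the same token. A minimal sketch with illustrative words, reusing matches_pattern and the patterns defined above, shows the split; note that re_pattern also catches words like “read” that merely begin with the letters “re”, not only morphological re- prefixes:

# Minimal sketch: how the two patterns classify illustrative tokens
for word in ["revive", "re-illume", "read"]:
    print(word, matches_pattern(word, re_pattern), matches_pattern(word, re_hyphen_pattern))
# revive: True False; re-illume: False True; read: True False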
Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?
Within there
Too mocking to Love’s **re-expression**
Was Time’s **repartee**!
IX
“The words, sir?” cried a creature
Hovering mid the shine and shade as ’twixt the live world and the
tomb;
But the well-known numbers needed not for me a text or teacher
To **revive** and **re-illume**.
THE FLIRT’S TRAGEDY
(17–)
HERE alone by the logs in my chamber,
Deserted, decrepit—
Spent flames limning ghosts on the wainscot
Of friends I once knew—
My drama and hers begins weirdly
Its dumb **re-enactment**,
Each scene, sigh, and circumstance passing
In spectral **review**.
I take my holiday then and my **rest**
Away from the dun life here about me,
Old hours **re-greeting**
With the quiet sense that bring they must
Such throbs as at first, till I house with dust,
And in the numbness my heartsome zest
For things that were, be past **repeating**
When spring comes round.
And so, the rough highway forgetting,
I pace hill and dale
**Regarding** the sky,
**Regarding** the vision on high,
And thus **re-illumed** have no humour for letting
My pilgrimage fail.
So I don’t want to linger in this **re-decked** dwelling,
I feel too uneasy at the contrasts I behold,
And I make again for Mellstock to **return** here never,
And **rejoin** the roomy silence, and the mute and manifold
Souls of old.
. .
Anon I shall break it and bury its fragments
Where my grave is to be.”
THE **RE-ENACTMENT**
BETWEEN the folding sea-downs,
In the gloom
Of a wailful wintry nightfall,
When the boom
Of the ocean, like a hammering in a hollow tomb,
Throbbed up the copse-clothed valley
From the shore
To the chamber where I darkled,
Sunk and sore
With gray ponderings why my Loved one had not come before
To salute me in the dwelling
That of late
I had hired to waste a while in—
Vague of date,
Quaint, and **remote**—wherein I now expectant sate;
On the solitude, unsignalled,
Broke a man
Who, in air as if at home there,
Seemed to scan
Every fire-flecked nook of the apartment span by span.
When Are “Re-” Words More / Less Frequent?
Time Series / Term Frequency over Time
Y-axis: the book’s “re-” word percentage
X-axis: the book’s publication year
Code
# Import software libraries
import nltk  # NLP
import os  # interact with directories on the local machine
import re  # regular expressions
import matplotlib.pyplot as plt  # data visualization
import string  # punctuation removal

# Download NLTK tokenizer data
nltk.download('punkt')

# Define a function that takes each text, tokenizes it into individual words, and counts the words that, when lowercased, start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Define a function to remove all punctuation except hyphens
def remove_punctuation_except_hyphens(text):
    translator = str.maketrans('', '', string.punctuation.replace('-', ''))
    return text.translate(translator)

# Specify the parent directory containing multiple text file directories
parent_directory = '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/'

# Specific directories to process
all_directory_names = [
    'swinburne/swinburne_noBP',
    'hardy/hardy_noBP',
    'field/field_NoBP',
    # 'rossetti_dg/rossetti_dg_NoBP',
]

# Initialize a dictionary to store year-wise percentages
year_percentages = {}

# Process each directory
for directory_name in all_directory_names:
    # Construct the full path to the current directory
    text_files_directory = os.path.join(parent_directory, directory_name)
    # Initialize two dictionaries: one to store year-wise percentages for the current directory and one to store its file names
    if os.path.isdir(text_files_directory):
        directory_percentages = {}
        file_names = {}
        # Process each text file in the current directory
        for filename in os.listdir(text_files_directory):
            if filename.endswith('.txt'):
                # Extract the year from the filename using a regular expression
                year_match = re.match(r'(\d{4})_', filename)
                if year_match:
                    year = int(year_match.group(1))
                    # Read in the text of each book of poetry
                    with open(os.path.join(text_files_directory, filename), 'r', encoding='utf-8') as file:
                        text = file.read()
                    # Remove punctuation except hyphens
                    text = remove_punctuation_except_hyphens(text)
                    # Tokenize the text, count the total number of words, count the number of "re-" words, and normalize the counts
                    total_words = len(nltk.word_tokenize(text))
                    re_word_count = count_re_words(text)
                    percentage_re_words = (re_word_count / total_words) * 100
                    # Add to dictionary: key: publication year, value: percentage
                    directory_percentages[year] = percentage_re_words
                    # Extract the text between underscores in the filename
                    file_name_parts = filename.split('_')
                    if len(file_name_parts) > 2:
                        file_name = '_'.join(file_name_parts[1:-1])
                    else:
                        file_name = file_name_parts[1]
                    # Add to dictionary: key: year, value: filename
                    file_names[year] = file_name  # Store the extracted file name
        # Sort the dictionary by keys (years) for the current directory
        sorted_directory_percentages = {year: directory_percentages[year] for year in sorted(directory_percentages)}
        # Store the results for the current directory
        year_percentages[directory_name] = {
            'percentages': sorted_directory_percentages,
            'file_names': file_names  # Store file names for this directory
        }

# Plot the keys (years) and values (percentages) in a line graph for each directory
plt.figure(figsize=(10, 8))
for directory_name, data in year_percentages.items():
    percentages = data['percentages']
    file_names = data['file_names']
    years = list(percentages.keys())
    percentages = list(percentages.values())
    # Plot the data points
    plt.plot(years, percentages, marker='o', linestyle='-', label=directory_name[:-5])
    # Add annotations for each data point if the value is greater than 0
    for year, percentage in zip(years, percentages):
        if percentage > 0:
            file_name = file_names[year]
            annotation = f"{file_name}"
            plt.annotate(annotation, (year, percentage), textcoords="offset points", xytext=(0, 10), ha='center')

plt.ylim(0, 0.0525)
plt.xlabel('Year')
plt.ylabel('Percentage of Words Starting with "Re-"')
plt.title('When Are "Re-" Words Used in Field\'s, Hardy\'s, and Swinburne\'s Poetry?')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()  # Show legend indicating directory names

# Display the line graph
plt.tight_layout()
plt.show()
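The time series depends on each filename carrying the book’s publication year as a leading four-digit prefix followed by an underscore. A minimal sketch of that extraction step, using hypothetical filenames (the corpora’s actual file names are not shown here):

# Minimal sketch of the year-extraction step (hypothetical filenames)
import re
for filename in ["1898_wessex-poems_hardy.txt", "notes.txt"]:
    year_match = re.match(r'(\d{4})_', filename)
    print(filename, "->", int(year_match.group(1)) if year_match else "skipped (no leading year)")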
When Are “Re-” Words More / Less Frequent?
Are “Re-” Words Used Positively or Negatively?
Sentiment Analysis
determines a text’s emotional tone (positive / negative / neutral)
Code
# Import software libraries
import nltk  # NLP
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # VADER sentiment analyzer
import os  # enable engagement with directories and files on the local machine

# Download NLTK tokenizer data and the VADER lexicon (required by the analyzer)
nltk.download('punkt')
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to calculate the aggregate sentiment (positive / neutral / negative) of a collection of sentences
def calculate_aggregate_sentiment(sentences):
    # Start the counts at zero
    positive_score = 0
    negative_score = 0
    neutral_score = 0
    total_sentences = 0
    # For each sentence, calculate its sentiment polarity scores and add them to the running totals;
    # also keep a running count of the number of sentences analyzed
    for sentence in sentences:
        sentiment = sia.polarity_scores(sentence)
        positive_score += sentiment['pos']
        negative_score += sentiment['neg']
        neutral_score += sentiment['neu']
        total_sentences += 1
    # Normalize the scores by dividing each total by the number of sentences
    if total_sentences > 0:
        avg_positive_score = positive_score / total_sentences
        avg_negative_score = negative_score / total_sentences
        avg_neutral_score = neutral_score / total_sentences
        # Determine the aggregate sentiment
        if avg_positive_score > avg_negative_score:
            overall_sentiment = "Positive"
        elif avg_positive_score < avg_negative_score:
            overall_sentiment = "Negative"
        else:
            overall_sentiment = "Neutral"
        return {
            "Total Sentences Analyzed": total_sentences,
            "Average Positive Score": avg_positive_score,
            "Average Negative Score": avg_negative_score,
            "Average Neutral Score": avg_neutral_score,
            "Overall Sentiment": overall_sentiment,
        }
    else:
        return {"Overall Sentiment": "No sentences with 're-' words found for analysis."}

# Specify the directories to process, with aliases
corpus_directories = {
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP': "AC Swinburne",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP': "Thomas Hardy",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP': "Michael Field",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP': "DG Rossetti",
    # Add more directories here
}

# Process each directory; start by creating an empty list to hold the sentences of each corpus
for corpus_directory, alias in corpus_directories.items():
    sentences = []
    # Read in the text files of the corpus and tokenize them into sentences
    for filename in os.listdir(corpus_directory):
        if filename.endswith('.txt'):
            with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                sentences += nltk.sent_tokenize(text)
    # Filter the sentences for those with words starting with "re-"
    sentences_with_re = [sentence for sentence in sentences if any(word.lower().startswith("re-") for word in nltk.word_tokenize(sentence))]
    # Calculate the aggregate sentiment of the filtered sentences
    results = calculate_aggregate_sentiment(sentences_with_re)
    # Print the results for the current directory
    print(f"{alias}")
    for key, value in results.items():
        print(f"{key}: {value}")
    print()
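For reference, VADER’s polarity_scores returns per-sentence 'pos', 'neg', 'neu', and 'compound' values, which the function above averages. A minimal sketch of the raw output on an illustrative line (adapted from the Hardy excerpt above, not a corpus result):

# Minimal sketch: VADER's raw per-sentence output (illustrative input)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("Too mocking to Love's re-expression was Time's repartee!"))
# a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}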
Are “Re-” Words Used Positively or Negatively?
AC Swinburne
Total Sentences Analyzed: 15
Average Positive Score: 0.12393333333333335
Average Negative Score: 0.152
Average Neutral Score: 0.7242666666666665
Overall Sentiment: Negative
Thomas Hardy
Total Sentences Analyzed: 21
Average Positive Score: 0.08904761904761904
Average Negative Score: 0.11304761904761904
Average Neutral Score: 0.7979047619047619
Overall Sentiment: Negative
Michael Field
Total Sentences Analyzed: 8
Average Positive Score: 0.15262499999999998
Average Negative Score: 0.034374999999999996
Average Neutral Score: 0.812875
Overall Sentiment: Positive
DG Rossetti
Total Sentences Analyzed: 1
Average Positive Score: 0.082
Average Negative Score: 0.018
Average Neutral Score: 0.9
Overall Sentiment: Positive
Conclusion
Text mining can show:
morphemes are significant micro-elements of poetic style