Text Mining “Re-” in Victorian Poetry

Adam Mazel, Digital Publishing Librarian

Scholarly Communication Department, IUB Libraries

2023-11-11

Methodology

  • “Re-” 🤝 Text Mining
  • Python
    • Keyword Frequency
    • Key Word in Context (KWIC)
    • Term Frequency over Time
    • Sentiment Analysis
  • Exploratory Data Analysis

python programming language icon

python scatterplot of recipe ingredient frequency before / after 1900
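All of the analyses that follow hinge on one tokenizer detail: NLTK’s word_tokenize keeps internal hyphens, so a form like “re-illume” survives as a single token that a simple prefix test can find. A minimal sketch of that detection step (the sample line is illustrative):

Code
# import software libraries
import nltk

# Download NLTK tokenizer data
nltk.download('punkt')

# Hyphenated words are kept whole, so a startswith("re-") test finds them
tokens = nltk.word_tokenize("To revive and re-illume.")
re_words = [t for t in tokens if t.lower().startswith("re-")]
print(re_words)  # expected: ['re-illume']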

Data: Dante Gabriel Rossetti

  • Poems (1870)
  • Poems: A New Edition (1881)
  • Ballads and Sonnets (1881)

self-portrait of DG Rossetti

Self-Portrait, 1861

Data: Algernon Charles Swinburne

  • Atalanta in Calydon (1865)
  • Poems and Ballads (1866)
  • Songs Before Sunrise (1871)
  • Songs of Two Nations (1875)
  • Erechtheus (1876)
  • Poems and Ballads, Second Series (1878)
  • Songs of the Springtides (1880)
  • Studies in Song (1880)
  • The Heptalogia, or the Seven against Sense. A Cap with Seven Bells (1880)
  • Tristram of Lyonesse (1882)
  • A Century of Roundels (1883)
  • A Midsummer Holiday and Other Poems (1884)
  • Poems and Ballads, Third Series (1889)
  • Astrophel and Other Poems (1894)
  • The Tale of Balen (1896)
  • A Channel Passage and Other Poems (1904)

Data: Michael Field

  • Long Ago (1889)
  • Sight and Song (1892)
  • Underneath the Bough (1893)
  • Wild Honey from Various Thyme (1908)
  • Poems of Adoration (1912)
  • Mystic Trees (1913)
  • Whym Chow: Flame of Love (1914)

photo of Katharine Bradley & Edith Cooper

Katharine Bradley & Edith Cooper, aka Michael Field

Data: Thomas Hardy

  • Wessex Poems and Other Verses (1898)
  • Poems of the Past and the Present (1901)
  • Time’s Laughingstocks and Other Verses (1909)
  • Satires of Circumstance (1914)
  • Moments of Vision (1917)
  • Late Lyrics and Earlier with Many Other Verses (1922)
  • Human Shows, Far Phantasies, Songs and Trifles (1925)

headshot of Thomas Hardy

Who Uses “Re-” Words More / Less?

  • Keyword Frequency Analysis
    • What percent of each poet’s corpus is composed of words with the prefix “re-”?
    • Compare “Re-” Word Percentages
Code
# import software libraries
import nltk
import os
import matplotlib.pyplot as plt

# Download NLTK tokenizer data
nltk.download('punkt') 

# Define a function that tokenizes a text and then counts how many of its words start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Specify the directory paths for the two poets' corpora
corpus_directories = {
    'swinburne': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    'hardy': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    'michael field': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    'dg rossetti': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
}

# Initialize dictionary to store the results
percentage_re_words = {}

# create an empty list for each poet/directory
for poet, corpus_directory in corpus_directories.items():
    corpus = []
    
    # Read the text files in the poet's corpus and add them to the apt list above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)

    # Apply the function above to tokenize and count the "re-" words in each text of the poet's corpus, and store the count
    re_word_count = sum(count_re_words(text) for text in corpus)

    # Calculate the corpus word count by taking each text, tokenizing it, counting the number of tokens, and then adding them to create a total word count 
    total_words = sum(len(nltk.word_tokenize(text)) for text in corpus)
    # Calculate the percentage of "re-" words in the poet's corpus by dividing the re- word count by the total word count and multiplying by 100
    percentage_re_words[poet] = (re_word_count / total_words) * 100

# Sort the results from largest to smallest
sorted_results = sorted(percentage_re_words.items(), key=lambda x: x[1], reverse=True)

# Extract poets and percentages for plotting
poets, percentages = zip(*sorted_results)

# Visualize the results in a bar chart
plt.figure(figsize=(8, 6))
plt.bar(poets, percentages, color=['blue', 'orange'])
plt.ylabel('Percentage (%)')
plt.title('Whose Poetry is More Composed of Words that Start with "Re-"?')

# Set the y-axis limit based on the largest percentage
ylim_percentage = max(percentages) * 2  # Adjusted for better visualization
plt.ylim(0, ylim_percentage)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the bar chart with the poet's names angled to avoid overlap
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
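As a quick sanity check of the counting logic, count_re_words can be run on a single made-up line before pointing it at a whole corpus:

Code
# Assumes count_re_words from the block above is in scope
sample = "To revive and re-illume."
print(count_re_words(sample))  # expected: 1 ("revive" lacks the hyphen, so only "re-illume" counts)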

Who Uses “Re-” Words More / Less?

Which “Re-” Words Are Most Frequent?

  • Keyword Frequency Analysis
    • Which “re-” words are used and how often?
Code
# import software libraries / dependencies
import nltk
import os
import re
import matplotlib.pyplot as plt
from collections import Counter
from nltk.stem import SnowballStemmer

# Download NLTK tokenizer data
nltk.download('punkt')  

# Initialize Stemmer for English
stemmer = SnowballStemmer("english")  

# Create a function to process the files of each directory; it starts with an empty list for the directory's (poet's) texts
def process_directory(corpus_directory):   
    corpus = []

    # Get the directory (poet's) name to use as the label
    label = os.path.basename(corpus_directory)

    # Step 1: Read the text of corpus from files and add it to apt (poet's) list, created above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)

    # Step 2: Tokenize each text into individual words
    tokenized_corpus = [nltk.word_tokenize(text) for text in corpus]

    # Step 3: Find words that, when lowercased, start with "re-"; standardize (lower) their case, stem them, and retain them
    stemmed_corpus = []
    for tokens in tokenized_corpus:
        stemmed_tokens = [stemmer.stem(word.lower()) for word in tokens if re.match(r'\b(re-)\w+', word.lower())]
        stemmed_corpus.append(stemmed_tokens)

    # Step 4: Count the frequency of each re- word
    word_counts = Counter(word for tokens in stemmed_corpus for word in tokens)

    # Step 5: Display the most frequent words
    # most_common_re_words = word_counts.most_common(20)  # Set the desired number of top words
    # for word, count in most_common_re_words:
    #    print(f'The poetry of {label[:-5]} uses {word}: {count} times')

    # Step 6: Plot the top 20 most frequent words and counts in a bar chart; use the label to title the plot
    plt.figure(figsize=(10, 5))
    top_words, top_counts = zip(*word_counts.most_common(20))
    plt.bar(top_words, top_counts)
    plt.title(f'Frequency of Stemmed Re- Words in {label[:-5].capitalize()}\'s Poetry')
    plt.xticks(rotation=65)
    plt.show()

# Directories to process
corpus_directories = [
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
    # Add more directories here
]

# Process each directory
for directory in corpus_directories:
    process_directory(directory)
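Because the chart counts stemmed forms, its bar labels show stems rather than surface words. A small illustration of how the Snowball stemmer collapses a few “re-” variants (run it to see the exact stems its rules produce):

Code
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Stemming merges inflected variants, e.g. "regarding" / "regarded" fall together
for word in ["regarding", "regarded", "repeating", "re-illumed"]:
    print(word, "->", stemmer.stem(word))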

Which “Re-” Words Are Most Frequent?

Field’s and Hardy’s “Re-Illume” in Context

  • Re-illume
    • Extremely rare
    • Chiefly poetic
    • 1758 – present
  • Field: 1x
  • Hardy: 2x
    • “Two Rosalinds”: Time’s Laughingstocks and Other Verses (1909)
    • “For Life I had never cared greatly”: Moments of Vision and Miscellaneous Verses (1917)

Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?

  • Key (Re-) Word in Context (KWIC)
  • Use Python to return Hardy’s sentences that contain both a word starting with “re-” and a word starting with “re”
Code
# import libraries
import os # open directories on local machine
import nltk # nlp
import re # reg ex
from nltk.tokenize import sent_tokenize, word_tokenize # break text blob into individual sentences and words

# Define regular expressions
re_pattern = r'\b(?:[Rr]e|[Rr]E)\w+\b'  # Matches "re" in any casing followed by one or more word characters (\w excludes the hyphen, so hyphenated "re-" words do not match)
re_hyphen_pattern = r'\b(?:[Rr]e-|[Rr]E-)\w+\b'  # Matches "re-" in any casing followed by one or more word characters
# \b matches a word boundary
# (?:...) is a non-capturing group covering the casings of "re" / "re-"
# \w+ matches one or more word characters following the prefix

def matches_pattern(word, pattern):
    return re.search(pattern, word, re.IGNORECASE) is not None

# Define function to surround a matched word with asterisks (Markdown bold)
def bold_matched_words(match):
    return f'**{match.group()}**'

# Directory containing text files
text_directory = "/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP"

# Iterate through the text files in the directory: read each file, split its text into sentences and words,
# find sentences that contain both a "re" word and a "re-" word, bold those words with asterisks, and print the sentence
for filename in os.listdir(text_directory):
    if filename.endswith(".txt"):
        with open(os.path.join(text_directory, filename), "r", encoding="utf-8") as file:
            text = file.read()
            sentences = sent_tokenize(text)
            
            for sentence in sentences:
                words = word_tokenize(sentence)
                has_re = any(matches_pattern(word, re_pattern) for word in words)
                has_re_hyphen = any(matches_pattern(word, re_hyphen_pattern) for word in words)
                
                if has_re and has_re_hyphen:
                    sentence_with_bold = re.sub(re_pattern, bold_matched_words, sentence)
                    sentence_with_bold = re.sub(re_hyphen_pattern, bold_matched_words, sentence_with_bold)
                    print(sentence_with_bold)
                    print("\n" * 2)
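A quick check of how the two patterns divide the vocabulary, using two words from the Hardy excerpt on the next slide:

Code
# Assumes matches_pattern, re_pattern, and re_hyphen_pattern from the block above are in scope
for word in ["repartee", "re-expression"]:
    print(word, matches_pattern(word, re_pattern), matches_pattern(word, re_hyphen_pattern))
# expected: "repartee" matches only re_pattern; "re-expression" matches only re_hyphen_pattern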

Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?

Within there
   Too mocking to Love’s **re-expression**
      Was Time’s **repartee**!



IX

      “The words, sir?” cried a creature
   Hovering mid the shine and shade as ’twixt the live world and the
   tomb;
   But the well-known numbers needed not for me a text or teacher
      To **revive** and **re-illume**.



THE FLIRT’S TRAGEDY
(17–)


   HERE alone by the logs in my chamber,
      Deserted, decrepit—
   Spent flames limning ghosts on the wainscot
      Of friends I once knew—

   My drama and hers begins weirdly
      Its dumb **re-enactment**,
   Each scene, sigh, and circumstance passing
      In spectral **review**.



I take my holiday then and my **rest**
   Away from the dun life here about me,
            Old hours **re-greeting**
         With the quiet sense that bring they must
         Such throbs as at first, till I house with dust,
         And in the numbness my heartsome zest
            For things that were, be past **repeating**
               When spring comes round.



And so, the rough highway forgetting,
         I pace hill and dale
         **Regarding** the sky,
      **Regarding** the vision on high,
   And thus **re-illumed** have no humour for letting
         My pilgrimage fail.



So I don’t want to linger in this **re-decked** dwelling,
   I feel too uneasy at the contrasts I behold,
   And I make again for Mellstock to **return** here never,
   And **rejoin** the roomy silence, and the mute and manifold
      Souls of old.



. .
   Anon I shall break it and bury its fragments
      Where my grave is to be.”



THE **RE-ENACTMENT**


      BETWEEN the folding sea-downs,
         In the gloom
      Of a wailful wintry nightfall,
         When the boom
   Of the ocean, like a hammering in a hollow tomb,

      Throbbed up the copse-clothed valley
         From the shore
      To the chamber where I darkled,
         Sunk and sore
   With gray ponderings why my Loved one had not come before

      To salute me in the dwelling
         That of late
      I had hired to waste a while in—
         Vague of date,
   Quaint, and **remote**—wherein I now expectant sate;

      On the solitude, unsignalled,
         Broke a man
      Who, in air as if at home there,
         Seemed to scan
   Every fire-flecked nook of the apartment span by span.


When Are “Re-” Words More / Less Frequent?

  • Time Series / Term Frequency over Time
    • Y axis: re- word percentage of book
    • X axis: publication year of book
Code
# Import software libraries
import nltk # nlp
import os # interact with directories on local machine
import re # reg ex
import matplotlib.pyplot as plt # data visualization
import string # punctuation removal

# Download NLTK tokenizer data
nltk.download('punkt')  

# Define a function that takes each text, tokenizes it into individual words, and counts the words that, when lowercased, start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Define a function to remove all punctuation except hyphens
def remove_punctuation_except_hyphens(text):
    translator = str.maketrans('', '', string.punctuation.replace('-', ''))
    return text.translate(translator)

# Specify the parent directory containing multiple text file directories
parent_directory = '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/'

# Specific directories to process
all_directory_names = [
    'swinburne/swinburne_noBP',
    'hardy/hardy_noBP',
    'field/field_NoBP',
    # 'rossetti_dg/rossetti_dg_NoBP',
]

# Initialize a dictionary to store year-wise percentages
year_percentages = {}

# Process each directory
for directory_name in all_directory_names:
    # Construct the full path to the current directory
    text_files_directory = os.path.join(parent_directory, directory_name)

    # Initialize two dictionaries: one to store year-wise percentages for the current directory and one to store file names for the current directory
    if os.path.isdir(text_files_directory):
        directory_percentages = {}
        file_names = {}

        # Process each text file in the current directory
        for filename in os.listdir(text_files_directory):
            if filename.endswith('.txt'):
                # Extract the year from the filename using a regular expression
                year_match = re.match(r'(\d{4})_', filename)
                if year_match:
                    year = int(year_match.group(1))
                    # read in text of each book of poetry
                    with open(os.path.join(text_files_directory, filename), 'r', encoding='utf-8') as file:
                        text = file.read()
                        # Remove punctuation except hyphens
                        text = remove_punctuation_except_hyphens(text)
                        # tokenize text, count the total number of words, count the number of re- words, normalize counts
                        total_words = len(nltk.word_tokenize(text))
                        re_word_count = count_re_words(text)
                        percentage_re_words = (re_word_count / total_words) * 100
                        
                        # Add to dictionary: key: publication year, value: re- word percentage
                        directory_percentages[year] = percentage_re_words
                        
                        # Extract the text between underscores in the filename
                        file_name_parts = filename.split('_')
                        if len(file_name_parts) > 2:
                            file_name = '_'.join(file_name_parts[1:-1])
                        else:
                            file_name = file_name_parts[1]
                        
                        # Add to dictionary: key: publication year, value: book title extracted from the filename
                        file_names[year] = file_name  # Store the extracted file name

        # Sort the dictionary by keys (years) for the current directory
        sorted_directory_percentages = {year: directory_percentages[year] for year in sorted(directory_percentages)}

        # Store the results for the current directory
        year_percentages[directory_name] = {
            'percentages': sorted_directory_percentages,
            'file_names': file_names  # Store file names for this directory
        }

# Plot the keys (years) and values (percentages) in a line graph for each directory
plt.figure(figsize=(10, 8))

for directory_name, data in year_percentages.items():
    percentages = data['percentages']
    file_names = data['file_names']
    years = list(percentages.keys())
    percentages = list(percentages.values())
    
    # Plot the data points, stripping the parent folder and the "_noBP" suffix to label each poet in the legend
    plt.plot(years, percentages, marker='o', linestyle='-', label=os.path.basename(directory_name)[:-5])
    
    # Add annotations for each data point if the value is greater than 0
    for year, percentage in zip(years, percentages):
        if percentage > 0:
            file_name = file_names[year]
            annotation = f"{file_name}"
            plt.annotate(annotation, (year, percentage), textcoords="offset points", xytext=(0, 10), ha='center')

plt.ylim(0, 0.0525)
plt.xlabel('Year')
plt.ylabel('Percentage of Words Starting with "Re-"')
plt.title('When Are "Re-" Words Used in Field\'s, Hardy\'s, and Swinburne\'s Poetry?')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()  # Show legend indicating directory names

# Display the line graph
plt.tight_layout()
plt.show()
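The year extraction above assumes filenames of the form YYYY_title_suffix.txt. A minimal check of that convention on a hypothetical filename:

Code
import re

# A hypothetical filename following the assumed "YYYY_title_suffix.txt" convention
filename = "1898_wessex-poems_noBP.txt"

# The leading four digits become the x-axis value
year_match = re.match(r'(\d{4})_', filename)
print(int(year_match.group(1)))  # 1898

# The text between the first and last underscores becomes the annotation label
file_name_parts = filename.split('_')
print('_'.join(file_name_parts[1:-1]))  # wessex-poems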

When Are “Re-” Words More / Less Frequent?

Are “Re-” Words Used Positively or Negatively?

  • Sentiment Analysis
    • Determines a text’s emotional tone (positive / negative / neutral)
Code
# import software libraries
import nltk #nlp
from nltk.sentiment.vader import SentimentIntensityAnalyzer # VADER sentiment analyzer
import os # enable engagement with directories and files on local machine

# Download the VADER lexicon and tokenizer data
nltk.download('vader_lexicon')
nltk.download('punkt')

# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to calculate aggregate sentiment (positive / neutral / negative) for a collection of sentences
# start counts at zero
def calculate_aggregate_sentiment(sentences):
    positive_score = 0
    negative_score = 0
    neutral_score = 0
    total_sentences = 0

# For each sentence, calculate its VADER polarity scores, add the positive / negative / neutral components to the running totals, and keep a running count of the number of sentences analyzed
    for sentence in sentences:
        sentiment = sia.polarity_scores(sentence)
        positive_score += sentiment['pos']
        negative_score += sentiment['neg']
        neutral_score += sentiment['neu']
        total_sentences += 1
# normalize scores by dividing polarity score by number of sentences
    if total_sentences > 0:
        avg_positive_score = positive_score / total_sentences
        avg_negative_score = negative_score / total_sentences
        avg_neutral_score = neutral_score / total_sentences
# determine aggregate sentiment
        if avg_positive_score > avg_negative_score:
            overall_sentiment = "Positive"
        elif avg_positive_score < avg_negative_score:
            overall_sentiment = "Negative"
        else:
            overall_sentiment = "Neutral"

        return {
            "Total Sentences Analyzed": total_sentences,
            "Average Positive Score": avg_positive_score,
            "Average Negative Score": avg_negative_score,
            "Average Neutral Score": avg_neutral_score,
            "Overall Sentiment": overall_sentiment,
        }
    else:
        return {"Note": "No sentences with 're-' words found for analysis."}

# Specify the directories you want to process with aliases
corpus_directories = {
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP': "AC Swinburne",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP': "Thomas Hardy",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP': "Michael Field",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP': "DG Rossetti",
    # Add more directories here
}

# Process each directory; start by creating an empty list to contain sentences for each directory / corpus
for corpus_directory, alias in corpus_directories.items():
    sentences = []

    # Read in the text files in the corpus and tokenize them into sentences
    for filename in os.listdir(corpus_directory):
        if filename.endswith('.txt'):
            with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                sentences += nltk.sent_tokenize(text)

    # Filter sentences with words starting with "re-"
    sentences_with_re = [sentence for sentence in sentences if any(word.lower().startswith("re-") for word in nltk.word_tokenize(sentence))]

    # Calculate aggregate sentiment for the filtered sentences
    results = calculate_aggregate_sentiment(sentences_with_re)

    # Print results for the current directory
    print(f"{alias}")
    for key, value in results.items():
        print(f"{key}: {value}")
    print()
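For a sense of what VADER reports per sentence before aggregation, the analyzer can be run on one made-up line; polarity_scores returns 'neg', 'neu', and 'pos' proportions plus a 'compound' score between -1 and 1:

Code
# Assumes sia (the SentimentIntensityAnalyzer) from the block above is in scope
print(sia.polarity_scores("To revive and re-illume."))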

Are “Re-” Words Used Positively or Negatively?

AC Swinburne
Total Sentences Analyzed: 15
Average Positive Score: 0.12393333333333335
Average Negative Score: 0.152
Average Neutral Score: 0.7242666666666665
Overall Sentiment: Negative

Thomas Hardy
Total Sentences Analyzed: 21
Average Positive Score: 0.08904761904761904
Average Negative Score: 0.11304761904761904
Average Neutral Score: 0.7979047619047619
Overall Sentiment: Negative

Michael Field
Total Sentences Analyzed: 8
Average Positive Score: 0.15262499999999998
Average Negative Score: 0.034374999999999996
Average Neutral Score: 0.812875
Overall Sentiment: Positive

DG Rossetti
Total Sentences Analyzed: 1
Average Positive Score: 0.082
Average Negative Score: 0.018
Average Neutral Score: 0.9
Overall Sentiment: Positive

Conclusion

  • Text mining can show:
    • that morphemes are significant micro-elements of poetic style
    • how a poem’s language is shaped by form and genre
    • how a word’s history can shape its meaning
  • Distant Reading 🤝 Close Reading