Text Mining “Re-” in Victorian Poetry

Adam Mazel, Digital Publishing Librarian

Scholarly Communication Department, IUB Libraries

2023-11-11

Methodology

  • “Re-” 🤝 Text Mining
  • Python
    • Keyword Frequency
    • Key Word in Context (KWIC)
    • Term Frequency over Time
    • Sentiment Analysis
  • Exploratory Data Analysis

python programming language icon

python scatterplot of recipe ingredient frequency before / after 1900
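All of the analyses that follow hinge on one tokenizer detail: NLTK’s word_tokenize keeps internal hyphens, so a form like “re-illume” survives as a single token that a simple prefix test can find. A minimal sketch of that detection step (the sample line is illustrative):

Code
# import software libraries
import nltk

# Download NLTK tokenizer data
nltk.download('punkt')

# Hyphenated words are kept whole, so a startswith("re-") test finds them
tokens = nltk.word_tokenize("To revive and re-illume.")
re_words = [t for t in tokens if t.lower().startswith("re-")]
print(re_words)  # expected: ['re-illume']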

Data: Dante Gabriel Rossetti

  • Poems (1870)
  • Poems: A New Edition (1881)
  • Ballads and Sonnets (1881)

self-portrait of DG Rossetti

Self-Portrait, 1861

Data: Algernon Charles Swinburne

  • Atalanta in Calydon (1865)
  • Poems and Ballads (1866)
  • Songs Before Sunrise (1871)
  • Songs of Two Nations (1875)
  • Erechtheus (1876)
  • Poems and Ballads, Second Series (1878)
  • Songs of the Springtides (1880)
  • Studies in Song (1880)
  • The Heptalogia, or the Seven against Sense. A Cap with Seven Bells (1880)
  • Tristram of Lyonesse (1882)
  • A Century of Roundels (1883)
  • A Midsummer Holiday and Other Poems (1884)
  • Poems and Ballads, Third Series (1889)
  • Astrophel and Other Poems (1894)
  • The Tale of Balen (1896)
  • A Channel Passage and Other Poems (1904)

Data: Michael Field

  • Long Ago (1889)
  • Sight and Song (1892)
  • Underneath the Bough (1893)
  • Wild Honey from Various Thyme (1908)
  • Poems of Adoration (1912)
  • Mystic Trees (1913)
  • Whym Chow: Flame of Love (1914)

photo of Katharine Bradley & Edith Cooper

Katharine Bradley & Edith Cooper, aka Michael Field

Data: Thomas Hardy

  • Wessex Poems and Other Verses (1898)
  • Poems of the Past and the Present (1901)
  • Time’s Laughingstocks and Other Verses (1909)
  • Satires of Circumstance (1914)
  • Moments of Vision (1917)
  • Late Lyrics and Earlier with Many Other Verses (1922)
  • Human Shows, Far Phantasies, Songs and Trifles (1925)

headshot of Thomas Hardy

Who Uses “Re-” Words More / Less?

  • Keyword Frequency Analysis
    • What percent of each poet’s corpus is composed of words with the prefix “re-”?
    • Compare “Re-” Word Percentages
Code
# import software libraries
import nltk
import os
import matplotlib.pyplot as plt

# Download NLTK tokenizer data
nltk.download('punkt') 

# Define a function that tokenizes a text and then counts how many of its words start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Specify the directory paths for the two poets' corpora
corpus_directories = {
    'swinburne': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    'hardy': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    'michael field': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    'dg rossetti': '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
}

# Initialize dictionary to store the results
percentage_re_words = {}

# create an empty list for each poet/directory
for poet, corpus_directory in corpus_directories.items():
    corpus = []
    
    # Read the text files in the poet's corpus and add them to the apt list above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)

    # Apply the function above to tokenize and count the "re-" words in each text of the poet's corpus, and store the count
    re_word_count = sum(count_re_words(text) for text in corpus)

    # Calculate the corpus word count by taking each text, tokenizing it, counting the number of tokens, and then adding them to create a total word count 
    total_words = sum(len(nltk.word_tokenize(text)) for text in corpus)
    # Calculate the percentage of "re-" words in the poet's corpus by dividing the re- word count by the total word count and multiplying by 100
    percentage_re_words[poet] = (re_word_count / total_words) * 100

# Sort the results from largest to smallest
sorted_results = sorted(percentage_re_words.items(), key=lambda x: x[1], reverse=True)

# Extract poets and percentages for plotting
poets, percentages = zip(*sorted_results)

# Visualize the results in a bar chart
plt.figure(figsize=(8, 6))
plt.bar(poets, percentages, color=['blue', 'orange'])
plt.ylabel('Percentage (%)')
plt.title('Whose Poetry is More Composed of Words that Start with "Re-"?')

# Set the y-axis limit based on the largest percentage
ylim_percentage = max(percentages) * 2  # Adjusted for better visualization
plt.ylim(0, ylim_percentage)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the bar chart with the poet's names angled to avoid overlap
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
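As a quick sanity check of the counting logic, count_re_words can be run on a single made-up line before pointing it at a whole corpus:

Code
# Assumes count_re_words from the block above is in scope
sample = "To revive and re-illume."
print(count_re_words(sample))  # expected: 1 ("revive" lacks the hyphen, so only "re-illume" counts)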

Who Uses “Re-” Words More / Less?

Which “Re-” Words Are Most Frequent?

  • Keyword Frequency Analysis
    • Which “re-” words are used and how often?
Code
# import software libraries / dependencies
import nltk
import os
import re
import matplotlib.pyplot as plt
from collections import Counter
from nltk.stem import SnowballStemmer

# Download NLTK tokenizer data
nltk.download('punkt')  

# Initialize Stemmer for English
stemmer = SnowballStemmer("english")  

# Create a function to process the files of each directory; it starts with an empty list for the directory's (poet's) texts
def process_directory(corpus_directory):   
    corpus = []

    # Get the directory (poet's) name to use as the label
    label = os.path.basename(corpus_directory)

    # Step 1: Read the text of corpus from files and add it to apt (poet's) list, created above
    for filename in os.listdir(corpus_directory):
        with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            corpus.append(text)

    # Step 2: Tokenize each text into individual words
    tokenized_corpus = [nltk.word_tokenize(text) for text in corpus]

    # Step 3: Find words that, when lowercased, start with "re-"; standardize (lower) their case, stem them, and retain them
    stemmed_corpus = []
    for tokens in tokenized_corpus:
        stemmed_tokens = [stemmer.stem(word.lower()) for word in tokens if re.match(r'\b(re-)\w+', word.lower())]
        stemmed_corpus.append(stemmed_tokens)

    # Step 4: Count the frequency of each re- word
    word_counts = Counter(word for tokens in stemmed_corpus for word in tokens)

    # Step 5: Display the most frequent words
    # most_common_re_words = word_counts.most_common(20)  # Set the desired number of top words
    # for word, count in most_common_re_words:
    #    print(f'The poetry of {label[:-5]} uses {word}: {count} times')

    # Step 6: Plot the top 20 most frequent words and counts in a bar chart; use the label to title the plot
    plt.figure(figsize=(10, 5))
    top_words, top_counts = zip(*word_counts.most_common(20))
    plt.bar(top_words, top_counts)
    plt.title(f'Frequency of Stemmed Re- Words in {label[:-5].capitalize()}\'s Poetry')
    plt.xticks(rotation=65)
    plt.show()

# Directories to process
corpus_directories = [
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP',
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP',
    # Add more directories here
]

# Process each directory
for directory in corpus_directories:
    process_directory(directory)
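Because the chart counts stemmed forms, its bar labels show stems rather than surface words. A small illustration of how the Snowball stemmer collapses a few “re-” variants (run it to see the exact stems its rules produce):

Code
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Stemming merges inflected variants, e.g. "regarding" / "regarded" fall together
for word in ["regarding", "regarded", "repeating", "re-illumed"]:
    print(word, "->", stemmer.stem(word))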

Which “Re-” Words Are Most Frequent?

Field’s and Hardy’s “Re-Illume” in Context

  • Re-illume
    • Extremely rare
    • Chiefly poetic
    • 1758 – present
  • Field: 1x
  • Hardy: 2x
    • “Two Rosalinds”: Time’s Laughingstocks and Other Verses (1909)
    • “For Life I had never cared greatly”: Moments of Vision and Miscellaneous Verses (1917)

Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?

  • Key (Re-) Word in Context (KWIC)
  • Use Python to return Hardy’s sentences that contain both a word starting with “re-” and a word starting with “re”
Code
# import libraries
import os # open directories on local machine
import nltk # nlp
import re # reg ex
from nltk.tokenize import sent_tokenize, word_tokenize # break text blob into individual sentences and words

# Define regular expressions
re_pattern = r'\b(?:[Rr]e|[Rr]E)\w+\b'  # Matches "re" in any casing followed by one or more word characters (\w excludes the hyphen, so hyphenated "re-" words do not match)
re_hyphen_pattern = r'\b(?:[Rr]e-|[Rr]E-)\w+\b'  # Matches "re-" in any casing followed by one or more word characters
# \b matches a word boundary
# (?:...) is a non-capturing group covering the casings of "re" / "re-"
# \w+ matches one or more word characters following the prefix

def matches_pattern(word, pattern):
    return re.search(pattern, word, re.IGNORECASE) is not None

# Define function to surround a matched word with asterisks (Markdown bold)
def bold_matched_words(match):
    return f'**{match.group()}**'

# Directory containing text files
text_directory = "/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP"

# Iterate through the text files in the directory: read each file, split its text into sentences and words,
# find sentences that contain both a "re" word and a "re-" word, bold those words with asterisks, and print the sentence
for filename in os.listdir(text_directory):
    if filename.endswith(".txt"):
        with open(os.path.join(text_directory, filename), "r", encoding="utf-8") as file:
            text = file.read()
            sentences = sent_tokenize(text)
            
            for sentence in sentences:
                words = word_tokenize(sentence)
                has_re = any(matches_pattern(word, re_pattern) for word in words)
                has_re_hyphen = any(matches_pattern(word, re_hyphen_pattern) for word in words)
                
                if has_re and has_re_hyphen:
                    sentence_with_bold = re.sub(re_pattern, bold_matched_words, sentence)
                    sentence_with_bold = re.sub(re_hyphen_pattern, bold_matched_words, sentence_with_bold)
                    print(sentence_with_bold)
                    print("\n" * 2)
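A quick check of how the two patterns divide the vocabulary, using two words from the Hardy excerpt on the next slide:

Code
# Assumes matches_pattern, re_pattern, and re_hyphen_pattern from the block above are in scope
for word in ["repartee", "re-expression"]:
    print(word, matches_pattern(word, re_pattern), matches_pattern(word, re_hyphen_pattern))
# expected: "repartee" matches only re_pattern; "re-expression" matches only re_hyphen_pattern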

Do “Re-” Words and “Re” Words Co-occur in Hardy’s Poetry?

Within there
   Too mocking to Love’s **re-expression**
      Was Time’s **repartee**!



IX

      “The words, sir?” cried a creature
   Hovering mid the shine and shade as ’twixt the live world and the
   tomb;
   But the well-known numbers needed not for me a text or teacher
      To **revive** and **re-illume**.



THE FLIRT’S TRAGEDY
(17–)


   HERE alone by the logs in my chamber,
      Deserted, decrepit—
   Spent flames limning ghosts on the wainscot
      Of friends I once knew—

   My drama and hers begins weirdly
      Its dumb **re-enactment**,
   Each scene, sigh, and circumstance passing
      In spectral **review**.



I take my holiday then and my **rest**
   Away from the dun life here about me,
            Old hours **re-greeting**
         With the quiet sense that bring they must
         Such throbs as at first, till I house with dust,
         And in the numbness my heartsome zest
            For things that were, be past **repeating**
               When spring comes round.



And so, the rough highway forgetting,
         I pace hill and dale
         **Regarding** the sky,
      **Regarding** the vision on high,
   And thus **re-illumed** have no humour for letting
         My pilgrimage fail.



So I don’t want to linger in this **re-decked** dwelling,
   I feel too uneasy at the contrasts I behold,
   And I make again for Mellstock to **return** here never,
   And **rejoin** the roomy silence, and the mute and manifold
      Souls of old.



. .
   Anon I shall break it and bury its fragments
      Where my grave is to be.”



THE **RE-ENACTMENT**


      BETWEEN the folding sea-downs,
         In the gloom
      Of a wailful wintry nightfall,
         When the boom
   Of the ocean, like a hammering in a hollow tomb,

      Throbbed up the copse-clothed valley
         From the shore
      To the chamber where I darkled,
         Sunk and sore
   With gray ponderings why my Loved one had not come before

      To salute me in the dwelling
         That of late
      I had hired to waste a while in—
         Vague of date,
   Quaint, and **remote**—wherein I now expectant sate;

      On the solitude, unsignalled,
         Broke a man
      Who, in air as if at home there,
         Seemed to scan
   Every fire-flecked nook of the apartment span by span.


When Are “Re-” Words More / Less Frequent?

  • Time Series / Term Frequency over Time
    • Y axis: re- word percentage of book
    • X axis: publication year of book
Code
# Import software libraries
import nltk # nlp
import os # interact with directories on local machine
import re # reg ex
import matplotlib.pyplot as plt # data visualization
import string # punctuation removal

# Download NLTK tokenizer data
nltk.download('punkt')  

# Define a function that takes each text, tokenizes it into individual words, and counts the words that, when lowercased, start with "re-"
def count_re_words(text):
    words = nltk.word_tokenize(text)
    return sum(1 for word in words if word.lower().startswith("re-"))

# Define a function to remove all punctuation except hyphens
def remove_punctuation_except_hyphens(text):
    translator = str.maketrans('', '', string.punctuation.replace('-', ''))
    return text.translate(translator)

# Specify the parent directory containing multiple text file directories
parent_directory = '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/'

# Specific directories to process
all_directory_names = [
    'swinburne/swinburne_noBP',
    'hardy/hardy_noBP',
    'field/field_NoBP',
    # 'rossetti_dg/rossetti_dg_NoBP',
]

# Initialize a dictionary to store year-wise percentages
year_percentages = {}

# Process each directory
for directory_name in all_directory_names:
    # Construct the full path to the current directory
    text_files_directory = os.path.join(parent_directory, directory_name)

    # Initialize two dictionaries: one to store year-wise percentages for the current directory and one to store file names for the current directory
    if os.path.isdir(text_files_directory):
        directory_percentages = {}
        file_names = {}

        # Process each text file in the current directory
        for filename in os.listdir(text_files_directory):
            if filename.endswith('.txt'):
                # Extract the year from the filename using a regular expression
                year_match = re.match(r'(\d{4})_', filename)
                if year_match:
                    year = int(year_match.group(1))
                    # read in text of each book of poetry
                    with open(os.path.join(text_files_directory, filename), 'r', encoding='utf-8') as file:
                        text = file.read()
                        # Remove punctuation except hyphens
                        text = remove_punctuation_except_hyphens(text)
                        # tokenize text, count the total number of words, count the number of re- words, normalize counts
                        total_words = len(nltk.word_tokenize(text))
                        re_word_count = count_re_words(text)
                        percentage_re_words = (re_word_count / total_words) * 100
                        
                        # Add to dictionary: key: publication year, value: re- word percentage
                        directory_percentages[year] = percentage_re_words
                        
                        # Extract the text between underscores in the filename
                        file_name_parts = filename.split('_')
                        if len(file_name_parts) > 2:
                            file_name = '_'.join(file_name_parts[1:-1])
                        else:
                            file_name = file_name_parts[1]
                        
                        # Add to dictionary: key: publication year, value: book title extracted from the filename
                        file_names[year] = file_name  # Store the extracted file name

        # Sort the dictionary by keys (years) for the current directory
        sorted_directory_percentages = {year: directory_percentages[year] for year in sorted(directory_percentages)}

        # Store the results for the current directory
        year_percentages[directory_name] = {
            'percentages': sorted_directory_percentages,
            'file_names': file_names  # Store file names for this directory
        }

# Plot the keys (years) and values (percentages) in a line graph for each directory
plt.figure(figsize=(10, 8))

for directory_name, data in year_percentages.items():
    percentages = data['percentages']
    file_names = data['file_names']
    years = list(percentages.keys())
    percentages = list(percentages.values())
    
    # Plot the data points, stripping the parent folder and the "_noBP" suffix to label each poet in the legend
    plt.plot(years, percentages, marker='o', linestyle='-', label=os.path.basename(directory_name)[:-5])
    
    # Add annotations for each data point if the value is greater than 0
    for year, percentage in zip(years, percentages):
        if percentage > 0:
            file_name = file_names[year]
            annotation = f"{file_name}"
            plt.annotate(annotation, (year, percentage), textcoords="offset points", xytext=(0, 10), ha='center')

plt.ylim(0, 0.0525)
plt.xlabel('Year')
plt.ylabel('Percentage of Words Starting with "Re-"')
plt.title('When Are "Re-" Words Used in Field\'s, Hardy\'s, and Swinburne\'s Poetry?')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()  # Show legend indicating directory names

# Display the line graph
plt.tight_layout()
plt.show()
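The year extraction above assumes filenames of the form YYYY_title_suffix.txt. A minimal check of that convention on a hypothetical filename:

Code
import re

# A hypothetical filename following the assumed "YYYY_title_suffix.txt" convention
filename = "1898_wessex-poems_noBP.txt"

# The leading four digits become the x-axis value
year_match = re.match(r'(\d{4})_', filename)
print(int(year_match.group(1)))  # 1898

# The text between the first and last underscores becomes the annotation label
file_name_parts = filename.split('_')
print('_'.join(file_name_parts[1:-1]))  # wessex-poems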

When Are “Re-” Words More / Less Frequent?

Are “Re-” Words Used Positively or Negatively?

  • Sentiment Analysis
    • Determines a text’s emotional tone (positive / negative / neutral)
Code
# import software libraries
import nltk #nlp
from nltk.sentiment.vader import SentimentIntensityAnalyzer # VADER sentiment analyzer
import os # enable engagement with directories and files on local machine

# Download the VADER lexicon and tokenizer data
nltk.download('vader_lexicon')
nltk.download('punkt')

# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to calculate aggregate sentiment (positive / neutral / negative) for a collection of sentences
# start counts at zero
def calculate_aggregate_sentiment(sentences):
    positive_score = 0
    negative_score = 0
    neutral_score = 0
    total_sentences = 0

# For each sentence, calculate its VADER polarity scores, add the positive / negative / neutral components to the running totals, and keep a running count of the number of sentences analyzed
    for sentence in sentences:
        sentiment = sia.polarity_scores(sentence)
        positive_score += sentiment['pos']
        negative_score += sentiment['neg']
        neutral_score += sentiment['neu']
        total_sentences += 1
# normalize scores by dividing polarity score by number of sentences
    if total_sentences > 0:
        avg_positive_score = positive_score / total_sentences
        avg_negative_score = negative_score / total_sentences
        avg_neutral_score = neutral_score / total_sentences
# determine aggregate sentiment
        if avg_positive_score > avg_negative_score:
            overall_sentiment = "Positive"
        elif avg_positive_score < avg_negative_score:
            overall_sentiment = "Negative"
        else:
            overall_sentiment = "Neutral"

        return {
            "Total Sentences Analyzed": total_sentences,
            "Average Positive Score": avg_positive_score,
            "Average Negative Score": avg_negative_score,
            "Average Neutral Score": avg_neutral_score,
            "Overall Sentiment": overall_sentiment,
        }
    else:
        return {"Note": "No sentences with 're-' words found for analysis."}

# Specify the directories you want to process with aliases
corpus_directories = {
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/swinburne/swinburne_noBP': "AC Swinburne",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/hardy/hardy_noBP': "Thomas Hardy",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/field/field_NoBP': "Michael Field",
    '/home/adammazel/Documents/Digital_Scholarship/re-victorian-poetry/cta/rossetti_dg/rossetti_dg_NoBP': "DG Rossetti",
    # Add more directories here
}

# Process each directory; start by creating an empty list to contain sentences for each directory / corpus
for corpus_directory, alias in corpus_directories.items():
    sentences = []

    # Read in the text files in the corpus and tokenize them into sentences
    for filename in os.listdir(corpus_directory):
        if filename.endswith('.txt'):
            with open(os.path.join(corpus_directory, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                sentences += nltk.sent_tokenize(text)

    # Filter sentences with words starting with "re-"
    sentences_with_re = [sentence for sentence in sentences if any(word.lower().startswith("re-") for word in nltk.word_tokenize(sentence))]

    # Calculate aggregate sentiment for the filtered sentences
    results = calculate_aggregate_sentiment(sentences_with_re)

    # Print results for the current directory
    print(f"{alias}")
    for key, value in results.items():
        print(f"{key}: {value}")
    print()
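For a sense of what VADER reports per sentence before aggregation, the analyzer can be run on one made-up line; polarity_scores returns 'neg', 'neu', and 'pos' proportions plus a 'compound' score between -1 and 1:

Code
# Assumes sia (the SentimentIntensityAnalyzer) from the block above is in scope
print(sia.polarity_scores("To revive and re-illume."))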

Are “Re-” Words Used Positively or Negatively?

AC Swinburne
Total Sentences Analyzed: 15
Average Positive Score: 0.12393333333333335
Average Negative Score: 0.152
Average Neutral Score: 0.7242666666666665
Overall Sentiment: Negative

Thomas Hardy
Total Sentences Analyzed: 21
Average Positive Score: 0.08904761904761904
Average Negative Score: 0.11304761904761904
Average Neutral Score: 0.7979047619047619
Overall Sentiment: Negative

Michael Field
Total Sentences Analyzed: 8
Average Positive Score: 0.15262499999999998
Average Negative Score: 0.034374999999999996
Average Neutral Score: 0.812875
Overall Sentiment: Positive

DG Rossetti
Total Sentences Analyzed: 1
Average Positive Score: 0.082
Average Negative Score: 0.018
Average Neutral Score: 0.9
Overall Sentiment: Positive

Conclusion

  • Text mining can show:
    • that morphemes are significant micro-elements of poetic style
    • how a poem’s language is shaped by form and genre
    • how a word’s history can shape its meaning
  • Distant Reading 🤝 Close Reading