Natural Language Processing (NLP) in textual genres analysis

Practical project using NLTK

Bruno Carloto
Feb 22, 2022

Natural Language Processing (NLP) is a subfield of Artificial Intelligence concerned with communication between humans and machines. This subfield plays an ever larger role in the Digital Age, so we increasingly interact with machines that understand our natural language. Examples include Google Translate, Alexa and other virtual assistants, spell checkers, spam classifiers, and smart summarizers. Many applications are already implemented, and many more are under development.

One of these possibilities is the application of NLP to the analysis of textual genres. We can answer many questions about the characteristics of a textual genre, such as its most common words, its average length, and how often its words are repeated on average. Analysis is not the only possibility: we can also build artificial intelligence to classify texts into their respective genres. This is useful for search engines; for smart educational games, in which an artificial intelligence at the back end of an app manages students' learning and suggests texts for them to classify; for classification itself; and so on.

Considering these useful possibilities, I developed this practical project, which analyzes four texts from different textual genres, more precisely the textual subgenres fiction, mystery, religion, and romance. The analysis aims to answer the following seven questions:

1 — What is the size of the text?

2 — How many distinct words are there in the text?

3 — How many times does a word appear on average in the text?

4 — In terms of percentage, how many times does a word appear on average in the text?

5 — Which text has on average the most frequent words and the largest size?

6 — Which text has on average the least frequent words and the smallest size?

7 — What is the context in which three of the most frequent words are?

All of these questions are answered throughout the article, so you can verify each answer as you read.

Main sections of the content

  • Introduction
  • First step: Some statistics and the sizes of the texts
  • Second step: Frequency of words and some statistics
  • Third step: Knowing some words in context
  • Final considerations

#Practice

1 First step: Some statistics and the size of the texts

I initially import the base libraries and specific methods and display the textual genre categories.

#Importing base libraries
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
#Importing specific methods
from nltk.corpus import brown
from wordcloud import WordCloud
#Downloading the Brown corpus, in case it is not already available
nltk.download('brown')
#Displaying the textual genre categories
print(brown.categories())
print('\n Number of categories: {}'.format(len(brown.categories())))

Then I store the selected textual genres in variables and work on the texts through them. As each text is stored as a list of word strings, I join the strings to create a single string for each variable.
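
The assignment itself is not shown in the article; a minimal sketch of how these variables can be created, assuming the Brown corpus word lists displayed above:

#Storing each genre's word list in a variable (assumed step; the article only displays the variables)
fiction = brown.words(categories='fiction')
mystery = brown.words(categories='mystery')
religion = brown.words(categories='religion')
romance = brown.words(categories='romance')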

#Displaying the text in the variable
fiction
#Joining the tokens
joined_fiction = ' '.join(fiction)
#Displaying the result
joined_fiction[:500]

The same was done for the other variables, using the join() function. To keep this article from becoming too long, I do not show the process for each variable.

Since I have several goals for these texts, I need to treat and process them in different ways. The first treatment concerns some statistics about the texts. For that, I create variables that receive the respective texts without punctuation, so that only the words remain.

#Transforming the text and removing punctuation - FICTION
fiction_without_punctuation = []
for word in fiction:
    no_punctuation = re.sub(r"[!@#$%¨&*()_+={},.;:/?]", "", word) #Removing punctuation characters
    lowercase_word = no_punctuation.lower()
    fiction_without_punctuation.append(lowercase_word)
#Printing the result
print(fiction_without_punctuation[:10])
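
Note that tokens consisting only of punctuation become empty strings under this substitution; they are filtered out later, together with the stopwords. If you prefer to drop them immediately, a one-line sketch (the variable name is hypothetical, not from the original):

#Optional: dropping tokens that became empty strings after the punctuation removal
fiction_words_only = [w for w in fiction_without_punctuation if w]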

The same was done for each text. Below I show the results, answering the first two questions.

1.1 Answering the first question: What is the size of the text?

#Searching for the length of the texts
print('Length of the text on fiction:', len(fiction_without_punctuation))
print('Length of the text on mystery:', len(mystery_without_punctuation))
print('Length of the text on religion:', len(religion_without_punctuation))
print('Length of the text on romance:', len(romance_without_punctuation))

1.2 Answering the second question: How many distinct words are there in the text?

#Searching for the number of distinct words
print('Number of distinct words of the fiction text:', len(set(fiction_without_punctuation)))
print('Number of distinct words of the mystery text:', len(set(mystery_without_punctuation)))
print('Number of distinct words of the religion text:', len(set(religion_without_punctuation)))
print('Number of distinct words of the romance text:', len(set(romance_without_punctuation)))

The largest text, counting repeated words, is the romantic text. The largest text, counting only distinct words, is the fictional text. To determine which text actually repeats its words most often, I create two measures: a lexical diversity function and a lexical diversity rate function. I apply them below.

1.3 Answering the third question: How many times does a word appear on average in the text?

#Defining a lexical diversity function
def LexicalDiversity(text):
    return len(text)/len(set(text))
#Printing lexical diversity
print('How many times a word appears on average in the fiction text:', np.round(LexicalDiversity(fiction_without_punctuation)))
print('How many times a word appears on average in the mystery text:', np.round(LexicalDiversity(mystery_without_punctuation)))
print('How many times a word appears on average in the religion text:', np.round(LexicalDiversity(religion_without_punctuation)))
print('How many times a word appears on average in the romance text:', np.round(LexicalDiversity(romance_without_punctuation)))

1.4 Answering the fourth question: In terms of percentage, how many times does a word appear on average in the text?

#Defining a lexical diversity rate function
def LexicalDiversityRate(text):
    return (len(set(text))/len(text)) * 100
#Printing lexical diversity rate
print('In terms of percentage, a word appears on average in the fiction text: {}%'.format(np.round(LexicalDiversityRate(fiction_without_punctuation), 2)))
print('In terms of percentage, a word appears on average in the mystery text: {}%'.format(np.round(LexicalDiversityRate(mystery_without_punctuation), 2)))
print('In terms of percentage, a word appears on average in the religion text: {}%'.format(np.round(LexicalDiversityRate(religion_without_punctuation), 2)))
print('In terms of percentage, a word appears on average in the romance text: {}%'.format(np.round(LexicalDiversityRate(romance_without_punctuation), 2)))

In the romantic text, the words are repeated the most often; in the religious text, the least.

To better understand this and to answer the fifth and sixth questions, I generate the chart below.

1.5 Answering the fifth and sixth questions: Which text has on average the most frequent words and the largest size and which text has on average the least frequent words and the smallest size?

#Storing the lexical diversity of each text
X_number_word_on_average_by_text = [
    np.round(LexicalDiversity(fiction_without_punctuation)),
    np.round(LexicalDiversity(mystery_without_punctuation)),
    np.round(LexicalDiversity(religion_without_punctuation)),
    np.round(LexicalDiversity(romance_without_punctuation))
]
#Storing the length of each text
y_length_text = [
    len(fiction_without_punctuation),
    len(mystery_without_punctuation),
    len(religion_without_punctuation),
    len(romance_without_punctuation)
]
#Plotting the relation between word repetition and size of the text
#Defining the size of the figure
plt.figure(figsize=(12,6))
#Adding the data about the texts (invisible markers; the genre names are drawn as annotations below)
plt.plot(X_number_word_on_average_by_text, y_length_text, linestyle='', ms=0)
#Defining a title
plt.title('Relation between word repetition and size of the text', fontsize=15)
#Defining the names of the axes
plt.ylabel('Size of the text')
plt.xlabel('Presence of word on average in the text')
#Plotting the data using the names of the genres
plt.annotate('Fiction', xy=(X_number_word_on_average_by_text[0] - 0.1, y_length_text[0]), fontsize=15, color='b')
plt.annotate('Mystery', xy=(X_number_word_on_average_by_text[1] - 0.1, y_length_text[1]), fontsize=15, color='b')
plt.annotate('Religion', xy=(X_number_word_on_average_by_text[2] - 0.09, y_length_text[2]), fontsize=15, color='b')
plt.annotate('Romance', xy=(X_number_word_on_average_by_text[3] - 0.135, y_length_text[3]), fontsize=15, color='b')
#Showing the plot
plt.show()

Observing the chart, the romantic text is the largest while the religious text is the smallest. With only four samples, however, it is not adequate to conclude that there is a linear relationship between the average word repetition in each text and the size of each text.
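
For illustration only, one could compute a correlation coefficient for these four points, bearing in mind that a statistic derived from n = 4 carries no real evidential weight:

#Illustrative only: Pearson correlation of the four (repetition, size) points; not meaningful evidence with n = 4
print(np.corrcoef(X_number_word_on_average_by_text, y_length_text)[0, 1])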

2 Second step: Frequency of words and some statistics

After that, I work on the texts without stopwords. I start by importing the stopwords.

#Importing stopwords
from nltk.corpus import stopwords
#Downloading the stopword list, in case it is not already available
nltk.download('stopwords')
print(stopwords.words('english'))

Then I write a routine to eliminate the stopwords of each text. I only show how I applied it to the fictional text; the same procedure was applied to the other texts.

#Removing stopwords (plus residual tokens and some very common modal verbs)
fiction_without_stopwords = []
for token in fiction_without_punctuation:
    if token not in stopwords.words('english') and token not in ['', '``', "''", '--', 'would', 'must', 'could']:
        fiction_without_stopwords.append(token)
#Checking that no stopwords remain
if not any(token in stopwords.words('english') for token in fiction_without_stopwords):
    print('There are no stopwords')
else:
    print('There is some stopword')

None of the other texts contains stopwords either.

After that process, I verify the result.

#Result
print(fiction_without_stopwords[:10])

The other variables are as expected.

2.1 Counting frequency of words

In this step, I generate four data frames from the four textual genres, containing the words and their respective frequencies. Remember that I eliminated stopwords and punctuation, so the texts are clean for analysis.

#Frequency of words
freq_dist_fiction = nltk.FreqDist(fiction_without_stopwords)
#Result
freq_dist_fiction

According to the result, the frequency distribution was correctly generated. The same method was applied to the other texts, with equally successful results.
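
A quick way to inspect such a distribution is FreqDist's built-in most_common() method:

#Displaying the ten most frequent tokens directly
print(freq_dist_fiction.most_common(10))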

To generate the data frame, I perform the following program:

#Generating a data frame for freq_dist_fiction
freq_dist_fiction_df = pd.DataFrame({'word': list(freq_dist_fiction.keys()),
                                     'freq': list(freq_dist_fiction.values())}).sort_values('freq', ascending=False)
#Correcting the index
freq_dist_fiction_df.index = range(len(freq_dist_fiction_df))
#Result
freq_dist_fiction_df.head()

The result is as expected.
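
As a design note, the same data frame can be built in a single step from most_common(), which already returns (word, frequency) pairs sorted by descending frequency, making the index correction unnecessary:

#Equivalent one-step construction (alternative to the dict-based version above)
freq_dist_fiction_df = pd.DataFrame(freq_dist_fiction.most_common(), columns=['word', 'freq'])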

After that, I create a new column containing the frequency of each word in its respective text, in terms of percentage.

#Creating a new column - frequency as a percentage of the total token count
freq_dist_fiction_df['freq%'] = round((freq_dist_fiction_df['freq']/freq_dist_fiction_df['freq'].sum()) * 100, 2)
#Result
freq_dist_fiction_df.head(10)

Again, to keep the article from growing excessively large, I sometimes omit the same procedure for the other texts and show only one as a demonstration.

For graphical analysis, I generate four word clouds, one per textual genre. For this, I join the strings of each text, storing them in new variables.

#Joining tokens
fiction_without_stopwords_joined = ' '.join(fiction_without_stopwords)
#Result
fiction_without_stopwords_joined[:500]
#Generating the word cloud object from the fictional text without stopwords
wordcloud = WordCloud().generate(fiction_without_stopwords_joined)
#Defining the shape of the image
plt.figure(figsize=(15,15))
#Plotting the word cloud
plt.imshow(wordcloud)
#Defining the title
plt.title('Wordcloud of the fictional text without stopwords \n', fontsize=25)
#Turning off the axis
plt.axis('off')
#Plotting the image
plt.show()
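
Since the frequencies were already computed, the word cloud can also be fed directly from them, skipping the re-joining step; a sketch using the same freq_dist_fiction object (FreqDist behaves like a dictionary):

#Alternative: building the cloud straight from the frequency distribution
wordcloud = WordCloud().generate_from_frequencies(freq_dist_fiction)

The resulting object can be displayed with the same plt.imshow code as above.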

As we can see, the words “one”, “man”, “said”, “came”, and “back”, among others, stand out in the fictional text. The word cloud represents the respective frequencies of the words, so I can understand it better by also using the frequency data frame.

#Storing the 50 most frequent words of the fictional text
main_words_fiction_context = list(freq_dist_fiction_df.word[:50])
#Printing the 50 most frequent words of the fictional text
print('Context of the fictional text: \n \n', main_words_fiction_context)

What does the word cloud tell me? First, the text has a relationship with the past, evidenced by the use of words such as “time”, “day”, “said”, “came”, and “back”. There is an intense presence of a man; from the word cloud alone, we cannot know whether the word “man” refers to just one man or to more than one. Some settings stand out in the text, such as room, church, house, home, and night. Thus, thinking about the idea of fiction, we have a text that relates fiction to concepts of church, house, home, night, and day, and to components that look to the past. This suggests that the narrative is in the third person and that the author created a relationship between fiction and church.

From now on, I do not show the procedure for generating the other word clouds or for printing the most frequent words; I just show the results and comment on them.

Now I analyze the mystery text.

Some of the most frequent words of the mystery text are the same as in the fictional text, such as “said”, “one”, “back”, and “man”. The idea of looking to the past does not seem as intense as in the fictional text. The car is frequent in the plot, and the word “around” is highlighted. The word “door” is more frequent than “house” or “home”. In fact, the concepts of car and door may carry more suspense in film culture; the car can also popularly represent an object of suspense, as it can be related to running away or to chasing someone or something.

Now I analyze the text concerning religion.

The lexicon of the text concerning religion, still in terms of the most frequent words, differs from the other texts under analysis. The most frequent words are related to faith, divinity, and the relationship between the spiritual world and man, or between God and man. The text addresses God, Christianity, and church; from these words, we can infer that it approaches the Christian world. It talks about power, death, life, experience, the world, and other concepts that pertain to the Christian world.

Lastly, I cover the romantic text.

In the romantic text, we return to a lexicon similar to that of the fictional and mystery texts. The word “thought” is highlighted. Perhaps, in this romantic text, introspective behavior conveys the emotional element, which is important for a romantic setting; not addressing the sentimental side can make it difficult to develop a romantic atmosphere.

For a better understanding of the texts, we can read the beginning of each one, with stopwords removed.

#Showing the beginning of each text
print('FICTION:\n', fiction_without_stopwords_joined[:500], '\n')
print('MYSTERY:\n', mystery_without_stopwords_joined[:500], '\n')
print('RELIGION:\n', religion_without_stopwords_joined[:500], '\n')
print('ROMANCE:\n', romance_without_stopwords_joined[:500], '\n')

From those beginnings, we may have a clearer idea about the texts.

At this point, I generate a new data frame showing the 50 most frequent words of each text. From it, we can get a better idea of the similarity among them.

#Making a data frame with the 50 most frequent words of each text
main_words_df = pd.DataFrame({'fiction': main_words_fiction_context,
                              'mystery': main_words_mystery_context,
                              'religion': main_words_religion_context,
                              'romance': main_words_romance_context})
#Result
main_words_df.head(20)

The words “said” and “one” are equally important in their respective texts. Observing the other rows, we can see the importance that certain words have for all the texts, such as “old”, “man”/“men”, and “like”. A chart showing the rank position of a word and its frequency in each text can lead us to a better understanding of the importance of the same word for each text.
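
As a complementary, hypothetical check (not part of the original analysis), the overlap between the top-50 lists can be quantified with set intersections:

#Counting how many of the 50 most frequent words each genre shares with fiction (illustrative sketch)
for genre in ['mystery', 'religion', 'romance']:
    shared = set(main_words_df['fiction']) & set(main_words_df[genre])
    print('fiction vs {}: {} shared words'.format(genre, len(shared)))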

To avoid a cluttered chart, I plot only the 10 most frequent words of each text. From this, we can understand the relationship of those 10 words with the different textual genres.

#Defining the size of the figure
plt.figure(figsize=(20,10))
#Drawing invisible points to set the axis ranges
plt.plot(list(np.arange(0, 10, 1)), list(np.arange(0, 400, 40)), linestyle='')
#Plotting the data as words
for i in range(10):
    plt.annotate(freq_dist_fiction_df.word[i],
                 xy=(i, freq_dist_fiction_df.freq[i]),
                 fontsize=20, color='blue')
    plt.annotate(freq_dist_mystery_df.word[i],
                 xy=(i, freq_dist_mystery_df.freq[i]),
                 fontsize=20, color='black')
    plt.annotate(freq_dist_religion_df.word[i],
                 xy=(i, freq_dist_religion_df.freq[i]),
                 fontsize=20, color='orange')
    plt.annotate(freq_dist_romance_df.word[i],
                 xy=(i, freq_dist_romance_df.freq[i]),
                 fontsize=20, color='red')
#Setting the title
plt.title('Dispersion of the 10 most frequent words per text \n', fontsize=25)
#Setting the axes
plt.ylabel('Frequency', fontsize=20)
plt.xlabel('Position', fontsize=20)
#Generating the legend
plt.annotate('Legend: \n', xy=(8.5, 350), color='black', fontsize=20)
plt.annotate('fiction \n', xy=(8.5, 335), color='blue', fontsize=15)
plt.annotate('mystery \n', xy=(8.5, 320), color='black', fontsize=15)
plt.annotate('religion \n', xy=(8.5, 305), color='orange', fontsize=15)
plt.annotate('romance \n', xy=(8.5, 290), color='red', fontsize=15);

In the text concerning religion, “God” is more frequent than “said”; in fact, “said” is not even among the 10 most frequent words. This text may not be a narrative text like the others; it may be a didactic text or something related to teaching or to Christian philosophical thought. The romantic text makes more comparisons, using the word “like”, than the other texts. In the mystery text, the word “back” stands out, reinforcing the idea of going or looking back.

3 Third step: Knowing some words in context

In this step, I show three of the most frequent words present in each text. The selected words are “said”, “one”, and “man”.

I start by analyzing the word “said” in the four texts.

3.1 Said

#Searching for context
#Creating an nltk.Text object
fiction_text = nltk.Text(fiction)
#Displaying concordance lines
fiction_text.concordance('said', width=100)

Observing the first 25 concordance lines, in the fictional text “said” is used to introduce the characters’ speech.

#Searching for context
#Creating an nltk.Text object
mystery_text = nltk.Text(mystery)
#Displaying concordance lines
mystery_text.concordance('said', width=100)

In the mystery text, “said” is used to introduce characters’ speech in the first and third persons.

#Searching for context
#Creating an nltk.Text object
religion_text = nltk.Text(religion)
#Displaying concordance lines
religion_text.concordance('said', width=100)

In the religion text, the word “said” introduces what someone or something informs about something.

#Searching for context
#Creating an nltk.Text object
romance_text = nltk.Text(romance)
#Displaying concordance lines
romance_text.concordance('said', width=100)

In the romantic text, “said” introduces the characters’ speech. It is used in the first and third persons.
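
Beyond concordance lines, nltk.Text can also suggest words that occur in contexts similar to a target word; a quick, illustrative check on the fictional text:

#Displaying words that appear in contexts similar to 'said'
fiction_text.similar('said')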

From now on, I do not show the code; I just transcribe the output of the program.

3.2 One

In the fictional text, the word “one” is used as a numeral and as an indefinite pronoun:

In the mystery text, the word “one” is used as a numeral and as an indefinite pronoun:

In the religion text, the word “one” is used as a numeral and as an indefinite pronoun:

In the romantic text, the word “one” is used as a numeral and as an indefinite pronoun:

3.3 Man

In the fictional text, the word “man” designates specific men, who are characters.

In the mystery text, the word “man” designates specific men, who are characters, and is also used as a synonym of “husband”.

In the religion text, the word “man” is used synonymously with “mankind”.

In the romantic text, the word “man” is used for specific characters and as part of the predicate.

3.4 Answering the seventh question: What is the context in which three of the most frequent words are?

The texts use the words “said” and “man” differently, while using the word “one” in the same way, that is, as an indefinite pronoun and as a numeral. The use of “said” is often related to introducing the characters’ speech. The word “man” is used more variously: it appears as a synonym of mankind and of husband, and it is also used to refer to unnamed characters.

We can hypothesize that words like “one”, that is, numerals and pronouns, are frequent across different textual genres; that “said” tends to be used to introduce characters’ speech across genres; and that “man” often assumes different synonymous roles across genres. To test these hypotheses, a detailed study with a larger sample of textual genres would be necessary.
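
As a hypothetical starting point for such a study (not part of this project), one could compare the relative frequency of the three words across every Brown category:

#Relative frequency (%) of 'said', 'one', and 'man' in each Brown category (illustrative sketch)
for category in brown.categories():
    fd = nltk.FreqDist(w.lower() for w in brown.words(categories=category))
    print(category, {w: round(fd[w] / fd.N() * 100, 3) for w in ['said', 'one', 'man']})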

Final considerations

Through NLP, we can study the characteristics of different textual genres and capture their main patterns. Another possibility is the opportunity to form hypotheses that can be studied in more detail and then accepted or rejected.


Bruno Carloto

Welcome to Deep Analytics, a blog that takes a technical approach to the world of Analytics | LinkedIn: www.linkedin.com/in/bruno-rodrigues-carloto