Computational Linguistics, or NLP, applied to linguistic and literary study

A practical approach

Bruno Carloto
11 min read · Feb 10, 2022

1. Introduction

Computational Linguistics is one of the most valuable fields of knowledge for linguists and literary scholars. The application of statistical and mathematical methods to linguistic and literary studies, with the aid of programming, allows a deeper understanding of these fields and also opens up, for example, the opportunity to rethink the teaching of Literature and Writing and to develop software based on Natural Language Processing (NLP).

In more specific terms, Computational Linguistics is the scientific study of language from a computational perspective. Its approaches are related to the linguistic and psycholinguistic spheres; from them, computational models are built that serve to answer questions within these scopes. The field is therefore multidisciplinary, combining Statistics, Artificial Intelligence, Computing and Linguistics.

Language is a mirror of the mind, so a computational understanding of language provides insights into thinking and intelligence. Based on the knowledge generated, communication between humans and machines becomes possible through elaborate intelligent software. For example, I cite a few:

  • speech recognition systems;
  • text-to-speech synthesizers;
  • automated voice response systems;
  • automatic translation systems;
  • intelligent virtual assistants.

Content

  1. Introduction
  2. Practice

2.1 Importing base libraries and corpus

2.2 Analyzing the lexicon

2.3 Analyzing some semantic aspects

2. Practice

2.1 Importing base libraries and corpus

I start by importing the base libraries.

#Importing base libraries
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

Next, I import the example texts that belong to the ecosystem of the NLTK tool. Of those, I select the book of Genesis, which serves as the corpus for this demonstrative study.

#Importing study texts
from nltk.book import *

Below, I display the corpus, The Book of Genesis. As can be observed, the full text is not shown because of the data type: the object only prints a short representation.

#Displaying the text of Genesis
text3, type(text3)

For better readability, I store the corpus in the variable genesis. Now we can quickly see which book/corpus we are working with.

#Storing the text of Genesis into a specified variable
genesis = text3

2.2 Analyzing the lexicon

An important feature of a natural language is its lexicon. Every speech, text, book and poem uses a lexical subset. The lexicon is the set of all existing words that compose a given natural language. Therefore, the book of Genesis presents a lexical subset of the English language.

Studying the lexical subset present in a text, or in a set of texts of a given textual genre, helps us understand that genre. If we study distinct textual genres, we can draw important conclusions about them.
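To make the idea concrete, here is a minimal sketch (using a hypothetical toy sentence, not code from the rest of this study) of what a lexical subset is: the set of distinct word types used by a text, which is a subset of the full English lexicon.

#A minimal sketch: the lexical subset (vocabulary) of a hypothetical toy sentence
toy_sentence = 'In the beginning God created the heaven and the earth'
toy_tokens = toy_sentence.lower().split()
toy_lexical_subset = set(toy_tokens)
print(toy_lexical_subset)      #distinct words only; the repeated 'the' appears once
print(len(toy_lexical_subset)) #size of this lexical subset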

We can view the text of Genesis as a whole by joining the words of the book into a single string. This is necessary because the words are stored separately, like this: [ 'In', 'the', 'beginning', … ]. In this universe, they are also called tokens.

To join them, we use the join function — join( ). Below, I show only the beginning of the text, to avoid generating a long output on the screen.

#Printing the text of Genesis
genesis_unique_string = ' '.join(genesis)
genesis_unique_string[:947]

Depending on the approach of the study, such as looking at the principal meaningful words of a text, a common procedure is to remove the stopwords from the text. Stopwords are words that do not carry informative meaning within themselves, such as in, the, our, etc. Below I show some stopwords of the English language.

#Importing stopwords feature
from nltk.corpus import stopwords
#Displaying some english stopwords
stopwords.words('english')[:10]

I can also study sentences. To generate sentences, I use the function sent_tokenize( ), which splits the text at sentence-final punctuation. In our case, I split the text into sentences in order to remove the stopwords sentence by sentence.

#Tokenizing sentences
sentences_of_genesis = nltk.sent_tokenize(genesis_unique_string)
sentences_of_genesis[:10]

In the first code, I remove the stopwords. In the second code, I check whether any stopwords remain.

#Creating a variable without stopwords
genesis_without_stopwords = []
for i in range(len(sentences_of_genesis)):
    words_of_sentence = nltk.word_tokenize(sentences_of_genesis[i])
    genesis_without_stopwords.append([word for word in words_of_sentence
                                      if word not in stopwords.words('english')])

#Confirming whether any stopwords remain
remaining_stopwords = [word for sentence in genesis_without_stopwords
                       for word in sentence
                       if word in stopwords.words('english')]
if not remaining_stopwords:
    print('There are not stopwords.')
else:
    print('There are stopwords.')

According to the second code, the first code worked fine and there are no more stopwords.

The first code broke the text into tokens, grouped by sentence. Below, I show that.

#Displaying the text of Genesis without stopwords
print(genesis_without_stopwords[:10])

Analyzing the generated text, we can see that there are still words irrelevant to the context of our study, such as in. So I write code to remove them. For that, I use the concept of part of speech; the function that handles this is pos_tag( ). In the first code, I build a variable that receives every word and its respective part-of-speech tag. In the second code, I keep only the words that belong to the parts of speech that interest me.

#Defining parts of speech of each word of the text of Genesis
sentences_pos_tag = []
for word in genesis_without_stopwords:
    sentences_pos_tag.append(nltk.pos_tag(word))

#Dropping non-interesting parts of speech from the text of Genesis
genesis_without_stopwords_2 = []
for i in range(len(sentences_pos_tag)):
    for j in range(len(sentences_pos_tag[i])):
        if sentences_pos_tag[i][j][1] in ['FW', 'JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP',
                                          'NNPS', 'POS', 'RB', 'RBR', 'RBS', 'UH', 'VB',
                                          'VBG', 'VBD', 'VBN', 'VBP', 'VBZ']:
            genesis_without_stopwords_2.append(sentences_pos_tag[i][j][0])

As we can see below, in is no longer present.

#New genesis without stopwords
print(genesis_without_stopwords_2[:10])

Lastly, I remove the characters shown in the code below and convert every letter to lowercase:

#Transforming words
genesis_without_stopwords_4 = []
for word in genesis_without_stopwords_2:
    w = re.sub(r"[!@#$£%¢¨¬&*()_\-=+§\\|₢,.;:^\[\]{ªº}']", '', word) #Dropping unnecessary punctuation
    w = w.lower()
    genesis_without_stopwords_4.append(w)

As can be seen, the procedure worked correctly.

#Printing result
print(genesis_without_stopwords_4[:101])

Now, we have 2,482 distinct words.

#Counting the number of distinct words
len(set(genesis_without_stopwords_4))

To generate a word cloud, I join the remaining words/tokens using the join function.

#Generating a unique string
cleaned_genesis = ' '.join(genesis_without_stopwords_4)
cleaned_genesis[:1000]

Then, I generate the word cloud.

#Create and generate wordcloud
wordcloud = WordCloud().generate(cleaned_genesis)
#Display the generated image
plt.figure(figsize=(20,20))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The word cloud is useful for showing the most frequent words in a given text. From it, we can infer some conclusions about the plot of the text, such as characters, actions, environments, etc., in addition to finding stopwords.
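As a side note, if leftover stopwords still dominate the cloud, the WordCloud class accepts a stopwords parameter. The sketch below reuses the cleaned_genesis string built above; the extra words added to the set are hypothetical examples, not a claim about this corpus.

#A minimal sketch: excluding words at word-cloud time via the stopwords parameter
extra_stopwords = set(stopwords.words('english')) | {'unto', 'thou', 'thee'} #hypothetical extra words
wordcloud_filtered = WordCloud(stopwords=extra_stopwords).generate(cleaned_genesis)
plt.figure(figsize=(20,20))
plt.imshow(wordcloud_filtered)
plt.axis('off')
plt.show()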

To know the exact frequency of words, I generate a word frequency dictionary. Then, I make a dataframe with the dictionary.

#Generating a frequency dictionary
genesis_freq_dist = FreqDist(genesis_without_stopwords_4)
genesis_freq_dist, type(genesis_freq_dist)

#Generating a dataframe of terms and their counts
dist_freq_df = pd.DataFrame({'term': genesis_freq_dist.keys(),
                             'count': genesis_freq_dist.values()}).sort_values('count', ascending=False)

#Displaying the dataframe and correcting the index
dist_freq_df.index = [ind for ind in range(len(genesis_freq_dist))]
dist_freq_df.head(10)

We can also measure lexical diversity. Lexical diversity tells us about the diversity of a text in terms of its lexical subset, that is, how many times on average a word appears in the text. For that, I define the following function:

#Defining lexical diversity function
def lexical_diversity(text):
    return print('On average, a clean word appears {} times in the text of Genesis.'.format(round(len(text)/len(set(text)), 2)))

Below, I define the lexical diversity rate. While the function above shows how many times on average a word appears in the text, this one shows, as a percentage, how diverse the text is. The closer to 100%, the fewer words are repeated, that is, the more diversified the text; the closer to 0%, the more words are repeated.

#Defining percentage of lexical diversity function
def lexical_diversity_rate(text):
    return print('The lexical diversity is about {}%.'.format(round((len(set(text))/len(text)) * 100, 2)))

#Counting the lexical diversity
lexical_diversity(genesis_without_stopwords_4)

#Counting the lexical diversity rate
lexical_diversity_rate(genesis_without_stopwords_4)

To know whether a text is really diversified in terms of its lexical subset, it is necessary to have a known reference pattern for the studied genre. Consequently, I do not draw conclusions from this value alone.
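For illustration only, a quick way to start building such a reference is to compute the same rate for other example texts loaded by nltk.book. The sketch below uses the raw, uncleaned tokens (no stopword removal or lowercasing), so its numbers are not directly comparable with the cleaned Genesis figure.

#A rough comparison sketch: lexical diversity rate of some nltk.book texts on raw tokens
for raw_text in [text1, text2, text3]:
    raw_rate = round((len(set(raw_text)) / len(raw_text)) * 100, 2)
    print(raw_text.name, '->', raw_rate, '%')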

Still, I plot the chart below, showing the position of the lexical diversity rate (red data point). The X and Y axes simply span the values between 0% and 100%. The closer to 100%, the more distinct words there are in the text and, therefore, the fewer words are repeated; the closer to 0%, the fewer distinct words there are and, consequently, the more words are repeated.

#Generating a comparative chart on the lexical diversity rate of the text of Genesis
plt.figure(figsize=(10,4))
plt.plot(np.arange(0, 100, 10), np.arange(0, 100, 10), marker='o', linestyle='', label='Comparison point')
plt.plot(14.61, 14.61, marker='o', color='red', linestyle='', label='Lexical Diversity Rate')
plt.legend()
plt.title('Lexical diversity rate line', fontsize=15)
plt.ylabel('%', fontsize=12)
plt.xlabel('%', fontsize=12)
plt.show()

After observing the above, I demonstrate the lexical dispersion of some of the most frequent words; more specifically, the 20 most frequent. This can be done with a lexical dispersion plot, which lets us observe where a set of words starts and ends in the text, which ones are the most frequent, the semantic aspect of a certain passage of the text, etc.

#Transforming list into nltk text
genesis_nltk = nltk.Text(genesis_without_stopwords_4)
genesis_nltk
#Plotting word offsets
plt.figure(figsize=(15,5))
genesis_nltk.dispersion_plot(list(dist_freq_df['term'][:20].values));

As we can see, for example, the word God appears more frequently at the beginning of the book and less frequently toward the end. The word Jacob only appears from the middle of the book. A word can be the most frequent and yet appear in only a few passages, while another can appear throughout many passages despite a lower overall frequency.
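To quantify that observation, the sketch below uses a small helper function defined here only for illustration (it is not an NLTK function). It reports how often a word occurs in the cleaned, lowercased token list genesis_nltk and over which span of token offsets it is spread.

#A minimal sketch: frequency versus spread of a word in the cleaned text
def word_span(tokens, word):
    offsets = [i for i, w in enumerate(tokens) if w == word]
    if not offsets:
        return None
    return {'count': len(offsets), 'first_offset': offsets[0], 'last_offset': offsets[-1]}

print(word_span(genesis_nltk, 'god'))
print(word_span(genesis_nltk, 'jacob'))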

We can build a column containing the word presence rate, giving us another useful metric.

#Building word presence rate function
def word_presence_rate(text, word):
    return round((text.count(word)/len(text))*100, 4)

#Generating word presence list
word_presence_list = []
text = genesis_nltk
for w in set(genesis_nltk):
    word_presence_list.append(word_presence_rate(text, w))

#Storing word presence list into the dataframe
dist_freq_df['word_presence%'] = sorted(word_presence_list, reverse=True)

#Displaying the result
dist_freq_df.head(15)

Above is the new column created — word_presence%. If we sum the values of that column, we observe that the result is about 100%; consequently, the procedure is correct.

#Summing the percentages to check whether the total is 100%
round(dist_freq_df['word_presence%'].sum(), 2)

Below, I demonstrate that it is possible to extract descriptive statistics from the text:

#Generating a statistical summary
dist_freq_df.describe()

For example, the table indicates that 75% of the words in the book of Genesis appear at most four times, ignoring the stopwords that were removed.
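That reading can be checked directly against the 75% quantile of the count column; the sketch below also computes the share of words appearing at most that many times.

#Checking the 75% quantile of the word counts and the share of words at or below it
q75 = dist_freq_df['count'].quantile(0.75)
share_at_or_below = (dist_freq_df['count'] <= q75).mean() * 100
print(q75, round(share_at_or_below, 2), '%')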

Below, I store in a new variable only the words whose frequency is at least 10% of the frequency of the most frequent word.

#Selecting words whose frequency is at least 10% of that of the most frequent word
optimized_dist_freq_df = dist_freq_df[dist_freq_df['word_presence%'] > (dist_freq_df['word_presence%'].max() * 0.1)]
#Displaying optimized df
optimized_dist_freq_df.head(10)

2.3 Analyzing some semantic aspects

Another important aspect is semantics. The semantics present in a text is what gives meaning to the narrative and reveals the intentions of the author or of the characters. We can understand the patterns in which the author uses a word or a set of words.

Using the concordance( ) function, we can examine the contexts in which certain words are inserted. That can be seen in the next code blocks.

#Searching the context of word said
genesis.concordance('said', width=83, lines=20)

The word said appears frequently in connection with the word God, which suggests the superiority of God over the creation.

#Searching the context of word shall
genesis.concordance('shall', width=80, lines=20)

The word God is frequently related to the word shall.

Using the common_contexts( ) function, we can find the contexts that a set of words shares.

#Searching the common context
genesis.common_contexts(['God', 'said'])

Above, we see that the words God and said share contexts where the word lord is present. We can think of God as a lord who says something; God is not an enemy, he is a lord.

#Searching the common context
genesis.common_contexts(['God', 'Jacob'])

Regarding the words God and Jacob, they share contexts involving actions, such as saying, blessing, going, hearing and having. We can infer that the two characters relate to each other.

#Searching the common context
genesis.common_contexts(['Jacob', 'God'])

If we invert the order of the words, a different order of contexts can be returned.

We can understand how the author uses a word in relation to semantics. A word may be used by one author with a negative connotation, while by another it may be used with a positive connotation.

#Searching similar words
genesis.similar('God', num=100)

According to the similar( ) function, the word God appears in contexts similar to those of the words listed above. The same logic applies to the examples below.

#Searching similar words
genesis.similar('said', num=100)
#Searching similar words
genesis.similar('Jacob', num=100)

Finally, the collocations( ) function returns a set of bigrams, that is, pairs of words that occur together unusually often. Through this, we can make inferences about the narrative of a text or its information.

#Searching collocations
genesis.collocations(num=100)
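For contrast, the sketch below counts raw bigram frequencies with nltk.bigrams and FreqDist on the original token list. Raw frequency is not the same as collocation strength: collocations also discount pairs that are frequent only because their individual words are frequent.

#A contrast sketch: most frequent raw bigrams in the original token list
bigram_freq = FreqDist(nltk.bigrams(genesis))
for pair, count in bigram_freq.most_common(10):
    print(pair, count)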

Final considerations

I hope this content has been helpful to you. There is more on my blog, where I cover artificial intelligence and machine learning, as well as other material on Natural Language Processing and Computational Linguistics.
