Natural Language Processing Pipeline

A map for your studies and projects

Bruno Carloto
6 min read · Jul 19, 2021

Introduction

In this article, I cover the development of a Natural Language Processing (NLP) pipeline, so that interested readers have a guide to the stages of an NLP study or project.

For those looking for a definition, NLP is a sub-area of machine learning that works with natural language, whether dealing with text or audio. In technical terms, it studies the capabilities and limitations of machines in understanding human language. A practical application right now: although this text was written in English, you may well be reading it in your native language, if that is not English. I also emphasize that Machine Learning (ML) is a sub-area of Artificial Intelligence (AI).

NLP is the result of the interdisciplinarity between Linguistics and AI.

There are two interesting concepts to mention: Natural Language Generation Systems (NLGS) and Natural Language Understanding Systems (NLUS). NLGS convert information from computer databases into human-understandable language, whereas NLUS convert occurrences of human language into more formal representations that are more easily manipulated by computer programs.

Human language is not simple for a machine. One of its main challenging features is semantic ambiguity. We, as humans, have two fundamental factors that separate us from machines in terms of natural language: common cultural knowledge and prior experience.

Other important factors are context and tone of voice. Here you are faced with questions of feelings and emotions together. Machines have neither feelings nor emotions, nor can they intrinsically and truly understand them.

That said, it is possible to see that NLP is a bridge between human and machine. Furthermore, it is a great tool to help human beings automate tasks and streamline certain needs, such as, for example, producing specific, everyday documents.

Some other applications are sentiment analysis, contract review, machine translation, among others.

Now, let’s start!

Pipeline

We can divide the NLP development process into five stages: i) data acquisition, ii) data cleaning, iii) pre-processing, iv) training, and v) evaluation.

Let’s start with data acquisition.

  • Data acquisition

In technical terms, data acquisition in NLP is called corpus acquisition, since a dataset in NLP is named a corpus. Some methods of acquiring data are web scraping and crawling; Selenium, Requests and Beautiful Soup are tools for that. It is also possible to use data from databases such as SQL and Spark. There is also the possibility of using structured corpora from third parties, such as COLT, IMDB reviews and the Stanford Sentiment Treebank. From those, you can develop an initial NLP model, put it into interaction with users or clients and, thus, improve it.
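As a minimal sketch of acquisition by scraping, and assuming a placeholder URL for a page whose terms of use allow it, Requests and Beautiful Soup can be combined as follows:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page; replace with a source you are allowed to scrape.
URL = "https://example.com/reviews"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the visible paragraph text as raw documents for the corpus.
corpus = [p.get_text(strip=True) for p in soup.find_all("p")]
print(len(corpus), "paragraphs collected")
```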

You must be logical when acquiring a corpus, for the data must serve the needs of your project or study. Imagine two models, one aimed at education and the other at identifying toxicity. For one model you must eliminate offensive words, while for the other you allow them. Another factor to clean up is stereotyping tendencies.

  • Data cleaning

StopWords Removal

Generally, prepositions and articles are removed; the idea is to remove words that merely connect ideas. However, for other models they are important: Translation, Question and Answer (Q&A) and Natural Language Understanding (NLU) tasks can suffer from the loss of those words. Again, you must be logical in your work. In contrast, a model that fits this procedure well is a spam detector.

Below is an example of this procedure:
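A minimal sketch with NLTK, assuming its English stop-word list has been downloaded, could look like this:

```python
import nltk
from nltk.corpus import stopwords

# One-time download of the English stop-word list.
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

sentence = "It is a puppy and it is sleeping on the sofa"
tokens = sentence.lower().split()

# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['puppy', 'sleeping', 'sofa']
```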

Offensive words fit this example, as stated above.

  • Pre-processing

An example: Bag-of-Words (BOW)

This is simply the count of the terms present in a text. The procedure is to create a dictionary or list of the words present in your text and count their repetitions. Below is an illustration:

Note that there are five sentences on the left side of the frame: i) It is a puppy, ii) It is a kitten, iii) It is a cat, iv) That is a dog and this is a pen, and v) It is a matrix. Above the frame are words taken from some of the sentences, and below them is the count of how often each of those words occurs in each sentence. If you add up each column, you get the number of times each word appears across the texts. Here we have both unstructured and structured data: from the unstructured data, the raw material, we build the structured data, the central and fundamental part of the NLP process. It is necessary to pay attention when using this method: it is not an absolute rule, and you must master it to know when and how to apply it.
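The same counting can be sketched with scikit-learn's CountVectorizer (my choice of tool here; any term-counting routine produces the equivalent frame):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "It is a puppy",
    "It is a kitten",
    "It is a cat",
    "That is a dog and this is a pen",
    "It is a matrix",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

# Columns are the vocabulary words; each row counts them in one sentence.
print(vectorizer.get_feature_names_out())
print(bow.toarray())
```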

With this method you need to be aware of the Curse of Dimensionality so that your model does not become inefficient. Another issue is the need to normalize the importance of words. Finally, I highlight that this method does not differentiate common words from more specific words. Imagine a text that contains the word “very” 100 times and the word “basketball” only once. Although “very” appears 100 times, nothing can be inferred from it about the content. However, after reading the word “basketball” just once, it is possible to infer that the text talks about this sport. To solve this problem, TF-IDF (Term Frequency-Inverse Document Frequency) was developed.

TF-IDF

The idea behind this method is first to count a term “a” within a document “x” and divide that count by the number of terms within document “x”; this accounts for the repetition of terms. Then the number of documents is divided by the number of documents in which “a” appears, and the quotient is passed through a log. Below is the mathematical formula:
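Writing tf(a, x) for the count of term “a” in document “x” divided by the total number of terms in “x”, N for the number of documents, and df(a) for the number of documents in which “a” appears, a common way to express it is:

tf-idf(a, x) = tf(a, x) × log( N / df(a) )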

But that model does not deal with the Curse of Dimensionality: since its approach considers the entire document, many representations end up equal to zero, which is not significantly different from Bag-of-Words. To solve this issue, there is the Word2Vec algorithm.

Word2Vec

It is a neural-network algorithm that does not represent an entire document. Its approach is to consider that the meaning of a word is given by its context, or rather, by the neighboring words. I will not go into details about this algorithm, but I would like to point out that, for the subject of this topic, it has been one of the main tools.
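A minimal training sketch with Gensim (the toy corpus and the parameter values below are merely illustrative) could look like this:

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
corpus = [
    ["it", "is", "a", "puppy"],
    ["it", "is", "a", "kitten"],
    ["it", "is", "a", "cat"],
    ["that", "is", "a", "dog", "and", "this", "is", "a", "pen"],
    ["it", "is", "a", "matrix"],
]

# vector_size: embedding dimension; window: neighborhood size; sg=1 selects skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["puppy"][:5])           # first dimensions of the learned vector
print(model.wv.most_similar("puppy"))  # nearest words in the embedding space
```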

Finally, here is its objective function:
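In the skip-gram variant, for a text of T words and a window of size m around each position t, the model minimizes the average negative log-probability of the context words:

J(θ) = −(1/T) · Σ_{t=1..T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)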

As the window slides through the text, the probability of each context word is given by the following formula:
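In the skip-gram formulation with a softmax output, the probability of a context word o given a center word c is computed from their vectors u_o and v_c over the whole vocabulary V:

P(o | c) = exp(u_o · v_c) / Σ_{w ∈ V} exp(u_w · v_c)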

  • Training

There are numerous possibilities for applying NLP. To decide which model to use, if in doubt, start by consulting the literature. Build a baseline, as sketched below. Useful libraries include NLTK, Gensim and spaCy. It is not advisable to prescribe a single choice here when dealing with such comprehensive content.
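As one possible shape for such a baseline (scikit-learn is my assumption here; NLTK, Gensim or spaCy serve just as well), a TF-IDF representation feeding a simple classifier is usually enough to start:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset: 1 = positive review, 0 = negative review.
texts = ["great movie", "terrible plot", "loved the acting", "boring and slow"]
labels = [1, 0, 1, 0]

# Baseline: TF-IDF features into a logistic regression.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

print(baseline.predict(["loved this great movie"]))
```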

  • Evaluation

Just as there are many tasks, there are many metrics. So, as said above, consult the literature. There is already a considerable body of work demonstrating the performance of models and evaluation methods.
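For a classification task, for instance, a quick report of precision, recall and F1 with scikit-learn (again my own choice of library) might look like this:

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))
```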

Final words

To have a successful career, in terms of quality, it takes commitment, planning, study and action. I hope we can develop the world that way.

Leave your observations, corrections and tips below. I look forward to learning more.

Thank you for reading!


Bruno Carloto

Welcome to Deep Analytics, a blog that takes a technical look at the Analytics world | LinkedIn: www.linkedin.com/in/bruno-rodrigues-carloto