NLP — Text Preprocessing

Buse Köseoğlu
3 min read · Dec 15, 2023


Image from https://hpccsystems.com/resources/understanding-natural-language/

Since the data used in NLP projects is text, it is unstructured, and, as in any other project, preparing the data before moving on to the model is very important. The preprocessing steps may change depending on the purpose of the project, so each step can be revised to fit it. The steps are generally as follows. Additional steps can also be added depending on the data at hand: for example, if you are working with tweets and the collected data contains usernames, an extra step can remove them (a minimal sketch of this follows the list).

  1. Removing Spaces
  2. Removing Punctuation
  3. Removing Numbers
  4. Lower Casing
  5. Stopwords Removal
  6. Stemming
  7. Lemmatization
  8. Tokenization
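
For instance, here is a minimal sketch of such an extra step (the regex and the sample tweet are assumptions, not from the dataset used below):

import re

# Hypothetical tweet; the @usernames are stripped before the other steps
sample = "@moviefan42 loved this film, would recommend @to_everyone"
cleaned = re.sub(r'@\w+', '', sample).strip()
print(cleaned)  # loved this film, would recommend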

We will use the movie reviews dataset available on Kaggle to examine the techniques above in practice.

Removing Spaces

There may be extra spaces at the beginning or end of a sentence, or between words. Cleaning up these anomalies lets us work with tidier data.

# Collapse repeated spaces into a single space
df["review"] = df["review"].str.replace(r' +', ' ', regex=True)

# Remove leading and trailing spaces
df["review"] = df["review"].str.strip()

Removing Punctuation

Punctuation marks are meaningful in some projects and noise in others. They are generally removed for models such as sentiment analysis and spam detection, but kept for text generation models, where they carry meaning.

import string

print(string.punctuation)

# Strip every punctuation character from each review
df['review'] = df['review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

The string.punctuation output is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
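
To illustrate, here is the same translate trick applied to an invented review:

import string

sample = "Great movie!!! Would watch again... 5/5, no doubt."
print(sample.translate(str.maketrans('', '', string.punctuation)))
# Great movie Would watch again 55 no doubt

Note that removing the slash merges "5/5" into "55"; leftovers like this are cleaned up by the number-removal step below.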

Removing Numbers

Just like punctuation marks, numbers can be removed from sentences when they do not carry meaning for the task.

df['review'] = df['review'].str.replace(r'\d+', '', regex=True)
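
On an invented string, the regex behaves like this (the leftover double spaces can be collapsed with the space-removal step above):

import pandas as pd

sample = pd.Series(["I rated it 9 out of 10 in 2023"])
print(sample.str.replace(r'\d+', '', regex=True)[0])
# 'I rated it  out of  in '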

Lower Casing

While the words “Happy” and “happy” mean the same thing to us when we read a sentence, they are different tokens to the machine. That’s why it’s important to convert all words to lowercase.

df['review'] = df['review'].str.lower()

Stopwords Removal

Stopwords exist in every language: they are words used frequently in the language but contribute little to the meaning of a sentence. Libraries provide default stopword lists for different languages, and new words can be added to these lists as needed. When choosing new words, take care that their removal does not cause a loss of meaning.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# A set makes the membership checks below fast
stop = set(stopwords.words('english'))

df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
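
As a sketch of extending the default list, the two added words below are assumptions chosen for a movie-review corpus, not part of NLTK:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
stop.update(['movie', 'film'])  # hypothetical domain-specific additions

sample = "the movie was not a good film at all"
print(' '.join(w for w in sample.split() if w not in stop))
# good

Note that the default English list already contains words such as "not", whose removal can flip the sentiment of a review; this is exactly the loss of meaning to watch out for.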

Stemming and Lemmatization

Stemming reduces a word to its root by stripping affixes. A point to keep in mind when using this method: it can sometimes produce words that do not exist, or trim a word too aggressively in the process.
Lemmatization is similar to stemming, but instead of chopping off endings it maps the word to its dictionary form (its lemma), so it always returns a word that actually exists.
So which one should we use, and where? If we are doing a meaning-oriented project (for example, a text generator), lemmatization makes more sense, because it gives us existing, real, meaningful words. However, for projects such as sentiment analysis or spam detection, where the important thing is to capture the pattern, stemming may yield better results. A side-by-side comparison follows the two examples below.

Stemming example:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stemming all rows
df['review_stemmed'] = df['review'].apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
df['review_stemmed']

Lemmatization example:

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer needs the WordNet data

# Initialize WordNet lemmatizer
wnl = WordNetLemmatizer()

df['review_lemma'] = df['review'].apply(lambda x: ' '.join(wnl.lemmatize(word) for word in x.split()))
df['review_lemma']
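
To see the difference described above, we can run both methods side by side on a few sample words:

from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
wnl = WordNetLemmatizer()

for word in ["studies", "caring", "movies"]:
    print(word, "->", stemmer.stem(word), "/", wnl.lemmatize(word))

# studies -> studi / study
# caring -> care / caring
# movies -> movi / movie

The stems "studi" and "movi" are not real words, illustrating the caveat above, while the lemmatizer always returns dictionary words ("caring" stays unchanged because the lemmatizer treats words as nouns by default).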

Tokenization

Tokenization is the process of breaking a paragraph or sentence into smaller pieces; most commonly, sentences are broken into words.
Each element in the resulting list is a token. Tokens do not have to be words: they can also include numbers and punctuation marks.

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # word_tokenize needs the Punkt tokenizer models

df['review_tokens'] = df['review_lemma'].apply(word_tokenize)
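
A quick illustration on an invented sentence shows that tokens are not only words:

from nltk.tokenize import word_tokenize

print(word_tokenize("The movie scored 8 out of 10!"))
# ['The', 'movie', 'scored', '8', 'out', 'of', '10', '!']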

Thank you for reading!
