NLP — Text Preprocessing
Since the data used in NLP projects is text, it is unstructured, and, as in any other project, it is very important to prepare the data before moving on to the model. The preprocessing steps may change depending on the purpose of the project, so each step can be revised accordingly. The steps are generally as follows; different steps can also be added if the data requires it (for example, if you are working with tweets and the collected data includes user names, an extra step can be added to remove them).
- Removing spaces
- Removing punctuation
- Removing Numbers
- Lower Casing
- Stopwords Removal
- Stemming
- Lemmatization
- Tokenization
We will use the movie reviews data available on Kaggle to examine the above techniques in practice.
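Before applying these steps, we load the reviews into a pandas DataFrame. A minimal sketch is below; the file name is an assumption based on the IMDB movie reviews dataset on Kaggle, and the text column is assumed to be named "review", as used throughout the examples.
import pandas as pd
# Load the reviews (the file name is an assumption; adjust it to your download).
df = pd.read_csv("IMDB Dataset.csv")
df["review"].head()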
Removing spaces
There may be extra spaces at the beginning or end of sentences, or between words. Correcting these anomalies allows us to work with cleaner data.
# Collapse repeated spaces between words into a single space.
df["review"] = df["review"].str.replace(r' +', ' ', regex=True)
# Remove leading and trailing spaces.
df["review"] = df["review"].str.strip()
Removing Punctuation
Punctuation marks carry no meaning in some projects but are meaningful in others. They are generally removed for models such as sentiment analysis and spam detection, but kept in text generation models, where they contribute to the meaning of the generated text.
import string
print(string.punctuation)
df['review'] = df['review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
string.punctuation output is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
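As a one-off illustration of how str.maketrans and translate work together (the example sentence is made up):
sample = "Great movie!!! Would watch again, 10/10."
# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)
print(sample.translate(table))
# Output: "Great movie Would watch again 1010"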
Removing Numbers
Numbers, just like punctuation marks, can be removed from sentences when they do not carry meaning for the task.
df['review'] = df['review'].str.replace(r'\d+', '', regex=True)
Lower Casing
While the words “Happy” and “happy” mean the same thing to us when we read a sentence, the machine treats them as two different tokens. That’s why it’s important to convert all words to lowercase.
df['review'] = df['review'].str.lower()
Stopwords Removal
Stopwords exist in every language; they are words that are used frequently but contribute little to the meaning of a sentence. Libraries provide default stopword lists for different languages, and new words can be added to these lists as needed. When choosing these additional words, care should be taken to ensure that no meaning is lost.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
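As mentioned above, the default list can be extended with project-specific words. The sketch below adds a few words that are assumptions for this review dataset ('movie', 'film', 'br'); whether they are safe to drop should be checked against your own data.
# Extend the default list with illustrative, domain-specific words.
custom_stop = set(stop) | {'movie', 'film', 'br'}
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in custom_stop]))
Using a set instead of a list also makes the membership check faster on large datasets.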
Stemming and Lemmatization
Stemming is the method of reducing a word to its root. An important point to keep in mind is that stemming can produce words that do not actually exist, or trim a word too aggressively in the process of reducing it to its root.
Lemmatization is similar to stemming, but here the word is reduced to its dictionary form (the lemma). In other words, the reduction is made according to dictionary meanings, and the result is always a word that actually exists.
So which one should we use, and where? If we are doing a meaning-oriented project (for example, a text generator), it makes more sense to use lemmatization, because it gives us existing, real, meaningful words. However, if we are working on projects such as sentiment analysis or spam detection, stemming may yield better results, since the important thing there is to capture the pattern.
Stemming example:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Stemming all rows
df['review_stemmed'] = df['review'].apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
df['review_stemmed']
Lemmatization example:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data is required by the lemmatizer
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()
df['review_lemma'] = df['review'].apply(lambda x: ' '.join(wnl.lemmatize(word) for word in x.split()))
df['review_lemma']
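Running both on a few arbitrarily chosen words makes the difference visible (the stemmer and lemmatizer objects are the ones created above):
for word in ['studies', 'caring', 'movies', 'better']:
    print(word, '->', stemmer.stem(word), '|', wnl.lemmatize(word))
# studies -> studi  | study    (the stem is not a real word)
# caring  -> care   | caring   (without a POS tag the lemmatizer treats it as a noun)
# movies  -> movi   | movie
# better  -> better | better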
Tokenization
Tokenization is the process of breaking a paragraph or sentence into smaller parts. Generally, tokenization is performed by breaking sentences into words.
Each element in the resulting list is a token. These elements may include numbers and punctuation marks. They don’t have to consist of just words.
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # Punkt tokenizer models are required by word_tokenize
df['review_tokens'] = df['review_lemma'].apply(word_tokenize)
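As a standalone illustration on a made-up sentence, word_tokenize keeps numbers and punctuation marks as separate tokens:
print(word_tokenize("The movie scored 9 out of 10, didn't it?"))
# ['The', 'movie', 'scored', '9', 'out', 'of', '10', ',', 'did', "n't", 'it', '?']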
Thank you for reading!