Eurovision Song Analysis: Topic Analysis and NER
In this project, I examined the songs that have competed in Eurovision. The aim of the project is to identify the differences between winning and losing songs and to offer suggestions accordingly. You can find the full code in the GitHub repo.
Project steps:
- Data Preprocessing
- Topic Modelling
- Post-Processing
- Sentiment Analysis
- Named Entity Recognition
- Results
Data Preprocessing
Data preprocessing is an important step in NLP, as in other machine learning projects: the cleaner we can make the data, the better the results we get. Since we work with unstructured data, the preprocessing steps differ from those used for structured data. You can find the article in which I explain the data preprocessing steps for NLP here.
Data Cleaning: Data cleaning removes noisy or irrelevant content. Steps such as stripping special characters, numbers, and punctuation marks, correcting typos, and removing irregularities improve the quality of the data.
To remove extra whitespace, numbers, and special characters from the data:
import re
import numpy as np

def remove_spaces(df):
    # replace newlines with spaces
    df["Lyrics translation"] = df["Lyrics translation"].str.replace('\n', ' ')
    # convert multiple spaces to single spaces
    df["Lyrics translation"] = df["Lyrics translation"].str.replace(r' \s+', ' ', regex=True)
    # remove leading and trailing spaces
    df["Lyrics translation"] = df["Lyrics translation"].str.replace(r'^\s+', '', regex=True)  # front
    df["Lyrics translation"] = df["Lyrics translation"].str.replace(r'\s+$', '', regex=True)  # end
    return df

df = remove_spaces(df)

def norm_doc(single_doc):
    single_doc = re.sub(r"[^\w\s]", " ", single_doc)    # remove special characters
    single_doc = re.sub(r"\w*[0-9]+", " ", single_doc)  # remove numbers and anything attached to them
    return single_doc

norm_docs = np.vectorize(norm_doc)
df["Lyrics translation"] = norm_docs(df["Lyrics translation"])
Additionally, converting the text to all lowercase eliminates the difference between the words “happy” and “Happy”. Contractions also need attention: “won’t” and “will not” mean the same thing, but they look like different tokens when the short form is used. That’s why we expand such contractions; other contractions can be added in the same way.
# Create a new lowercase prep_lyric column. All further operations are performed on this column.
df["prep_lyric"] = df["Lyrics translation"].str.lower()
# Expand contractions
df["prep_lyric"] = (df["prep_lyric"]
                    .str.replace("won't", "will not", regex=False)
                    .str.replace("can't", "can not", regex=False)
                    .str.replace("n't", " not", regex=False))
Lemmatization: Lemmatization reduces the words in a text to their base forms, consolidating word variations into a single representation. This step captures semantic similarity at the word level and is usually based on grammar rules.
# Apply the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    words = text.split()
    words = [lemmatizer.lemmatize(word, pos='v') for word in words]
    return ' '.join(words)

df['prep_lyric'] = df['prep_lyric'].apply(lemmatize_words)
Removing Stop Words: Stop words are frequently used words such as grammatical conjunctions, prepositions, and pronouns. As these words often do not have semantic significance, they can create noise in the analysis and modeling processes. Stop words removal aims to obtain more focused and meaningful texts by filtering these words.
The NLTK library provides a ready-made stopword list for English. However, when we examine the data, we can see words that fit the definition of a stopword but are not in the library's list. That's why we add these words to the stopword list ourselves.
# removing stopwords
from nltk.corpus import stopwords

stop = stopwords.words('english')
# Adding words that could be stopwords for this dataset
stop = stop + ["lala","ooh","sing","gonna","wanna","hey","new","nana","lalala","gotta","yes","may","lala","gave","ever","many","without","much","let","yeah","take","make","know","would","tell","make",
"come","get","one","two","want","see","away","look","always","give","say","even","every","everything","everybody","run","could","should","day","cause","chorus","yay","duh",
"yum","lai","lee","mikado","true","diggi","aah","bang","shi","dong","tom","pillibi","boom","duy","cululoo","para","nanana","jennie","rimi","hule","lalalala","moni","ela","dap",
"yerde","hoo","nigar","aman","hiah","dai","doba","gasimov","nananana"]
df["prep_lyric"] = df["prep_lyric"].str.lower().str.split()
df["prep_lyric"] = df["prep_lyric"].apply(lambda x: [item for item in x if item not in stop])
df = df.assign(prep_lyric=df.prep_lyric.map(' '.join))
Visualization
We can examine the cleaned data through word frequencies. First, we will look at the most frequently occurring words.
text = " ".join(review for review in df.prep_lyric)
print ("There are {} words in the combination of all review.".format(len(text)))
# Import Counter
from collections import Counter
# Join all word corpus
review_words = ','.join(list(df['prep_lyric'].values))
# Count and find the 30 most frequent
counter = Counter(review_words.split())
most_frequent = counter.most_common(30)
# Bar plot of frequent words
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(1, figsize=(10, 6))
most_frequent_df = pd.DataFrame(most_frequent, columns=("words", "count"))
sns.barplot(x='words', y='count', data=most_frequent_df, palette='winter')
plt.title("Most Frequent Words");
plt.xticks(rotation=45);
We get the following graph as output. When we examine this graph, we can see that, not surprisingly, the most used word is “love”.
Another visual we can examine is a WordCloud. Word clouds are frequently used in NLP projects and are built from word frequencies: the more often a word occurs, the larger it is drawn. If desired, the cloud can also be shaped with images such as logos. To create a WordCloud:
# Generate a word cloud image
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white").generate(text)
# Display the generated image with matplotlib:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Topic Modelling
In topic modeling, abstract topics are produced by clustering words that frequently appear together in the text, and relevant texts are placed in one or more clusters that are closest to them according to the words they contain.
TF-IDF
- TfidfVectorizer is a vectorization tool used to perform the “Term Frequency-Inverse Document Frequency” (TF-IDF) conversion. It is a component frequently used in text analysis studies such as topic modeling.
- TfidfVectorizer is used to represent text data with a numeric vector. This vectorization process uses TF-IDF scores to determine the importance of words in each document.
- TF (Term Frequency) refers to how often a word occurs in a document. The higher the frequency of a word in the document, the more important that word is for that document.
- IDF (Inverse Document Frequency) refers to how rare a word is across all documents. Rare words usually carry more information and have more distinctive features, as the small example below illustrates.
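As a quick, hypothetical illustration (not part of the project pipeline), the snippet below shows how TfidfVectorizer weights a word shared by two toy documents versus words unique to one of them; the toy_docs corpus is made up for this example.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Two toy documents that share the word "love"
toy_docs = ["love will find a way", "love me and my dog"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# "love" occurs in both documents, so it receives a lower IDF (and TF-IDF) weight
# than words that appear in only one document, such as "dog" or "way".
print(pd.DataFrame(toy_matrix.toarray(),
                   columns=toy_vectorizer.get_feature_names_out()))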
LDA
- LDA (Latent Dirichlet Allocation) is a probabilistic model for text datasets and a widely used method for topic analysis.
- LDA assumes that each document belongs to one or more topics and aims to find the distribution of these topics in the documents. Based on the probability distribution of the documents in the text dataset, LDA tries to find the hidden topics and the distribution of the words they contain.
First, we transform the data with TF-IDF Vectorizer. Then we create an LDA object. Here we use GridSearchCV to determine the optimum parameters.
# TF-IDF Vectorizer for the lyric data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

vectorizer = TfidfVectorizer(stop_words=stop, min_df=0.1, max_df=0.20)
tfidf = vectorizer.fit_transform(df['prep_lyric'])
# Define Search Param
search_params = {'n_components': range(5,15), 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5 ,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(tfidf)
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
After training the model, the parameters that give the best performance are {‘learning_decay’: 0.5, ‘n_components’: 5}. Then, we determine the dominant topic for each song with the model.
# Create the document-topic matrix
lda_output = best_lda_model.transform(tfidf)
# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
# index names
docnames = ["Doc" + str(i) for i in range(len(df))]
# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
# distribution of topics
df_document_topic.dominant_topic.value_counts()
When producing results, LDA assigns numbers to the topics it finds. If we examine the distribution of dominant topics accordingly, we get the following result.
Here we can see that most songs are assigned to topic 0. But of course, topic names that are just numbers do not mean anything to us. That's why we need to examine the songs assigned to each topic and choose a name accordingly. To do this, we look at the 15 most frequently used words in each topic. We can represent this as a dataframe.
# Show top n keywords for each topic
def show_topics(vectorizer, lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
We can assign names to the topics according to the words here. The names below are the ones I came up with; you can refine them further or customize them as needed. A sketch for attaching them to the documents follows the list.
- Topic 0: “Moving On and Letting Go”
- Topic 1: “Emotional Connections”
- Topic 2: “Dreams and Wonder”
- Topic 3: “Struggles and Determination”
- Topic 4: “Joyful Rhythms and Togetherness”
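One simple way to attach these labels to the documents is to map the dominant topic numbers to the chosen names. A minimal sketch, assuming the df_document_topic dataframe built above (the topic_names dictionary and the new column name are illustrative):

# Map LDA topic numbers to the human-readable names chosen above
topic_names = {
    0: "Moving On and Letting Go",
    1: "Emotional Connections",
    2: "Dreams and Wonder",
    3: "Struggles and Determination",
    4: "Joyful Rhythms and Togetherness",
}

df_document_topic["dominant_topic_name"] = df_document_topic["dominant_topic"].map(topic_names)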
Post-Processing Data
In post-processing, the LDA results are merged back into the main dataframe. Songs whose highest topic probability is below 40% were moved to a separate dataframe, since such a probability was considered too low to assign a single topic. For these songs, if all topics sit at roughly 20% probability, the dominant topic is set to “mix”; otherwise, the two most probable topics (each below 40%) are combined to form the song's dominant topic.
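A minimal sketch of this rule, applied to a single row of topic probabilities from the document-topic matrix; the helper name and the tolerance used to detect “all topics at roughly 20%” are my own assumptions, not taken from the notebook.

import numpy as np

def combine_low_confidence_topics(topic_probs, topic_names):
    """Label a song whose highest topic probability is below 40%."""
    topic_probs = np.asarray(topic_probs)
    # If every topic sits at roughly 20%, the song has no dominant theme
    if np.allclose(topic_probs, 0.20, atol=0.02):
        return "mix"
    # Otherwise combine the two most probable topics into one joint label
    top_two = np.argsort(topic_probs)[::-1][:2]
    return topic_names[top_two[0]] + " & " + topic_names[top_two[1]]

This function would then be applied row by row to the songs in the low-probability dataframe.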
You can access the details of this part from the notebook.
Sentiment Analysis
Sentiment analysis is a method for understanding and classifying the emotional content in data such as text, audio, or images. It uses linguistic and statistical methods to determine the emotional state expressed by a text or a user.
Sentiment analysis aims to identify the emotional tone, reactions, or states expressed in texts. It is often used to classify texts into three basic categories: positive, negative, and neutral.
We will use SentimentIntensityAnalyzer from the NLTK library.
# Initializing the SentimentIntensityAnalyzer (requires nltk.download('vader_lexicon'))
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Applying sentiment analysis on the 'prep_lyric' column and adding the results as new columns in the DataFrame
result_df['polarity'] = result_df['prep_lyric'].apply(lambda x: analyzer.polarity_scores(x))
result_df = pd.concat(
[result_df.drop(['polarity'], axis=1),
result_df['polarity'].apply(pd.Series)], axis=1)
# Mapping the compound score to sentiment labels ('positive', 'neutral', 'negative')
result_df['sentiment'] = result_df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
# Dropping the intermediate sentiment analysis columns from the DataFrame
result_df.drop(["neg", "neu", "pos", "compound"], axis=1, inplace=True)
# Distribution of sentiment labels
sns.countplot(y='sentiment',
data=result_df,
palette=['#b2d8d8',"#008080", '#db3d13']
);
According to the SentimentIntensityAnalyzer results, positive sentiment clearly dominates the songs.
NER (Named Entity Recognition)
Named Entity Recognition (NER) is a technique used in natural language processing. NER is used to recognize and label proper names, personal names, organizations, places, dates, currencies and other important entities in texts.
NER uses linguistic and statistical methods to recognize proper nouns or entities that belong to a particular category in a text. These entities are typically expressed as nouns or noun phrases, such as personal names or place names.
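As an illustration, here is a minimal sketch of how such entities could be extracted from the lyrics with spaCy. This library choice is an assumption on my part (the notebook may use a different tool), and the GPE, NORP, and PERSON labels roughly cover the country, nationality, and person mentions analyzed in the results below.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entities(text, labels=("GPE", "NORP", "PERSON")):
    """Return (text, label) pairs for country, nationality, and person mentions."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]

# Run NER on the original (non-lowercased) lyrics, since capitalization helps the model
df["entities"] = df["Lyrics translation"].apply(extract_entities)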
You can access the details of this part from the notebook.
Results
The graphs above were examined first over all years and then for the last 10 years.
- In the first two charts, the winning and losing songs were analyzed according to their topics. First, looking at the winning songs, the topic that wins most often is Struggles and Determination, followed by Joyful Rhythms and Togetherness. Among the losing songs, Moving On and Letting Go is the theme that loses most often, followed by Emotional Connections. Based on these two observations, songs with very emotional, separation-style themes are more likely to lose than others. Secondly, when the graph of the losing songs is examined, many of them contain more than one topic; such mixed-topic songs tend to lose. The conclusion to be drawn is that a song should focus on a single topic in order to be understandable and to influence the listener.
- When the songs are analyzed by sentiment, we can see that positive songs carry the most weight both overall and over the last 10 years. The thing to note here is that songs that can be described as neutral are more likely to lose.
- The themes of the songs from the last 10 years are also examined separately, as they are closest to the present. Here, for both the winning and losing songs, the most frequent topic is Struggles and Determination. Therefore, it may be riskier to use this topic in new songs; continuing with Joyful Rhythms and Togetherness looks safer.
The two charts above show the analysis of Country/Nationality names from named entity recognition results by winning and losing songs. These charts show that songs with Country/Nationality-related words are more likely to be among the losing songs. For this reason, care should be taken not to mention the name of any nation or country while writing the songs.
In the above graph, similar to the examination of countries and nationalities, this time the names of individuals mentioned in the songs have been taken into account. When examining the graphs, it has been determined that there are slightly more personal names in the losing songs compared to the winning songs. Therefore, while it is not directly recommended to exclude personal names, caution is advised when using them.
In the chart above, the distribution of topics over the years is examined. According to this graph, Moving On and Letting Go was at the forefront in the early years, but it has given way to Struggles and Determination toward the present day.