Data cleaning

Implementation

The text will be cleaned with standard NLP preprocessing steps: removal of punctuation, numbers, hyperlinks, and stop words, and conversion of all characters to lowercase. Reported state-of-the-art results suggest that lemmatization and stemming have not improved sentiment classification performance on Spanish text; to check this, multiple versions of the dataset will be generated to test the effects of those techniques.
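
As a rough sketch of the stop-word and number removal described above, the following uses only the Python standard library. The function name `drop_stopwords_and_numbers` and the tiny `SPANISH_STOPWORDS` set are illustrative assumptions, not part of the original notebook; a real run would more likely rely on a complete list such as NLTK's Spanish stop words.

```python
# Illustrative subset only; not a complete Spanish stop-word list.
SPANISH_STOPWORDS = {"que", "la", "de", "no", "el", "es", "en", "una", "las", "se"}


def drop_stopwords_and_numbers(text, stopwords=SPANISH_STOPWORDS):
    """Lowercase the text, then drop purely numeric tokens and stop words."""
    tokens = text.lower().split()
    kept = [t for t in tokens if not t.isdigit() and t not in stopwords]
    return " ".join(kept)


print(drop_stopwords_and_numbers("En 1800 nace NatTurner el esclavo rebelde"))
```

Lowercasing happens before the stop-word lookup so that capitalized forms such as "En" are matched as well.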

Tweets cleaning

Our original dataset, labeled_set.csv, has two columns, “text” and “label”. The “text” column holds the tweets we gathered, and the values 0 and 1 in the “label” column mark non-misogynistic and misogynistic tweets, respectively.

Right now the dataset is unusable because it contains too much “noise”: the text is full of special characters, emojis, and website links. We need to remove this noise before doing any further analysis.

Here is the data cleaning process, step by step:

Import necessary libraries and packages

import pandas as pd
import numpy as np

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

import matplotlib.pyplot as plt
%matplotlib inline
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

from nltk.sentiment import SentimentIntensityAnalyzer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Load the original dataset

tweets = pd.read_csv('../code/labeled_set.csv', encoding="ISO-8859-1")
tweets.head(10)

	text	label
0	?? no sé si das más pena tú o tu terrible ort...	0
1	Cuando todo se va al infierno, la gente que e...	0
2	- En 1800 nace #NatTurner, el esclavo rebelde ...	0
3	entre muchas otras Muere el 6 de nov 2015 ??...	0
4	era la maldición de muchas familias.	0
5	https://t.co/I8JcEowZed #FelizLunes #FelizLun...	0
6	La rola de #LaLocaDelSenado	0
7	oh Dios mío!, desde cuando tienen eso ?	0
8	sexistas y totalmente inadecuados para su eda...	0
9	- Te ves bien crudo, ¿estuvo bueno el fin? - N...	0

Strip any spaces from the column names, rename the “text” column to “tweets”, and check the data type of each column

tweets.columns = tweets.columns.str.replace(' ', '')
tweets.rename(columns={'text': 'tweets'}, inplace=True)
tweets.dtypes
tweets.head(5)

	tweets	label
0	?? no sé si das más pena tú o tu terrible ort...	0
1	Cuando todo se va al infierno, la gente que e...	0
2	- En 1800 nace #NatTurner, el esclavo rebelde ...	0
3	entre muchas otras Muere el 6 de nov 2015 ??...	0
4	era la maldición de muchas familias.	0

Remove the “noise”

tweets['tweets'] = tweets['tweets'].str.replace(r'[^\w\s]', '', regex=True) # Remove all special symbols & characters
tweets['tweets'] = tweets['tweets'].str.replace('_', '', regex=False) # Remove all underscores
tweets['tweets'] = tweets['tweets'].str.replace(r'http[^\s]*', '', regex=True) # Remove all words that start with "http"
tweets['tweets'] = tweets['tweets'].astype(str).str.lower() # Make all words in lower case
tweets.head(10)

	tweets	label
0	no sé si das más pena tú o tu terrible ortog...	0
1	cuando todo se va al infierno la gente que es...	0
2	en 1800 nace natturner el esclavo rebelde del...	0
3	entre muchas otras muere el 6 de nov 2015	0
4	era la maldición de muchas familias	0
5	felizlunes felizlunesatodos linux linuxsecur...	0
6	la rola de lalocadelsenado	0
7	oh dios mío desde cuando tienen eso	0
8	sexistas y totalmente inadecuados para su eda...	0
9	te ves bien crudo estuvo bueno el fin no es ...	0
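
To make the individual steps easier to test in isolation, here is a minimal sketch that applies the same four operations to a single string with the standard-library `re` module. The helper name `clean_tweet` and the final whitespace-collapsing step are assumptions added for illustration. In Python 3, `\w` in a string pattern is Unicode-aware, which is why accented Spanish letters such as "í" survive the first step.

```python
import re


def clean_tweet(text):
    """Apply the same four cleaning steps to a single string."""
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation and symbols
    text = text.replace("_", "")          # \w matches "_", so remove it separately
    text = re.sub(r"http\S*", "", text)   # drop tokens starting with "http"
    text = text.lower()
    return " ".join(text.split())         # extra step: collapse leftover whitespace


print(clean_tweet("https://t.co/I8JcEowZed #FelizLunes ¡Oh Dios mío!, user_name"))
```

Because punctuation is stripped first, a URL collapses into one token that still begins with "http", so the third step removes it as a whole.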

Save the clean dataset

# tweets.to_csv('cleaned_tweets.csv')
# tweets.to_html('cleaned_tweets.html', classes='table table-striped')

Now that we have a clean dataset, we can move on to further analysis and some EDA.

Tokenization

from nltk.tokenize import RegexpTokenizer

regexp = RegexpTokenizer(r'\w+')

tweets['tweets_token'] = tweets['tweets'].apply(regexp.tokenize)
tweets.head(10)

	tweets	tweets_token
0	no sé si das más pena tú o tu terrible ortog...	[no, sé, si, das, más, pena, tú, o, tu, terrib...
1	cuando todo se va al infierno la gente que es...	[cuando, todo, se, va, al, infierno, la, gente...
2	en 1800 nace natturner el esclavo rebelde del...	[en, 1800, nace, natturner, el, esclavo, rebel...
3	entre muchas otras muere el 6 de nov 2015	[entre, muchas, otras, muere, el, 6, de, nov, ...
4	era la maldición de muchas familias	[era, la, maldición, de, muchas, familias]
5	felizlunes felizlunesatodos linux linuxsecur...	[felizlunes, felizlunesatodos, linux, linuxsec...
6	la rola de lalocadelsenado	[la, rola, de, lalocadelsenado]
7	oh dios mío desde cuando tienen eso	[oh, dios, mío, desde, cuando, tienen, eso]
8	sexistas y totalmente inadecuados para su eda...	[sexistas, y, totalmente, inadecuados, para, s...
9	te ves bien crudo estuvo bueno el fin no es ...	[te, ves, bien, crudo, estuvo, bueno, el, fin,...

Remove infrequent words. We first convert tweets_token back into strings, keeping only words that are at least 2 characters long

tweets['tweets_string'] = tweets['tweets_token'].apply(lambda x: ' '.join([item for item in x if len(item)>=2]))
tweets.head(10)

	tweets	tweets_token	tweets_string
0	no sé si das más pena tú o tu terrible ortog...	[no, sé, si, das, más, pena, tú, o, tu, terrib...	no sé si das más pena tú tu terrible ortografí...
1	cuando todo se va al infierno la gente que es...	[cuando, todo, se, va, al, infierno, la, gente...	cuando todo se va al infierno la gente que est...
2	en 1800 nace natturner el esclavo rebelde del...	[en, 1800, nace, natturner, el, esclavo, rebel...	en 1800 nace natturner el esclavo rebelde del ...
3	entre muchas otras muere el 6 de nov 2015	[entre, muchas, otras, muere, el, 6, de, nov, ...	entre muchas otras muere el de nov 2015
4	era la maldición de muchas familias	[era, la, maldición, de, muchas, familias]	era la maldición de muchas familias
5	felizlunes felizlunesatodos linux linuxsecur...	[felizlunes, felizlunesatodos, linux, linuxsec...	felizlunes felizlunesatodos linux linuxsecurit...
6	la rola de lalocadelsenado	[la, rola, de, lalocadelsenado]	la rola de lalocadelsenado
7	oh dios mío desde cuando tienen eso	[oh, dios, mío, desde, cuando, tienen, eso]	oh dios mío desde cuando tienen eso
8	sexistas y totalmente inadecuados para su eda...	[sexistas, y, totalmente, inadecuados, para, s...	sexistas totalmente inadecuados para su edad e...
9	te ves bien crudo estuvo bueno el fin no es ...	[te, ves, bien, crudo, estuvo, bueno, el, fin,...	te ves bien crudo estuvo bueno el fin no es qu...
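
The tokenization and short-word filter can be approximated without NLTK, as in the sketch below. The helper names `tokenize` and `drop_short_tokens` are hypothetical, and `re.findall` is used here as a stand-in for the regexp tokenizer.

```python
import re


def tokenize(text):
    """Return maximal runs of word characters as a list of tokens."""
    return re.findall(r"\w+", text)


def drop_short_tokens(tokens, min_len=2):
    """Keep tokens with at least min_len characters and join them into a string."""
    return " ".join(t for t in tokens if len(t) >= min_len)


tokens = tokenize("entre muchas otras muere el 6 de nov 2015")
print(drop_short_tokens(tokens))
```

Note that the length filter also removes single digits such as "6", as row 3 of the table above shows.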

Join all tweets into a single string containing every word

all_words = ' '.join([word for word in tweets['tweets_string']])

Tokenize all_words

tokenized_words = nltk.tokenize.word_tokenize(all_words)

Create a frequency distribution which records the number of times each word has occurred:

from nltk.probability import FreqDist
fdist = FreqDist(tokenized_words)
fdist

FreqDist({'que': 5199, 'la': 4773, 'de': 4715, 'no': 3236, 'el': 2861, 'es': 2462, 'en': 1967, 'una': 1720, 'las': 1680, 'se': 1599, ...})

Now we can use our fdist dictionary to drop words that occur fewer than a certain number of times (we usually use a threshold of 3 or 4).

tweets['tweets_string_fdist'] = tweets['tweets_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 4 ]))
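
The same frequency filter can be sketched with `collections.Counter` from the standard library. The function name `filter_rare_words` and the toy documents are assumptions for illustration; the threshold is lowered to 2 only so the small example has something to drop.

```python
from collections import Counter


def filter_rare_words(docs, min_count=4):
    """Drop every token whose corpus-wide count is below min_count."""
    counts = Counter(token for doc in docs for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


docs = [["la", "rola", "de"], ["la", "gente"], ["de", "la", "gente"]]
print(filter_rare_words(docs, min_count=2))
```

Counts are taken over the whole corpus rather than per tweet, so a word that appears once in each of several tweets is still kept.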

Lemmatization

nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
wordnet_lem = WordNetLemmatizer()
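# Note: apply() passes each whole tweet string to lemmatize(), which expects a single (English) word
tweets['tweets_string_lem'] = tweets['tweets_string_fdist'].apply(wordnet_lem.lemmatize)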

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...

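Check whether lemmatization changed anything; if the two columns are nearly identical, the lemmatization step can be skipped. Keep in mind that WordNetLemmatizer targets English and, as called here, receives each whole tweet string rather than individual tokens, so few changes are to be expected on Spanish text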

tweets['is_equal']= (tweets['tweets_string_fdist']==tweets['tweets_string_lem'])
tweets.is_equal.value_counts()

True     10025
False        4
Name: is_equal, dtype: int64
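
One plausible reason so few rows differ is that a word-level lemmatizer is being handed an entire tweet at once. The sketch below illustrates the difference with a toy lookup table; `TOY_LEMMAS`, `toy_lemmatize`, and `lemmatize_text` are hypothetical stand-ins and do not reproduce WordNet's behaviour.

```python
# Hypothetical toy lookup standing in for a real word-level lemmatizer.
TOY_LEMMAS = {"las": "la", "familias": "familia", "muchas": "mucha"}


def toy_lemmatize(word):
    """Return the lemma for a single word, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


def lemmatize_text(text):
    """Apply the word-level lemmatizer to each whitespace-separated token."""
    return " ".join(toy_lemmatize(tok) for tok in text.split())


text = "era la maldición de muchas familias"
print(toy_lemmatize(text))    # the whole string is not a known word, so it comes back unchanged
print(lemmatize_text(text))   # token by token, the known words are mapped
```

Under this reading, only tweets that consist of a single word the lemmatizer recognises would change, which would be consistent with a handful of differing rows.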

Let’s create a word cloud to see which words are the most frequent

all_words_lem = ' '.join([word for word in tweets['tweets_string_lem']])

%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600, 
                     height=400, 
                     random_state=2, 
                     max_font_size=100).generate(all_words_lem)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');

Frequency distributions

nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

words = nltk.word_tokenize(all_words_lem)
fd = FreqDist(words)

We can list the top 15 most frequent words

fd.most_common(15)
fd.tabulate(15)

 que   la   de   no   el   es   en  una  las   se  los   me   un   lo  por 
5199 4776 4715 3236 2861 2462 1967 1720 1677 1599 1488 1357 1322 1258 1167

Now we can make a plot of the most frequent words

top_30 = fd.most_common(30)

# Create a pandas Series to make plotting easier
# (named top_30_series so it does not overwrite the earlier FreqDist called fdist)
top_30_series = pd.Series(dict(top_30))
import seaborn as sns
sns.set_theme(style="ticks")

tweets_top30_word_barplot = sns.barplot(y=top_30_series.index, x=top_30_series.values, color='pink')