Association rule mining (ARM)

Introduction

The objective of using the unsupervised ARM task in this page is to perform an exploratory analysis of the set of tweets labeled as misogynistic or not, in order to understand the co-occurrence relations and patterns among the texts. Although ARM is an unsupervised task, the labels will be taken advantage of to split the data and compare the two resulting sets.

Model

Association rule mining is an unsupervised, rule-based machine learning method. A rule shows how frequently an itemset occurs in a transaction, and the objective is to discover statistically relevant relationships between variables in large sets. To anticipate the occurrence of an item based on the occurrences of other items in the training data, the method finds rules that hold for a particular group of transactions in the set. Since market basket analysis was the initial application of association mining, association rule mining is occasionally referred to as “market basket analysis.”

ARM works with the “Apriori” algorithm, where rules are calculated from itemsets made up of two or more items. The algorithm first identifies the most frequent individual items in the set and then extends them level by level to obtain larger and larger itemsets. Extending the relations too far can be a drawback, since it can lead to a large number of rules that make the model computationally inefficient.
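
As a rough illustration, the level-wise idea can be sketched in a few lines of plain Python on a hypothetical toy transaction list (this is only a sketch of the principle, not the apyori implementation used later on this page): the frequent individual items are found first, and only those are combined into candidate pairs, which are pruned by support again.

Code
from itertools import combinations

# hypothetical toy transactions, only to illustrate the level-wise expansion
toy_transactions = [
    {"mujer", "debe", "casa"},
    {"mujer", "debe"},
    {"mujer", "hombre"},
    {"hombre", "casa"},
]
n = len(toy_transactions)
min_support = 0.5  # keep itemsets present in at least half of the transactions

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in toy_transactions) / n

# level 1: frequent individual items
items = {i for t in toy_transactions for i in t}
frequent_1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

# level 2: extend only the frequent items into candidate pairs and prune again
candidates_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = {c for c in candidates_2 if support(c) >= min_support}

print([set(s) for s in frequent_1])  # all four items are frequent here
print([set(s) for s in frequent_2])  # only {'mujer', 'debe'} survives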

Hyperparameter

The threshold or minimum support (min-support) can be used to reduce the number of possible rules, keeping the most important ones and “removing” the least important, which improves the model’s efficiency.
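
To illustrate the effect, a minimal sketch using the apyori package (the same library imported later on this page) on the hypothetical toy transactions above: raising min_support prunes the weaker itemsets, so fewer rules come back.

Code
from apyori import apriori

# hypothetical toy transactions, only to show how min_support prunes rules
toy_transactions = [
    ["mujer", "debe", "casa"],
    ["mujer", "debe"],
    ["mujer", "hombre"],
    ["hombre", "casa"],
]

def n_rules(min_support):
    records = apriori(toy_transactions, min_support=min_support, min_confidence=0.3)
    # count antecedent -> consequent rules, skipping the empty-antecedent entries
    return sum(len(stat.items_base) > 0
               for r in records for stat in r.ordered_statistics)

print(n_rules(0.25))  # more rules with a low threshold
print(n_rules(0.5))   # fewer rules once the threshold is raised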

Evaluation Metrics

To evaluate how frequently an itemset or relation containing A and B occurs, and how strict that relation is compared with others, three evaluation metrics will be taken into consideration (a short worked example follows the list):

  • Support: How often do items A and B occur together relative to all transactions? It measures how common the occurrence is, on a scale from 0 to 1.
  • Confidence: How often do items in A and items in B occur together, relative to the transactions that contain A? It measures how statistically strict a rule is, on a scale from 0 to 1.
  • Lift: The ratio of the observed support to the support expected if A and B were independent. It measures whether the items are independent, negatively related, or positively related.
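
As a worked example of the three metrics, the following minimal sketch (plain Python, on the same hypothetical toy transactions as above) computes support, confidence, and lift for a rule A → B:

Code
# hypothetical toy transactions; the rule considered is A = {"mujer"} -> B = {"debe"}
toy_transactions = [
    {"mujer", "debe", "casa"},
    {"mujer", "debe"},
    {"mujer", "hombre"},
    {"hombre", "casa"},
]
n = len(toy_transactions)
A, B = {"mujer"}, {"debe"}

supp_A  = sum(A <= t for t in toy_transactions) / n        # P(A)   = 0.75
supp_B  = sum(B <= t for t in toy_transactions) / n        # P(B)   = 0.50
supp_AB = sum((A | B) <= t for t in toy_transactions) / n  # P(A,B) = 0.50 -> support

confidence = supp_AB / supp_A          # P(B | A) = 0.667
lift = supp_AB / (supp_A * supp_B)     # 1.33 > 1, so A and B are positively related

print(supp_AB, confidence, lift)
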
Code
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyori import apriori
import networkx as nx 
#import download

nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\valer\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
True

Reading and cleaning the tweets. Three sets will be generated: one with basic cleaning techniques, a second one without stop words, and a third one with stemming applied.

Code
from nltk.tokenize import TweetTokenizer
import pandas as pd
import numpy as np
import re
from nltk.stem.snowball import SnowballStemmer
Code
#obtaining tweets as df
df = pd.read_csv("data/clean_set.csv")
clean_text = []

#(pattern, replacement) pairs applied in order to each tweet
replace = [
        (r"<a[^>]*>(.*?)</a>", " url"),
        (r"@[a-zA-Z0-9_]{0,15}", " user"),
        (r"_[a-zA-Z0-9_]{0,15}", " user"),
        (r"#\w*[a-zA-Z]\w*", " hashtag"),
        (r"(?<=\d),(?=\d)", ""),
        (r"\d+", "number"),
        (r"[\t\n\r\*\.\@\,\-\/]", " "),
        (r"\s+", " "),
        (r'[^\w\s]', ''),
        (r'/(.)(?=.*\1)/g', ""),
        (r'http[^\s]*',""),
        (r'(.*)\bt\b(.*)',""),
        (r'(.*)\bco\b(.*)',"")
    ]

#df["text"] is the column where each row is one tweet as a string
for text in df["text"]:
        for rp in replace:
                text = re.sub(rp[0],rp[1], text)
        clean_text.append(text.lower())

#clean_text is now a list of cleaned tweet strings
stemmer = SnowballStemmer("spanish")
tk = TweetTokenizer()
stops = set(stopwords.words('spanish'))

#transactions: list of token lists, one per tweet
transactions = [tk.tokenize(x) for x in clean_text]
transactions_sw_st= []
transactions_sw= []
for text in transactions:
        transactions_sw_st.append([stemmer.stem(x) for x in text if x not in stops])
        transactions_sw.append([x for x in text if x not in stops])

Functions to plot the network and obtain the metrics

Code
def reformat_results(results):
  # flatten apyori's output: for each record, index 1 holds the itemset support
  # and index 2 holds the ordered statistics (item sets, confidence, lift)
  keep=[]
  for i in range(0,len(results)):
    for j in range(0,len(list(results[i]))):
      if (j>1):
        # supp was already stored on the j==1 pass of this loop
        for k in range(0,len(list(results[i][j]))):
          if(len(results[i][j][k][0])!=0):  # skip rules with an empty antecedent
            rhs=list(results[i][j][k][0])
            lhs=list(results[i][j][k][1])
            conf=float(results[i][j][k][2])
            lift=float(results[i][j][k][3])
            keep.append([rhs,lhs,supp,conf,supp*conf,lift])

      if(j==1):
        supp=results[i][j]

  return pd.DataFrame(keep, columns=['rhs','lhs','supp','conf','supp x conf','lift'])
  
def convert_to_network(df):
    #BUILD GRAPH
    G = nx.DiGraph()  # DIRECTED
    for row in df.iterrows():
        # for column in df.columns:
        lhs="_".join(row[1][0])
        rhs="_".join(row[1][1])
        conf=row[1][3]; #print(conf)
        if(lhs not in G.nodes): 
            G.add_node(lhs)
        if(rhs not in G.nodes): 
            G.add_node(rhs)

        edge=(lhs,rhs)
        if edge not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)

    # print(G.nodes)
    # print(G.edges)
    return G

def plot_network(G,title):
    #SPECIFIY X-Y POSITIONS FOR PLOTTING
    pos=nx.random_layout(G)

    #GENERATE PLOT
    fig, ax = plt.subplots(dpi=120,)
    fig.set_size_inches(15, 15)

    #assign colors based on attributes
    weights_e   = [G[u][v]['weight'] for u,v in G.edges()]

    #SAMPLE CMAP FOR COLORS 
    cmap=plt.cm.get_cmap('PuRd')
    colors_e    = [cmap(G[u][v]['weight']*10) for u,v in G.edges()]

    #PLOT
    nx.draw(
    G,
    edgecolors="white",
    edge_color=colors_e,
    node_size=2000,
    linewidths=2,
    font_size=8,
    node_color="pink",
    font_color="black",
    width=weights_e,
    with_labels=True,
    pos=pos,
    ax=ax
    )
    ax.set_title(title, fontsize=20)
    plt.show()

Experimenting with different support values. The larger the minimum support, the smaller the number of connections obtained.

Code
results = list(apriori(transactions,min_support = 0.1,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in tweets')
print(pd_results[0:5])

     rhs    lhs      supp      conf  supp x conf      lift
0    [a]   [la]  0.102303  0.420147     0.042982  1.284260
1   [la]    [a]  0.102303  0.312710     0.031991  1.284260
2    [a]  [que]  0.109582  0.450041     0.049316  1.270681
3  [que]    [a]  0.109582  0.309403     0.033905  1.270681
4   [de]   [el]  0.102902  0.323714     0.033311  1.514945

All of the connections generated are Spanish stop words, hence the importance of removing them.

Code
results = list(apriori(transactions_sw,min_support = 0.001,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in tweets')
print(pd_results[0:10])

           rhs          lhs      supp      conf  supp x conf       lift
0         [ah]         [si]  0.001396  0.229508     0.000320   2.977668
1      [ahora]         [ms]  0.001097  0.102804     0.000113   1.397044
2      [ahora]         [si]  0.001496  0.140187     0.000210   1.818803
3  [alcohlica]     [pinche]  0.002094  0.875000     0.001832  51.019622
4     [pinche]  [alcohlica]  0.002094  0.122093     0.000256  51.019622
5    [alguien]         [si]  0.001197  0.173913     0.000208   2.256370
6       [amor]         [ms]  0.001695  0.108280     0.000184   1.471467
7       [amor]      [mujer]  0.001795  0.114650     0.000206   1.063665
8       [amor]         [si]  0.001396  0.089172     0.000124   1.156929
9       [amor]       [vida]  0.001097  0.070064     0.000077   2.631718

After removing the stop words, more meaningful words start to show. However, such a small minimum support generates too many connections, which are hard to appreciate in the plot, so the threshold will be increased.

Code
results = list(apriori(transactions_sw,min_support = 0.004,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in tweets')
print(pd_results[0:5])

        rhs       lhs      supp      conf  supp x conf       lift
0     [aos]  [number]  0.004387  0.611111     0.002681  34.431648
1  [number]     [aos]  0.004387  0.247191     0.001084  34.431648
2   [gusta]      [si]  0.004487  0.140187     0.000629   1.818803
3      [si]   [gusta]  0.004487  0.058215     0.000261   1.818803
4   [gusta]    [user]  0.004587  0.143302     0.000657   2.245590

Now we can see how the connections in the dataset are related to gendered nouns such as “hombre” (man) and “mujer” (woman) and their plurals. The stemming set will be used next to try to group plurals and singulars together.

The support and confidence values also dropped, since the stop words co-occur far more frequently than the remaining content words.

Code
results = list(apriori(transactions_sw_st,min_support = 0.004,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in tweets')
print(pd_results[0:5])

      rhs      lhs      supp      conf  supp x conf       lift
0   [aos]   [numb]  0.004387  0.611111     0.002681  34.431648
1  [numb]    [aos]  0.004387  0.247191     0.001084  34.431648
2  [bien]     [si]  0.004088  0.132258     0.000541   1.715933
3    [si]   [bien]  0.004088  0.053040     0.000217   1.715933
4   [cag]  [pinch]  0.005185  0.490566     0.002544  17.384759

Stemming did not perform as well, since the words are “truncated”: singular nouns such as “mujer” were reduced to “muj” while the plural “mujeres” became “mujer,” so singular and plural forms still end up as different tokens. A lemmatizing technique should solve this problem; unfortunately, different works have shown that current packages do not perform well on Spanish texts.
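
As a possible direction, a lemmatization pass could be tried instead of stemming. Below is a minimal sketch, assuming the spaCy library and its Spanish model es_core_news_sm are installed (neither is used elsewhere on this page, so this is only an illustration, not part of the results above).

Code
# hedged sketch: lemmatize instead of stemming, assuming spaCy is available
# (pip install spacy; python -m spacy download es_core_news_sm)
import spacy

nlp = spacy.load("es_core_news_sm", disable=["parser", "ner"])  # lemmas only

def lemmatize(tokens, stops):
    doc = nlp(" ".join(tokens))
    return [tok.lemma_.lower() for tok in doc if tok.text not in stops]

# e.g. "mujeres" should map back to "mujer" instead of being truncated,
# so singular and plural forms would fall into the same itemset:
# transactions_lemma = [lemmatize(t, stops) for t in transactions]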

Misogynistic vs non-misogynistic

Finally, a comparison between the networks of misogynistic and non-misogynistic tweets will be performed.

Code
#separating data by label
no_miso_indx = df.index[df['label'] == 0].tolist()
miso_indx = df.index[df['label'] == 1].tolist()

transactions_miso= [transactions_sw[x] for x in miso_indx]
transactions_no_miso= [transactions_sw[x] for x in no_miso_indx]
Code
results = list(apriori(transactions_miso,min_support = 0.004,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in misogynistic tweets')
print(pd_results[0:10])
miso_results=pd_results

           rhs          lhs      supp      conf  supp x conf       lift
0  [alcohlica]     [pinche]  0.004190  0.954545     0.003999  41.601581
1     [pinche]  [alcohlica]  0.004190  0.182609     0.000765  41.601581
2       [bien]      [mujer]  0.004190  0.147887     0.000620   0.835638
3       [casa]      [mujer]  0.004988  0.242718     0.001211   1.371482
4         [da]      [mujer]  0.004988  0.384615     0.001918   2.173272
5       [debe]      [mujer]  0.006385  0.432432     0.002761   2.443463
6      [mujer]       [debe]  0.006385  0.036077     0.000230   2.443463
7      [deben]    [mujeres]  0.004190  0.777778     0.003259   7.554694
8    [mujeres]      [deben]  0.004190  0.040698     0.000171   7.554694
9        [den]      [gusta]  0.004190  0.700000     0.002933  28.293548
Code
results = list(apriori(transactions_no_miso,min_support = 0.004,
min_confidence=.03,min_length=1,max_length=5))

pd_results=reformat_results(results)
G=convert_to_network(pd_results)
plot_network(G, 'Network of words in non-misogynistic tweets')
print(pd_results[0:10])
no_miso_results=pd_results

         rhs        lhs      supp      conf  supp x conf        lift
0      [aos]   [number]  0.005980  0.600000     0.003588   23.155385
1   [number]      [aos]  0.005980  0.230769     0.001380   23.155385
2  [atencin]   [llamar]  0.004186  0.600000     0.002511  120.408000
3   [llamar]  [atencin]  0.004186  0.840000     0.003516  120.408000
4    [cagan]  [pinches]  0.006777  0.723404     0.004902   49.044853
5  [pinches]    [cagan]  0.006777  0.459459     0.003114   49.044853
6    [gusta]       [si]  0.005780  0.147208     0.000851    1.884039
7       [si]    [gusta]  0.005780  0.073980     0.000428    1.884039
8    [gusta]     [user]  0.006976  0.177665     0.001239    1.617686
9     [user]    [gusta]  0.006976  0.063521     0.000443    1.617686

Results

A difference in connections is observed between the networks. In general, the set contains tweets where the words for man and woman mostly appear together, which might be related to comparisons, either positive or negative, between the two genders.

On the other hand, misogynistic and non-misogynistic tweets show different types of connections. With the same minimum support of 0.004, both produced a similar number of relations, around 50.

Code
print(len(miso_results))
print(len(no_miso_results))
51
48
Code
miso_results = miso_results.sort_values(by=['supp'], ascending=False)
print(miso_results[0:10])
          rhs        lhs      supp      conf  supp x conf       lift
14   [hombre]    [mujer]  0.019753  0.512953     0.010132   2.898447
15    [mujer]   [hombre]  0.019753  0.111612     0.002205   2.898447
34    [mujer]      [ser]  0.019154  0.108230     0.002073   1.845064
35      [ser]    [mujer]  0.019154  0.326531     0.006254   1.845064
18  [mujeres]  [hombres]  0.014964  0.145349     0.002175   4.186715
17  [hombres]  [mujeres]  0.014964  0.431034     0.006450   4.186715
36    [mujer]       [si]  0.014565  0.082300     0.001199   1.082643
37       [si]    [mujer]  0.014565  0.191601     0.002791   1.082643
46    [vieja]   [pinche]  0.009777  0.526882     0.005151  22.962880
45   [pinche]    [vieja]  0.009777  0.426087     0.004166  22.962880
Code
no_miso_results = no_miso_results.sort_values(by=['supp'], ascending=False)
print(no_miso_results[0:10])
         rhs       lhs      supp      conf  supp x conf      lift
47    [user]      [si]  0.011162  0.101633     0.001134  1.300752
46      [si]    [user]  0.011162  0.142857     0.001595  1.300752
15   [mujer]  [hombre]  0.007574  0.195876     0.001484  4.549590
14  [hombre]   [mujer]  0.007574  0.175926     0.001333  4.549590
27     [ser]      [ms]  0.007375  0.117460     0.000866  1.220080
26      [ms]     [ser]  0.007375  0.076605     0.000565  1.220080
30      [ms]    [user]  0.007176  0.074534     0.000535  0.678653
31    [user]      [ms]  0.007176  0.065336     0.000469  0.678653
37    [user]  [number]  0.006976  0.063521     0.000443  2.451417
29      [si]      [ms]  0.006976  0.089286     0.000623  0.927425
Code
import seaborn as sns
rels = (len(miso_results["rhs"]))
relation = [str(miso_results["rhs"][x])+","+str(miso_results["lhs"][x]) for x in range(0,rels)]
miso_results["relation"] = [x.replace("'","").replace("[","").replace("]","") for x in relation]
miso_results = miso_results.rename(columns={'supp': 'support'})

sns.set_theme(style="whitegrid", palette=sns.color_palette("hls", 8))
fig, ax = plt.subplots(dpi=120,)
ax.set_title("rules of misogynistic tweets", fontsize=18)
ax.set_ylabel("support",fontsize=14)
plt.xticks(rotation = 90)
sns.barplot(data=miso_results[0:10], x="relation", y="support")

rels = (len(no_miso_results["rhs"]))
relation = [str(no_miso_results["rhs"][x])+","+str(no_miso_results["lhs"][x]) for x in range(0,rels)]
no_miso_results["relation"] = [x.replace("'","").replace("[","").replace("]","") for x in relation]
no_miso_results = no_miso_results.rename(columns={'supp': 'support'})

sns.set_theme(style="whitegrid", palette=sns.color_palette("hls", 8))
fig, ax = plt.subplots(dpi=120,)
ax.set_title("rules of non-misogynistic tweets", fontsize=18)
ax.set_ylabel("support",fontsize=14)
plt.xticks(rotation = 90)
sns.barplot(data=no_miso_results[0:10], x="relation", y="support")

After looking at the relations with the highest support, there is a clear difference in the words used in misogynistic and non-misogynistic tweets.

A notable relation with high support for misogynistic tweets was “deben” and “mujeres,” which translates to “women must,” so there is a pattern of stating what women should do in misogynistic tweets. Another strong relation was “alcohlica” and “pinche”; it is essential to remark that alcoholic is used here as a noun with a feminine “-a” ending, meaning it refers to an alcoholic woman, and the word “pinche” is a slur used as an adjective to emphasize the noun. With this information, the assumption is that misogynistic tweets frequently refer to alcoholic women. Finally, a significant relation not shown in the graph is “voy” and “matar,” meaning “will” and “kill”; tweets with such a combination can be a threat or psychological violence directed at a woman, which demonstrates the importance of this work.

On the other hand, the non-misogynistic tweets also have relations that might sound misogynistic, such as “pinches” and “cagan,” an adjective and a verb used together as a vulgar expression of anger.

The ARM model explores a dataset by measuring how frequently pairs of words appear together and how strict that co-occurrence is. It is also important to remember that the appearance of a word pair does not mean a tweet can be classified as misogynistic; for example, if a tweet states, “he said I will kill you,” the words “will” and “kill” appear, but the tweet itself is neither violent nor misogynistic, although it talks about violence. Therefore, techniques such as ARM should be accompanied by a careful interpretation, and prediction should be performed with a classifier that can capture higher-order relations.