Data Exploration

ENDIREH Data

Introduction

For the ENDIREH data analysis, INEGI provides a violence metric estimator, calculated from reported experiences of violence. That estimation was obtained in previous tabs, and this exploration visualizes the results.

Unfortunately, those estimations are per state and not per person, so in previous tabs a per-person label for emotional violence was also created. This page therefore also explores the distribution of the data in relation to those labels.

ENDIREH exploration

Throughout the data cleaning process, the ENDIREH experienced-violence metric was obtained for “General”, “Emotional”, and “Economic” violence. Exploration starts by looking at the statistical moments of each type of violence:

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
import seaborn as sns
matplotlib.style.use('seaborn-pastel')

violence_df = pd.read_csv("data/violence_clean.csv")

# drop the index column left over from the CSV export
violence_df = violence_df.drop(columns=["Unnamed: 0"])

# drop the national aggregate so only the 32 states remain
violence_df = violence_df[violence_df.states != "Estados Unidos Mexicanos"]
# keep the state names for labeling plots later
states = violence_df["states"]

# one row per state, one column per violence type
v_gen_emot = pd.DataFrame(
    data=[violence_df["est_general"], violence_df["est_emot"], violence_df["est_econ"]],
    index=["general", "emotional", "economic"]).T

# first four moments of each violence type
df_metrics = pd.DataFrame(
    data=[v_gen_emot.mean(), v_gen_emot.std(), v_gen_emot.skew(), v_gen_emot.kurtosis()],
    index=["Mean", "Standard Deviation", "Skew", "Kurtosis"]).T
df_metrics.head()
                Mean  Standard Deviation      Skew  Kurtosis
general    68.756861            5.367687 -1.658182  5.595716
emotional  50.772941            5.026683 -1.634302  4.500386
economic   27.505621            3.313554 -1.437172  3.894335

Mean: the average experienced violence among states, separated by the three violence types. From this value we can see that the economic violence estimates are smaller than those of the other two types.

Standard deviation: how closely the values are spread about the mean. Relative to the scale of the estimates, a value between 3 and 5 can be considered small.

Skewness: describes the asymmetry of the distribution. The results were negative, which means the distributions for all violence types are left-skewed.

Kurtosis: measures the peakedness or flatness of a distribution. The results were between roughly 4 and 5.6 for all violence types; a positive kurtosis indicates a peaked distribution with heavier tails, meaning we may have a small number of outliers.
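
As a sanity check, these moments can be recomputed with scipy; a minimal sketch, assuming the v_gen_emot frame built above (pandas uses the bias-corrected sample estimators, which correspond to bias=False in scipy, and reports excess kurtosis):

Code
from scipy import stats

# pandas .skew() / .kurtosis() match scipy's bias-corrected variants;
# kurtosis(fisher=True) returns excess kurtosis (0 for a normal distribution)
for col in ["general", "emotional", "economic"]:
    x = v_gen_emot[col].dropna()
    print(col,
          round(float(stats.skew(x, bias=False)), 6),
          round(float(stats.kurtosis(x, fisher=True, bias=False)), 6))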

The described values characterize the distribution of our data; from them we can expect roughly bell-shaped but left-skewed distributions with similar spreads. The next visualizations will help to observe the differences between the three distributions.

Code
f, ax = plt.subplots(dpi=120)
sns.histplot(data=v_gen_emot["general"], kde=True, linewidth=0.5)
ax.set_title("General Violence", fontsize=16)
ax.set_ylabel("Frequency of violence estimation", fontsize=16)
ax.set_xlabel("Violence estimation units", fontsize=16)

This plot shows the frequency of the violence estimates; it describes how most states tend to have a general violence estimation around 70 units.

Code
f, ax = plt.subplots(dpi=120)
sns.histplot(data=v_gen_emot, kde=True, linewidth=0.5)
ax.set_title("Violence per type", fontsize=16)
ax.set_ylabel("Frequency of violence estimation", fontsize=16)
ax.set_xlabel("Violence estimation units", fontsize=16)

Looking at the three distributions together, the estimates for emotional and general (overall) violence sit in a higher range and have similar shapes, while the economic violence estimates cover a smaller range, with a high frequency of values around 30 units.

Since economic violence has a broader definition than the one analyzed in this work, and its behavior is not similar to the types of violence being studied, it will be removed.
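
A minimal sketch of that removal, assuming the frames built above (the exact step may live in the original cleaning scripts):

Code
# drop the economic violence estimates from both working frames
v_gen_emot = v_gen_emot.drop(columns=["economic"])
violence_df = violence_df.drop(columns=["est_econ"])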

Code
f, ax = plt.subplots(dpi=120)

sns.boxplot(data=[violence_df["est_emot"], violence_df["est_general"]], color='orange')
ax.set_title("Emotional and General Violence estimators", fontsize=16)
ax.set_xlabel("Violence type", fontsize=14)
ax.set_ylabel("Violence estimator", fontsize=16)
ax.set_xticklabels(["Emotional", "General"])

sns.despine()

Focusing on the emotional and general violence distributions, as seen in the plot above, the emotional violence estimates center around 50 units, with outliers near the 40-unit value. Meanwhile, the general violence estimates center around 70 units, with more spread-out outliers.
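
Those outliers can also be listed explicitly with the standard 1.5 × IQR rule; a minimal sketch, assuming violence_df as built above:

Code
# flag states whose estimate falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
for col in ["est_general", "est_emot"]:
    q1, q3 = violence_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (violence_df[col] < low) | (violence_df[col] > high)
    print(col, violence_df.loc[mask, "states"].tolist())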

The next visualization will help us understand those outliers state by state.

Code
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
matplotlib.style.use('ggplot')

f, (ax1, ax2) = plt.subplots(ncols=2, dpi=150)

sns.barplot(x='states',
            y='est_general',
            data=violence_df,
            order=violence_df.sort_values('est_general').states,
            color = 'orange',
            ax=ax1,)
ax1.set(ylim=(40, 80))
ax1.set_ylabel("Violence estimator", fontsize=12)
ax1.set_xlabel("Mexico States", fontsize=12)
ax1.set_title("General violence per state", fontsize=12)

sns.barplot(x='states',
            y='est_emot',
            data=violence_df,
            order=violence_df.sort_values('est_emot').states,
            color = 'violet',
            ax=ax2)
ax2.set(ylim=(40, 80))
ax2.set_ylabel("", fontsize=16)
ax2.set_xlabel("Mexico States", fontsize=12)
ax2.set_title("Emotional violence per state", fontsize=12)

# tick labels must follow the same sort order used for each bar plot
ax1.set_xticklabels(violence_df.sort_values('est_general').states, rotation=90, fontsize=5)
ax2.set_xticklabels(violence_df.sort_values('est_emot').states, rotation=90, fontsize=5)

The bar plot shows the violence unit value per state, by violence type. It is interesting how similarly general and emotional violence behave across states; based on this we can develop a hypothesis about whether the two are related. In a real-life example, emotional violence is expected to occur once physical violence occurs.
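
A simple first check of that hypothesis is the correlation between the two estimators across states; a minimal sketch, assuming violence_df as above:

Code
# linear and rank correlation between general and emotional violence estimates
print(violence_df["est_general"].corr(violence_df["est_emot"]))                     # Pearson
print(violence_df["est_general"].corr(violence_df["est_emot"], method="spearman"))  # Spearman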

Finally, we will visualize the balance of the labeled set that was previously created:

Code
import pandas as pd
df_endireh = pd.read_csv('data/endireh_ev.csv')
print(df_endireh.shape)
df_endireh.head()
(73500, 354)
P12_1 P12_2 P12_3 P12_4 P12_5 P12_6 P12_7 P12_8 P12_9 P12_10 ... P4_13_4 P4_13_5 P4_13_6 P4_13_7 FAC_VIV_y.1 FAC_MUJ_y.1 ESTRATO_y.1 UPM_DIS_y.1 EST_DIS_y.1 label
0 2 2 3 1 1 1 1 2.0 2.0 1.0 ... 1.0 NaN NaN NaN 113 113 4 1 3 0.0
1 2 1 2 3 1 3 3 8.0 8.0 3.0 ... 1.0 NaN NaN NaN 113 113 4 1 3 1.0
2 1 1 3 3 3 3 3 8.0 8.0 3.0 ... NaN NaN NaN NaN 113 227 4 1 3 0.0
3 1 1 3 3 3 1 3 1.0 2.0 4.0 ... 5.0 NaN NaN NaN 113 113 4 1 3 0.0
4 2 1 1 3 3 3 1 3.0 8.0 1.0 ... 2.0 NaN NaN NaN 78 155 2 2 1 0.0

5 rows × 354 columns

As we can see, the ENDIREH survey has 73,500 answers and 354 questions, where each question is represented by a number and a sub-index that encodes relations between questions.
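
To get a feel for that structure without reading every column, the question names can be grouped by their numeric prefix; a minimal sketch, assuming df_endireh as loaded above:

Code
# count how many sub-questions share each prefix (e.g. P12 covers P12_1, P12_2, ...)
prefixes = df_endireh.columns.str.split('_').str[0]
print(prefixes.value_counts().head(10))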

It would be complex to go through all the features manually; thus, the purpose of the next tabs is to use different models to understand how the features statistically relate to each other.

Code
def check_balance(df):
    # counts per label: 0 = no emotional violence, 1 = emotional violence
    return df.label.value_counts()
balance = check_balance(df_endireh)

# map each class count to a readable name, following the label coding above
bal_plot = {"no emotional violence": balance[0],
            "emotional violence": balance[1]}
keys = list(bal_plot.keys())
vals = [bal_plot[k] for k in keys]

f,ax = plt.subplots(dpi=150,)
sns.barplot(x=keys, y=vals,
            color = 'orange')
ax.set_ylabel("Number of cases", fontsize=12)
ax.set_xlabel("Classes", fontsize=12)
ax.set_title("Emotional violence", fontsize=12)

Our check_balance function helps visualize how many data points are contained in each class. The two classes (0: no emotional violence, 1: emotional violence) are unbalanced, with one class containing noticeably more women than the other.
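
The degree of imbalance can also be quantified directly, which is useful later when choosing class weights or resampling strategies; a minimal sketch, assuming df_endireh as above:

Code
# proportion of each class instead of raw counts
print(df_endireh.label.value_counts(normalize=True))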

Twitter Data

A small exploration step was done before jumping into the cleaning portion; manually, it was easy to notice that the majority of tweets had noise, since many Latin characters such as ñ, á, or ó were not decoded correctly. Thus, an extra step was added to the cleaning phase to remove those.
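
A minimal sketch of such a step, assuming the raw tweets are plain Python strings (the actual cleaning lives in the project's scripts; the function name here is hypothetical):

Code
import unicodedata

def strip_accents(text):
    # decompose accented characters (NFKD), then keep only printable ASCII,
    # which drops combining marks and mis-decoded bytes
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if c.isascii() and c.isprintable())

print(strip_accents("consulta popular mañana"))  # -> consulta popular manana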

Afterwards, the tweets_exploration.py and tweets_exploration.r scripts were used to obtain the frequency of the words after cleaning; finally, both are able to generate a wordcloud.

Code
words <- sort(rowSums(tdm), decreasing = TRUE) # count all occurrences of each word and group them
df <- data.frame(word = names(words), freq = words) # convert it to a dataframe
head(df) # visualize!
set.seed(1234) # for reproducibility
wcloud <- wordcloud2(df,   # generate word cloud
                     size = 1.5,
                     color= 'random-dark', # set colors
                     #shape = 'pentagon',
                     rotateRatio = 0) #horizontal
wcloud
"\nwords <- sort(rowSums(tdm), decreasing = TRUE) # count all occurrences of each word and group them\ndf <- data.frame(word = names(words), freq = words) # convert it to a dataframe\nhead(df) # visualize!\nset.seed(1234) # for reproducibility\nwcloud <- wordcloud2(df,   # generate word cloud\n                     size = 1.5,\n                     color= 'random-dark', # set colors\n                     #shape = 'pentagon',\n                     rotateRatio = 0) #horizontal\nwcloud\n"

The wordcloud provides insight into the words of the dataset and their frequency, but it also helps to visualize other context. For example, while these data were being captured a political movement occurred in México, so the words “consulta popular” appear in the wordcloud. Language can provide representation not only of sentiment but also of the understanding of an important event.

After looking at general metrics, an exploration step was performed through multiple visualizations. The most important thing to determine through them is whether any extra steps need to be performed in the “cleaning” phase once again.