Conclusion
Emotional violence
Emotional violence against women is an evident problem in Mexico. The figures in the introduction show the scale of the problem and its increase in recent years, and the problem is also visible on communication platforms such as Twitter and in surveys such as ENDIREH. The objective of this work was therefore to perform an exploratory analysis of the ENDIREH survey, accompanied by an analysis of a set of tweets labeled as misogynistic or not. This analysis addresses a series of questions about the relationship between emotional and physical violence, and about whether there are relationships among the characteristics of Mexican women who have suffered psychological violence. Examples of such characteristics are age, sexual orientation, household dynamics, their partner’s characteristics, and the aggressor’s profile, including characteristics such as income.
The analysis used several data techniques to explore relations, frequencies, and patterns and to draw inferences. Given that no existing work focuses specifically on emotional violence towards Mexican women, the aim of this work was not to test a specific hypothesis but to explore the data, build a better understanding of emotional violence, and, finally, raise further questions.
Data
This work started by collecting the datasets. Both were easy to retrieve: the ENDIREH set from the INEGI portal and the tweets through the Twitter API. Next came the cleaning and exploration phase, which proved difficult for both datasets, though for different reasons. Because the ENDIREH set is well designed and extensively documented, it did not need an extensive cleaning process. However, it was complex to transform and explore, since it has around 70,000 entries. The Twitter dataset was smaller, with 10,000 tweets manually labeled as misogynistic or not. Labeling is a complex task and may be biased by the person or people doing it; in this work, one person performed it. Cleaning was the other difficulty with this set. Unlike the ENDIREH set, it was hard to clean, because most tweets contained noise such as URLs, emojis, and characters that could not be parsed. Fortunately, NLP techniques helped solve these problems.
As mentioned above, the ENDIREH set was hard to understand during exploration because of its complexity. However, this step revealed that 90% of the women surveyed had previously suffered emotional violence. Exploring the Twitter dataset showed that the misogynistic tweets also referred to specific events in Mexico related to politics and social struggles.
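The kind of noise described above (URLs, emojis, unparseable characters) can be removed with a few regular expressions. The following is a minimal sketch using only the Python standard library; the function name `clean_tweet` and the choice to also drop mentions and punctuation are assumptions made for illustration, not the cleaning pipeline actually used in this work.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# \w is Unicode-aware for str patterns, so accented Spanish letters are kept,
# while emojis, punctuation, and the U+FFFD replacement character are not.
NON_WORD_RE = re.compile(r"[^\w\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, emojis, and punctuation."""
    text = unicodedata.normalize("NFKC", text)  # unify composed/decomposed accents
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    text = text.replace("_", " ")  # underscore counts as \w, so drop it explicitly
    return " ".join(text.lower().split())  # collapse runs of whitespace


# Hypothetical example tweet, invented for this sketch.
example = clean_tweet("¡Hola @user! Mira https://t.co/abc 😡 #NiUnaMenos")
```

Note that this sketch keeps the text of hashtags and only removes the `#` symbol; whether hashtags should be kept, split, or dropped depends on the analysis.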
Prediction
Multiple supervised and unsupervised models were then trained to explore and predict both sets. On the Twitter dataset, the supervised techniques Naive Bayes (NB) and Support Vector Machines (SVM) were used, as well as the unsupervised technique Association Rule Mining (ARM). ARM helped reveal the differences between misogynistic and non-misogynistic tweets. The misogynistic tweets contained words such as “kill” and “alcoholic women,” while the non-misogynistic tweets referred equally to women and men and did not contain negative verbs; these results served as a baseline for training the supervised models. The results of the two supervised models were quite similar, with an accuracy of around 80%, and the misclassifications occurred mostly on “subtle” misogynistic tweets. NB was more computationally efficient than SVM. With better hardware resources, an SVM or a neural network model could help predict “subtle misogyny,” which is equally important.
For the ENDIREH dataset, on the other hand, decision tree, random forest, and clustering models were implemented. Although the class imbalance found in the set could lead to overfitted supervised models, the objective of using decision trees and random forests was to visualize the most important factors for predicting emotional violence. The random forest reduced computational time and obtained better accuracy than the decision tree, reaching 97%, but the main result is that both models showed that the most important factors are related to physical violence in women’s past or current relationships. These results, together with the results of feature selection, were taken into account for the clustering techniques.
Clustering helped visualize how the data is distributed. For example, without using any labels, different clustering algorithms detected almost the same number of cases with no emotional violence, which suggests that the answers of women who have suffered emotional violence differ systematically from those of women who have not. Besides separating the data into two sets, increasing the number of clusters k was also helpful. A dendrogram with a higher number of clusters showed that there are subclusters inside the very large cluster of women who had suffered emotional violence, which means there are significant groups worth analyzing in more depth. Finally, plotting the data points colored by cluster against the most critical features for detecting emotional violence, such as the aggressor’s income and the number of visits by the partner, supported the importance of those features in the clustering techniques.
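To make the association-rule step concrete, the sketch below computes the three usual measures (support, confidence, and lift) for a rule of the form "term -> label". It is a minimal illustration under stated assumptions: each tweet is treated as a set of tokens plus one label item, the tokens `term_a`, `term_b`, and `term_c` are placeholders, and the four transactions are invented. It is not the ARM configuration used in this work.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)         # tweets containing A
    n_c = sum(1 for t in transactions if c <= t)         # tweets containing C
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # tweets containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Invented toy transactions: placeholder tokens plus a label item per tweet.
transactions = [
    frozenset({"term_a", "term_b", "label=misogynistic"}),
    frozenset({"term_a", "label=misogynistic"}),
    frozenset({"term_a", "term_c", "label=other"}),
    frozenset({"term_c", "label=other"}),
]

support, confidence, lift = rule_metrics(
    transactions, {"term_a"}, {"label=misogynistic"}
)
```

A lift above 1 means the term co-occurs with the label more often than would be expected if the two were independent, which is the kind of signal that can distinguish the vocabulary of the two classes.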
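Accuracy on an unbalanced set is easier to interpret next to a trivial baseline. The sketch below assumes, for illustration only, a hypothetical test split of 1,000 respondents in which 90% are positive cases (the proportion found during exploration) and a classifier that always predicts the positive class. The confusion-matrix counts are invented and are not results from this work.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the recall on each class; assumes both classes are present."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


# Hypothetical "always positive" classifier on 900 positive and 100 negative cases.
always_positive = dict(tp=900, fp=100, fn=0, tn=0)

baseline_acc = accuracy(**always_positive)            # 0.9
baseline_bal = balanced_accuracy(**always_positive)   # 0.5
```

Under these assumptions a model that learns nothing already reaches 90% accuracy, so the margin above that baseline, or a metric such as balanced accuracy, says more about the model than the raw accuracy does.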
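The idea of recovering groups without labels can be sketched with a small k-means implementation. The source does not name the clustering algorithms that were used, so k-means (Lloyd's algorithm) is an assumption here, and the two-dimensional points are invented stand-ins for scaled survey features.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm; returns (assignments, centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignments, centroids


# Invented toy data: two loose groups in a hypothetical 2-D feature space.
points = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.95), (0.95, 0.85),
]
# Deterministic initialization from two of the points, to keep the sketch reproducible.
assignments, centroids = kmeans(points, [points[0], points[3]])
cluster_sizes = [assignments.count(k) for k in range(2)]
```

Comparing `cluster_sizes` with the counts of the true labels, which the algorithm never sees, is one simple way to check whether the unsupervised grouping lines up with the labeled classes.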
Conclusions and future work
In conclusion, each model has different advantages, disadvantages, and purposes. In this work, each was used to better understand a social problem rather than to compare model results. The outcomes show that the design of the ENDIREH survey can be improved for analyzing emotional violence, and that some features carry more weight when separating cases with and without emotional violence. Looking at those features, we can conclude that the characteristics of the aggressors and of the women’s partners are essential for detecting emotional violence, especially those related to income and to relationship dynamics in terms of visits and communication. Another important finding is the relation between physical and emotional violence, which can be interpreted as emotional violence tending to escalate into physical violence. The objective of analyzing social problems should not be to create a model capable of detecting violence automatically, given that such a model can be biased and can affect a population. Instead, models can be used and interpreted to understand a problem better.
In future work, it would be interesting to set aside the data of women who have not suffered emotional violence in order to analyze in depth the patterns among women who have, without those outliers. It would also be interesting to perform multi-class classification to evaluate different levels of emotional violence. Finally, in terms of NLP, analyzing women’s answers to open questions to evaluate their sentiment could also be beneficial.