As mentioned in previous tasks the objective for data collection was to obtein as much tweets in spanish from México labeled as misogynistic or not. On the other hand we were also looking for news articles related to femicides.
Collection
Tweets collection
Different datasets were collected from authors of previous works related to misogyny collection. Furthermore, in this work the collection of tweets through tweepy python API was also performed. For those tweets collected in this work around 3 thousand tweets were labeled to finally have a 10,000 labeled tweets balanced between misogynistic or not.
News articles collection
Maria Salguero is a Mexican activist who has collected multiple femicides articles published in Mexico. To have a bast dataset, her set of non-labeled news were collected. In addition, through this work web scrapping was implemented to collect more news articles of femicides in México, out of those collected around 200 were manually labeled making a total dataset of 8,000 articles.
Web scrapping
For the manual collection of this project, an exploration of how to obtain news articles with a high possibility of containing revictimization was performed. One of the essential findings is that most of the revictimizing articles describe “controversial” femicide cases. Unfortunately, most media platforms do not cover issues where little information is given.
The article search was performed with a list of victims’ names, and a web scrapping script was developed to collect the news content.
import pandas as pdfrom bs4 import BeautifulSoupimport requestsart_0 =["https://www.infobae.com/america/mexico/2022/01/01/la-brutal-realidad-de-los-feminicidios-en-mexico-mas-de-10-mujeres-fueron-asesinadas-al-dia-en-2021/","https://www.infobae.com/america/mexico/2021/12/27/mas-de-10-mujeres-asesinadas-al-dia-durante-el-2021-en-mexico/"]art_1 =["https://www.elsoldemexico.com.mx/republica/justicia/mara-habria-sido-asesinada-en-un-motel-de-puebla-fiscalia-254334.html","https://primeravueltanoticias.com/la-otra-version-mara-fue-asesinada-por-su-novio-millonario/"]arts_content =[]arts_label =[]art_dict = {0:art_0,1:art_1}for label, art_list in art_dict.items():for art in art_list:try: r = requests.get(art)print(r)except:print("not able to connect")break soup = BeautifulSoup(r.content, 'html.parser') soup.prettify() content = soup.find_all('p') content_text = [x.text for x in content] arts_content.append(''.join(content_text)) arts_label.append(label)articles_labeled = pd.DataFrame(data = [arts_content,arts_label])articles_labeled.T.head(3)