Data Gathering

ENDIREH Data

To gather our data, this project focuses on the sections of the survey that contain TRUE or FALSE answers about experienced violent situations. The largest section, “TVIV”, is composed of 122,646 objects with 35 indicators each. The objects are the answers, the “rows” of the data, and the 35 indicators are questions about violent situations, the “columns”.

As mentioned before, each indicator is a boolean variable that, based on an ENDIREH key dictionary, indicates whether the answer to that question raises the experienced-violence metric of the person answering the survey.
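
As an illustration of how such indicators can feed that metric, consider a minimal sketch with hypothetical question columns (these are not the actual ENDIREH keys):

  # Toy example: each column holds TRUE/FALSE answers to a violence question;
  # the metric counts how many violent situations each respondent reports.
  answers <- data.frame(
    q_insulted   = c(TRUE, FALSE, TRUE),
    q_threatened = c(FALSE, FALSE, TRUE),
    q_hit        = c(FALSE, TRUE, TRUE)
  )
  answers$violence_metric <- rowSums(answers)  # TRUE counts as 1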

INEGI provides the ENDIREH results in multiple file formats, such as CSV, DTA, DBF, SAV, and RData, all of which can be downloaded from the INEGI webpage. For this project the RData file was downloaded to collect the information.

Code

  install.packages("survey")
  library(survey) 
  
  load("../bd_endireh_2021.RData")
  tiv <- data.frame(TB_SEC_IVaVD)
  tiv <- data.frame(TB_SEC_IVaVD)
  data_new <- tiv[ , colSums(is.na(tiv)) < nrow(tiv)]
  data <- data_new[vapply(data_new, function(x) length(unique(x)) > 1, logical(1L))]
  

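After the cleaning step, a quick structural check confirms what was removed (a minimal sketch; the exact column counts depend on the table):

  # Compare dimensions before and after dropping all-NA and constant columns
  dim(tiv)              # original rows x columns
  dim(data)             # same rows, fewer columns
  head(colnames(data))  # a few of the remaining question keys
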
INEGI also provides a manual for understanding and manipulating the data set.

As mentioned before, the dataset is composed of 28 tables containing all the information retrieved by the survey, organized by the acts of violence experienced in school, work, community, and family settings. Tables are labeled with a combination of letters related to the title of the section, for example “Tabla de la información de la vivienda (TVIV)”, which translates to “Housing information table” and contains the residence characteristics, the number of persons living in it, and the number of people sharing the costs of the space.
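
A quick way to see those tables is to list the objects that the RData file loads into the workspace (a minimal sketch; it assumes the table objects follow the naming convention described above):

  # After load(), every survey table is an object in the workspace
  load("../bd_endireh_2021.RData")
  tables <- grep("^T", ls(), value = TRUE)  # table names start with "T"
  length(tables)  # should report the 28 section tables
  head(tables)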

As seen in the entity relation model below, the tables contain different types of data values: strings, such as “Aguascalientes”, which identify the woman's state of origin, and integers that act as labels for the answers to specific questions.

entity relation model [Entity relation model, author: INEGI]
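
Column classes can be inspected directly to confirm that mix of strings and integer labels (a minimal sketch; it assumes the housing table loads as an object named TVIV):

  # Inspect the classes of the first few columns of the housing table:
  # character columns hold names such as the state, integer columns hold
  # answer labels
  str(TVIV[, 1:5])
  sapply(TVIV, class)[1:5]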

Twitter Data

For the implementation of this project, two APIs were used to collect data from Twitter: one through Python using Tweepy, an open-source Python package that gives access to the Twitter API, and the other through R using rtweet, an implementation of calls designed to collect and organize Twitter data via Twitter's REST API.

The Python implementation is shown next:

Code
import pandas as pd
import json
import tweepy


class TweetGathering:
    def __init__(
        self,
        consumer_key,
        consumer_secret,
        access_token,
        access_token_secret,
        bearer_token=None,
    ):
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_token_secret = access_token_secret
        self.bearer_token = bearer_token  # optional OAuth 2.0 bearer token
        self.set_auth()

    def set_auth(self):
        # OAuth 1.0a user-context authentication
        self.auth = tweepy.OAuthHandler(
            self.consumer_key, self.consumer_secret)
        self.auth.set_access_token(self.access_token, self.access_token_secret)
        self.api = tweepy.API(self.auth)

    def gather_tweets(
        self,
        query,
        max_results=100,
        tweet_fields=["text", "created_at", "id_str"],
        start_time="",
        end_time="",
        geocode="22.001591,-101.043063,800km"
    ):
        # Remember the requested fields/limit so save_tweets_csv can reuse them
        self.tweet_fields = tweet_fields
        self.max_results = max_results
        num_needed = 10000  # target total number of tweets to collect
        tweet_list = []
        user_list = []
        screen_name_list = []
        tw_id = []
        last_id = -1  # paginate backwards through results via max_id
        # Page backwards through search results until enough tweets collected
        while len(tweet_list) < num_needed:
            try:
                # Note: the standard search endpoint has no include_rts
                # parameter; retweets can be excluded by appending
                # "-filter:retweets" to the query string instead.
                new_tweets = self.api.search_tweets(
                    q=query, geocode=geocode, count=num_needed,
                    max_id=str(last_id - 1), tweet_mode='extended')
            except tweepy.TweepyException as e:
                print("Error", e)
                break
            else:
                if not new_tweets:
                    print("Could not find any more tweets!")
                    break
                else:
                    for tweet in new_tweets:
                        screen_name = tweet.author.screen_name
                        user_name = tweet.author.name
                        tweet_text = tweet.full_text
                        tweet_list.append(tweet_text)
                        user_list.append(user_name)
                        screen_name_list.append(screen_name)

                        tw_id.append(tweet.id)
            last_id = min(tw_id)  # continue paging from the oldest id seen

        df = pd.DataFrame({'Screen name': screen_name_list,
                           'Username': user_list,
                           'Tweets': tweet_list})
        df = df.drop_duplicates()
        return df

    def save_tweets_csv(self, search_results):
        # Build a comma-separated header from the requested tweet fields,
        # plus a final column for the tweet URL
        header = ""
        for field in self.tweet_fields:
            header += str(field) + ","
        header += "url"
        with open("tweet_search.txt", "w", encoding="utf-8") as file:
            file.write(header + "\n")

        # Write one line per tweet, stripping newlines and commas from values
        for i in range(min(self.max_results, len(search_results))):
            tweet_data = ""
            for field in self.tweet_fields:
                tweet_data += (
                    str(
                        str(search_results[i]._json[field])
                        .replace("\n", " ")
                        .replace(",", " ")
                    )
                    + ","
                )
                print(field, search_results[i]._json[field])
            # Append a direct link to the tweet as the last column
            url = (
                "https://twitter.com/i/web/status/" +
                search_results[i]._json["id_str"]
            )
            tweet_data += url + "\n"
            # Append to the same file the header was written to
            with open("tweet_search.txt", "a", encoding="utf-8") as file:
                file.write(tweet_data)

For the first stage of this project, 10,000 tweets were collected through Tweepy and 1,000 through rtweet; both searches queried tweets containing the word “Feminista” (feminist). At this stage the R script outputs a txt file and generates a word cloud for a simple analysis of the words in the collected set, while the Python script also collects metadata from each tweet, such as “id” and “creation_date”, and stores the information in a CSV file. A sketch of the R collection step is shown next.
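
The exact R script is not reproduced here, but a minimal sketch of that collection step with rtweet could look like the following (query and sample size taken from the description above; API token setup is assumed to be done beforehand):

  library(rtweet)     # Twitter REST API client for R
  library(wordcloud)  # simple word cloud plotting

  # Collect 1,000 recent tweets containing "Feminista"
  tweets <- search_tweets("Feminista", n = 1000, include_rts = FALSE)

  # Save the raw text to a txt file
  writeLines(tweets$text, "feminista_tweets.txt")

  # Tokenize crudely and plot the most frequent words
  words <- unlist(strsplit(tolower(tweets$text), "[^a-záéíóúñ]+"))
  words <- words[nchar(words) > 3]
  freqs <- sort(table(words), decreasing = TRUE)
  wordcloud(names(freqs), as.numeric(freqs), max.words = 100)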

Code
df = pd.read_csv('data/feminista_tweets.csv')
df.head()
text created_at url
0 RT @Omnia_Somnia: @Llega_Ixtab Habla de los ex... Wed Sep 14 16:49:38 +0000 2022 https://twitter.com/i/web/status/1570092418799...
1 RT @angelamrobledo: Al Presidente @petrogustav... Wed Sep 14 16:49:26 +0000 2022 https://twitter.com/i/web/status/1570092370972...
2 RT @PODEMOS: "El objetivo que les hemos trasla... Wed Sep 14 16:49:22 +0000 2022 https://twitter.com/i/web/status/1570092351921...
3 RT @PODEMOS: "El objetivo que les hemos trasla... Wed Sep 14 16:49:19 +0000 2022 https://twitter.com/i/web/status/1570092341926...
4 RT @NuRakell: @Brujaycotilla Los mismos que ll... Wed Sep 14 16:49:12 +0000 2022 https://twitter.com/i/web/status/1570092312641...

Conclusion

Data gathering can be a hard task to perform during normal data science cycles.

Since this project has the specific purpose of analyzing an already provided dataset, collecting the data was not hard to perform.

In the same manner, the Twitter API makes it extremely easy to retrieve information in both R and Python.