Decision Trees

Introduction

With this method we also explore different techniques to improve our data modelling and, ultimately, the quality of the predictions.

Some of these techniques include dropping columns with a single value or with unnecessary values, standardizing all of our columns (some of them have wide value ranges), and feature selection. Feature selection provides several advantages, such as removing irrelevant and redundant features that have no relation to the target variable or to the task the model is designed to solve. This helps with dimensionality problems, which otherwise lead to long training times and difficulty in interpreting the model’s results.

Besides reducing the number of features the model receives, we can “tune” the model through its hyperparameters. Tuning means finding the hyperparameter values that maximize the model’s performance on the data it receives, thereby improving its prediction results.

Going further into the implementation, for decision trees we decided to use different data. We extracted columns from the data we had already obtained and cleaned. These columns are answers from the ENDIREH survey described before; the answers we extracted describe the characteristics of each Mexican woman’s past and current relationships and characteristics of her partner. We want to use those characteristics to predict whether a woman has experienced emotional violence in her relationship. In total we have 425 columns, each representing one question or characteristic.

Model

Decision Trees are non-parametric supervised learning classification models that learn a set of if-then-else decision rules from the data, which ends up looking like a tree. The deeper the tree, the more complex the decision criteria and the closer the fit to the training data.
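As a quick illustration (on scikit-learn’s toy iris dataset, not our survey data), a fitted tree can be printed directly as its if-then-else rules:

Code
# Minimal sketch on sklearn's toy iris dataset (not the ENDIREH data):
# a fitted tree printed as its underlying if-then-else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# export_text renders each split as a nested if/else condition
print(export_text(clf, feature_names=list(iris.feature_names)))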

Decision Trees are simple to understand and to interpret visually. They are a white-box model: if a feature is important to the model, it appears directly as a boolean condition in the tree.

On the other hand, Decision Trees can grow over-complex and overfit the training data. To avoid this, mechanisms such as pruning, a minimum number of samples required at a leaf node, or a maximum tree depth are needed.
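A minimal sketch of those mechanisms as scikit-learn exposes them; the values below are placeholders for illustration, not tuned for our data:

Code
from sklearn.tree import DecisionTreeClassifier

# Placeholder values, not tuned for our data:
clf = DecisionTreeClassifier(
    max_depth=7,            # cap the depth of the tree
    min_samples_leaf=20,    # require at least 20 samples in every leaf
    ccp_alpha=0.001,        # cost-complexity (post-)pruning strength
)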

Code
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, recall_score
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
matplotlib.style.use('seaborn-pastel')

df = pd.read_csv('data/endireh_ev.csv', encoding='latin1')
print(df.shape)
# Drop constant columns; replace infinities with NaN and drop columns
# that are entirely NaN (partially missing values are filled later).
# Note: these operations must be assigned back to take effect.
df = df.loc[:, df.apply(pd.Series.nunique) != 1]
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1, how="all")
print(df.shape)

df.head(5)
(73500, 354)
(73500, 354)
P12_1 P12_2 P12_3 P12_4 P12_5 P12_6 P12_7 P12_8 P12_9 P12_10 ... P4_13_4 P4_13_5 P4_13_6 P4_13_7 FAC_VIV_y.1 FAC_MUJ_y.1 ESTRATO_y.1 UPM_DIS_y.1 EST_DIS_y.1 label
0 2 1 2 3 1 3 3 8.0 8.0 3.0 ... 1.0 NaN NaN NaN 113 113 4 1 3 1.0
1 1 1 3 3 3 1 3 1.0 2.0 4.0 ... 5.0 NaN NaN NaN 113 113 4 1 3 0.0
2 2 1 1 3 3 3 1 3.0 8.0 1.0 ... 2.0 NaN NaN NaN 78 155 2 2 1 0.0
3 1 1 3 1 1 1 1 1.0 2.0 3.0 ... 1.0 NaN NaN NaN 78 78 2 2 1 0.0
4 1 1 4 3 3 3 3 8.0 8.0 3.0 ... 5.0 NaN NaN NaN 78 78 2 2 1 0.0

5 rows × 354 columns

To obtain the label we looked at two survey questions that describe situations where the woman experienced psychological violence; for each, she selects the date range in which it happened, or indicates that it did not occur. We transformed those answers into a binary label: 1 if the woman has suffered emotional violence in her present or past relationships, and 0 if she has not.
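An illustrative sketch of how such a label could be derived (the label in endireh_ev.csv was built upstream). The question columns (P14_1_10, P14_1_14) and the answer codes for the date ranges are assumptions, not the exact ENDIREH coding:

Code
# Sketch of the labeling step. The question columns and the answer
# codes for the date ranges (assumed 1-4 here) are illustrative
# assumptions, not the exact ENDIREH coding.
DATE_RANGE_CODES = [1, 2, 3, 4]
violence_questions = ["P14_1_10", "P14_1_14"]

df["label"] = (df[violence_questions]
               .isin(DATE_RANGE_CODES)   # answered with some date range
               .any(axis=1)              # ...on either question
               .astype(float))           # 1.0 = suffered, 0.0 = not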

While cleaning we realized the classes were not balanced: roughly three quarters of the rows were labeled 0 (“no emotional violence”), so we randomly discarded part of the 0-labeled data points to reduce the imbalance. The class counts at this stage are:

Code
#check balance of the classes
def check_balance(df):
    print(df.label.value_counts())
    
check_balance(df)
0.0    54940
1.0    18560
Name: label, dtype: int64
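A minimal sketch of the undersampling step described above; the target size for the majority class is a placeholder, not the value we actually used:

Code
# Randomly keep only part of the majority class (label 0); n=25000 is
# a placeholder target size, not the value used in our cleaning.
zeros = df[df.label == 0].sample(n=25000, random_state=42)
ones = df[df.label == 1]
df_balanced = pd.concat([zeros, ones]).sample(frac=1, random_state=42)  # shuffle
check_balance(df_balanced)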

Since we have a large number of features, we standardize them before splitting the data for training.

Code
y=df["label"]
X = df.drop("label", axis=1)

categorical = df.select_dtypes(include=['object']).columns.tolist()
X = X.drop(categorical, axis=1)
X= X.fillna(0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.to_numpy())
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)

Training a Decision Tree Classifier using Sklearn

Code
from sklearn import tree
model = tree.DecisionTreeClassifier()
Code
model.fit(X_train, y_train)
DecisionTreeClassifier()
Code
y_pred = model.predict(X_test)

Measuring the model based on its predictions

Code
from sklearn.metrics import classification_report  # confusion_matrix imported above
def print_report(y_test, y_pred):
    clf_report_linear = classification_report(y_test, y_pred, output_dict=True)
    print(pd.DataFrame(clf_report_linear).transpose())

print_report(y_test, y_pred)
              precision  recall  f1-score  support
0.0                 1.0     1.0       1.0  13752.0
1.0                 1.0     1.0       1.0   4623.0
accuracy            1.0     1.0       1.0      1.0
macro avg           1.0     1.0       1.0  18375.0
weighted avg        1.0     1.0       1.0  18375.0
Code
def plot_cm(cm):
    # ConfusionMatrixDisplay (imported above) replaces the deprecated
    # plot_confusion_matrix helper
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.tick_params(axis='x', labelrotation=45)
    disp.plot(ax=ax, xticks_rotation='vertical')
    plt.show()
cm_test = confusion_matrix(y_test,y_pred)
plot_cm(cm_test)

Looking at the results, the classifier perfectly classified every test value. We will plot the decision tree to understand this behavior.

Code
import graphviz

def plot_tree(model, X_train, y_train, keys):
    # Render the fitted tree with matplotlib and save it to disk
    fig = plt.figure(figsize=(25, 20))
    _ = tree.plot_tree(model, feature_names=keys, filled=True)
    fig.savefig("decision_tree.png")
    # Also export a graphviz version and return it for display
    dot_data = tree.export_graphviz(model, out_file=None, feature_names=keys)
    return graphviz.Source(dot_data, format="png")

keys = X.keys()
plot_tree(model, X_train, y_train, keys)

Looking at the tree, we can see this is a mistake. Feature P14_2_10 is redundant with the feature we used as the label, P14_1_10. Both questions describe similar emotional-violence experiences, so if a woman answered yes to P14_2_10 (the most important feature) she also answered yes to our label question. This is a cleaning mistake, so we drop these redundant columns and retrain.
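The leak can also be spotted without plotting, by ranking the fitted tree’s impurity-based feature importances; a quick sketch:

Code
# A leaking feature shows up with a dominant importance score at the top
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))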

Code
from sklearn.feature_selection import VarianceThreshold
y=df["label"]
X = df.drop("label", axis=1)

categorical = df.select_dtypes(include=['object']).columns.tolist()
X = X.drop(categorical, axis=1)
# Drop the redundant emotional-violence answer columns spotted above
X = X.drop(["P14_2_14", "P14_3_14", "P14_2_10", "P14_3_10"], axis=1)
X = X.fillna(0)

X_scaled = scaler.fit_transform(X.to_numpy())
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)


selector = VarianceThreshold(threshold=.07)
X_vr = selector.fit_transform(X_scaled)

keys = X.keys()
keys_var = [keys[i] for i, f in enumerate(selector.get_support()) if f]

X_train, X_test, y_train, y_test = train_test_split(X_vr,y)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print_report(y_test, y_pred)
#plot_tree(model,X_train,y_train,keys_var)
              precision  recall  f1-score  support
0.0                 1.0     1.0       1.0  13743.0
1.0                 1.0     1.0       1.0   4632.0
accuracy            1.0     1.0       1.0      1.0
macro avg           1.0     1.0       1.0  18375.0
weighted avg        1.0     1.0       1.0  18375.0

After dropping those columns the tree is still small. The reason is that all features starting with P14_ and ending in 14 or 10 describe emotional violence and the scenarios where it happened, so the model picks them up as the most important features for predicting emotional violence.

Since we want to understand which characteristics best predict emotional violence without using any information directly about emotional violence, we decided to drop all columns related to it.

Code
y=df["label"]
X = df.drop("label", axis=1)

categorical = df.select_dtypes(include=['object']).columns.tolist()
X = X.drop(categorical, axis=1)
# Drop every column that directly describes emotional violence
X = X.drop(["P14_2_14", "P14_3_14", "P14_1_14",
            "P14_2_10", "P14_3_10", "P14_1_10"], axis=1)
X = X.fillna(0)

X_scaled = scaler.fit_transform(X.to_numpy())
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

keys = X.keys()

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print_report(y_test, y_pred)
#plot_tree(model,X_train,y_train,keys)
              precision    recall  f1-score       support
0.0            0.916007  0.920085  0.918041  13702.000000
1.0            0.762576  0.752621  0.757566   4673.000000
accuracy       0.877497  0.877497  0.877497      0.877497
macro avg      0.839291  0.836353  0.837804  18375.000000
weighted avg   0.876987  0.877497  0.877230  18375.000000

With a model no longer trained on “biased” data we get a bigger tree. The accuracy is better than expected, but the tree may be overfitted given how deep it looks, so we will experiment with hyperparameter tuning to improve the results.

Code
#Experimenting with threshold around .001 and .1  best .07
selector = VarianceThreshold(threshold=.07)
X_vr = selector.fit_transform(X)

print(X.shape)
print(X_vr.shape)

X_train, X_test, y_train, y_test = train_test_split(X_vr,y)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print_report(y_test, y_pred)
#plot_tree(model,X_train,y_train)
(73500, 337)
(73500, 259)
              precision    recall  f1-score       support
0.0            0.916043  0.919049  0.917543  13712.000000
1.0            0.759636  0.752305  0.755953   4663.000000
accuracy       0.876735  0.876735  0.876735      0.876735
macro avg      0.837839  0.835677  0.836748  18375.000000
weighted avg   0.876352  0.876735  0.876537  18375.000000

Experimenting with the VarianceThreshold parameter reduced the feature count from 337 to 259 while leaving accuracy essentially unchanged (0.8767 vs. 0.8775).
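A sketch of the threshold scan referenced in the code comment above; the grid values are illustrative, not the exact values we tried:

Code
# Scan a few VarianceThreshold values and compare held-out accuracy;
# the grid is illustrative.
for t in [0.001, 0.01, 0.05, 0.07, 0.1]:
    X_t = VarianceThreshold(threshold=t).fit_transform(X)
    Xtr, Xte, ytr, yte = train_test_split(X_t, y, random_state=0)
    acc = tree.DecisionTreeClassifier().fit(Xtr, ytr).score(Xte, yte)
    print(f"threshold={t}: {X_t.shape[1]} features, accuracy={acc:.4f}")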

Code
from sklearn.feature_selection import SelectKBest, chi2
print(X.shape)
# Experimenting with the number of selected features; best k found: 320
selector = SelectKBest(chi2, k=320)
X_chi= selector.fit_transform(X, df["label"])

k_f_names = X.iloc[:,selector.get_support(indices=True)]


X_train, X_test, y_train, y_test = train_test_split(X_chi,y)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print_report(y_test, y_pred)
#plot_tree(model,X_train,y_train,k_f_names.columns)
(73500, 337)
              precision    recall  f1-score       support
0.0            0.911383  0.919579  0.915463  13678.000000
1.0            0.759510  0.739621  0.749434   4697.000000
accuracy       0.873578  0.873578  0.873578      0.873578
macro avg      0.835447  0.829600  0.832448  18375.000000
weighted avg   0.872562  0.873578  0.873023  18375.000000

Experimenting with the SelectKBest k values, the best value found was k=320; its accuracy (0.8736) is comparable to, though slightly below, the variance-threshold run (0.8767).

Code
test_results = []
train_results = []
# Scan tree depth from 1 to 19 and record accuracy and per-class recall
for num_layer in range(1, 20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(X_train, y_train)

    yp_train = model.predict(X_train)
    yp_test = model.predict(X_test)

    test_results.append([num_layer,
                         accuracy_score(y_test, yp_test),
                         recall_score(y_test, yp_test, pos_label=0),
                         recall_score(y_test, yp_test, pos_label=1)])
    train_results.append([num_layer,
                          accuracy_score(y_train, yp_train),
                          recall_score(y_train, yp_train, pos_label=0),
                          recall_score(y_train, yp_train, pos_label=1)])

labels = ["", "Accuracy", "Recall (Y=0)", "Recall (Y=1)"]
for i in range(1, 4):
    plt.subplots(1, figsize=(10, 10))
    plt.plot(range(1, 20), [x[i] for x in test_results],
             label="Test score", color="red", marker="o")
    plt.plot(range(1, 20), [x[i] for x in train_results],
             label="Training score", color="blue", marker="o")

    plt.title("Validation curve")
    plt.xlabel("Number of layers in Decision Tree (max_depth)", fontsize=16)
    plt.ylabel(labels[i], fontsize=16)
    plt.legend(loc="best")
    plt.show()

Experimenting with the number of layers in the decision tree (max_depth): looking at the plots, the model’s test performance peaks between depths 5 and 8, so we selected max_depth = 7.

Code
model = tree.DecisionTreeClassifier(max_depth=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print_report(y_test, y_pred)
              precision    recall  f1-score       support
0.0            0.914127  0.960374  0.936680  13678.000000
1.0            0.864669  0.737279  0.795909   4697.000000
accuracy       0.903347  0.903347  0.903347      0.903347
macro avg      0.889398  0.848827  0.866294  18375.000000
weighted avg   0.901484  0.903347  0.900696  18375.000000

Looking at these metrics we can observe the impact of hyperparameter tuning: accuracy rose from roughly 0.874 to 0.903.
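This kind of search can also be automated with scikit-learn’s GridSearchCV instead of reading the best depth off the plots; a minimal sketch with an illustrative grid:

Code
from sklearn.model_selection import GridSearchCV

# Illustrative grid; 5-fold cross-validation picks the best combination
grid = GridSearchCV(
    tree.DecisionTreeClassifier(),
    param_grid={"max_depth": list(range(3, 12)),
                "min_samples_leaf": [1, 10, 50]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)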

Code
plot_tree(model,X_train,y_train,k_f_names.keys())

Code
# Random baseline classifier
from collections import Counter
from sklearn.metrics import precision_recall_fscore_support

def random_classifier(y_data):
    # Predict each label uniformly at random from {0, ..., max_label}
    ypred = []
    max_label = np.max(y_data)
    for i in range(0, len(y_data)):
        ypred.append(int(np.floor((max_label + 1) * np.random.uniform(0, 1))))
    print("count of prediction:", Counter(ypred).values())  # class frequencies
    print("probability of prediction:",
          np.fromiter(Counter(ypred).values(), dtype=float) / len(y_data))
    print("accuracy", accuracy_score(y_data, ypred))
    print("precision, recall, fscore, support",
          precision_recall_fscore_support(y_data, ypred))

print("\nBINARY CLASS: ENDIREH data")
random_classifier(y_train)

BINARY CLASS: ENDIREH data
count of prediction: dict_values([27545, 27580])
probability of prediction: [0.49968254 0.50031746]
accuracy 0.49844897959183676
precision, recall, fscore,support (array([0.74680928, 0.2497731 ]), array([0.499176  , 0.49628508]), array([0.59838471, 0.33230294]), array([41262, 13863], dtype=int64))

As a baseline comparison, we implemented a random binary classifier on our data to compare against the model we just built. Based on these results, our model did better by roughly 40 percentage points of accuracy (0.903 vs. 0.498).

In conclusion, the “mistakes” in the data modelling (not dropping all columns related to emotional violence) made it easier to understand the logic behind decision trees. After removing those columns, we were able to see the impact of correct data modelling through manual cleaning, feature selection, and hyperparameter tuning.

After dropping the unnecessary columns we had 337 features. The VarianceThreshold technique reduced them to 259, and SelectKBest with k=320 gave comparable results; in both cases accuracy stayed close to the full-feature model, and depth tuning then lifted it to about 0.90. Together the results show that, even with a large feature set, removing a substantial share of the features need not hurt the model’s accuracy.

Finally, we will continue the analysis of these results in the next tab by implementing a Random Forest Classifier.