Clustering

Introduction

For this unsupervised learning task, we will focus on the ENDIREH survey results to understand the patterns and relations among the answers related to emotional violence.

We want to test whether clustering techniques are capable of separating our data into two clusters, the same way it is labeled for emotional violence, and also to perform an exploratory analysis of relations we are not yet aware of. Since the data has more than 3 thousand columns, only an expert could deeply understand them; thus, analyzing our data to comprehend the relationships between questions without the “emotional violence” label is essential.

Model

Clustering algorithms partition sample data so that similar data points are grouped together while dissimilar data points end up farther apart, in other groups.

Clustering is an unsupervised learning method that aims to find meaningful patterns, generative features, and groupings in sample data. The method’s flexibility makes it capable of performing different tasks depending on the user’s needs and data: common applications include data reduction, outlier detection, and pattern search among influential data groups.

Clustering types

An essential consideration when choosing a clustering method is whether the clusters should overlap (be nested) or be kept separate. Therefore, this work focuses on two main types of clustering: partitional and hierarchical.

- Partitional clustering:

Partitions the data into non-overlapping clusters. The most popular method is k-means, which creates k groups and assigns each data point to exactly one cluster.

Density-based approaches are a subset of partitional methods. They treat groups of similar points as dense regions and the rest as lower-density regions of the space. These methods have good accuracy and can merge neighboring dense regions into a single cluster; the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) model is an example.

- Hierarchical clustering:

Performs nested clustering, organized like a tree, where new clusters are formed from the previously formed ones. There are two subtypes of hierarchical clustering techniques: agglomerative (bottom-up) and divisive (top-down).

The agglomerative type, contrary to k-means, does not assume the number of clusters, but one of its disadvantages is the model’s high computational cost. On the other hand, the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) model does require the number of clusters K. Still, it has a low computational cost, which improves the clustering of sets with a high number of features.

K-means:

As previously mentioned, the k-means algorithm creates k groups, and each data point is assigned to one cluster. First, the algorithm initializes the k center points, or “centroids.” Afterward, it assigns each data point to the closest centroid based on similarity or distance; as a distance metric, k-means minimizes the squared Euclidean distance.

The model keeps assigning points to the nearest centroid and re-computing the centroids until it converges. Although the method tends to fall into local minima, this can be mitigated by running it several times with different initializations.
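As an illustration only (not part of the ENDIREH pipeline), the short sketch below runs scikit-learn’s KMeans on synthetic blobs; the data and parameter values are arbitrary, but it shows how n_init re-runs the algorithm with different initializations to mitigate local minima.

Code
# Minimal k-means sketch on synthetic data (illustrative only)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: three well-separated groups (arbitrary parameters)
X_toy, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init re-runs k-means with different centroid seeds and keeps the run
# with the lowest inertia, which mitigates local minima
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
toy_labels = km.fit_predict(X_toy)

print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # sum of squared distances to the closest centroid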

DBSCAN:

As mentioned, this algorithm treats groups of similar points as dense regions and the rest as lower-density regions of the space. One of its main advantages is that the number of clusters does not have to be defined before running.

The algorithm defines similarity as how “reachable” a point is from another. Densities are determined by the number of “neighbors,” or close points, a sample has; this method also measures similarity through a distance metric. Outliers are detected as points in low-density regions that are not close to any cluster and are finally excluded, which can be helpful for specific tasks.
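A hedged toy example (synthetic half-moons, not the survey data) of how eps and min_samples define DBSCAN’s dense regions, with label -1 marking the low-density outliers:

Code
# Minimal DBSCAN sketch on synthetic data (illustrative only)
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# toy data: two interleaving half-moons with some noise (arbitrary parameters)
X_toy, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps: radius used to decide whether a point is "reachable" from another
# min_samples: number of neighbors required to form a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X_toy)

# label -1 marks points in low-density regions, i.e. the outliers
print(np.unique(db.labels_, return_counts=True))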

Agglomerative:

The agglomerative or bottom-up technique takes its name from how it clusters the data: each object is initially considered a “leaf.” At each iteration, the algorithm finds the two most similar clusters and combines them into a larger cluster or branch. It iterates until all the data points belong to one single big cluster (the root), which can be seen as the beginning of the “tree.”

Visualizing the agglomerative technique (through a dendrogram) helps to determine the “optimal” number of clusters.
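The hedged sketch below (synthetic data, arbitrary parameters) builds the bottom-up merge tree with SciPy and plots the dendrogram; the largest vertical gaps between merges are what is usually inspected to pick the “optimal” number of clusters.

Code
# Minimal agglomerative clustering sketch on synthetic data (illustrative only)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Ward linkage merges, at each step, the two clusters whose union
# increases the within-cluster variance the least
Z_toy = linkage(X_toy, method="ward")

# each merge becomes a branch; cutting where the vertical gaps are
# largest suggests a number of clusters
dendrogram(Z_toy)
plt.title("Toy dendrogram")
plt.show()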

BIRCH:

Its main advantage is its high accuracy on huge datasets, given its reduced memory usage. BIRCH attempts to minimize memory by summarizing information in dense regions into compact representations, or subclusters, called Clustering Feature (CF) entries.

Like k-means, BIRCH requires the desired number of clusters K to divide the data, and it does not accept categorical attributes, since data points are represented by coordinates in a Euclidean space.
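A hedged sketch on synthetic numeric data (BIRCH only accepts numeric coordinates); the threshold value is arbitrary, and subcluster_centers_ exposes the compact CF-entry representatives the tree builds before the final clustering:

Code
# Minimal BIRCH sketch on synthetic numeric data (illustrative only)
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=500, centers=2, random_state=0)

# like k-means, BIRCH takes the desired number of final clusters;
# threshold controls the radius of the CF subclusters it builds first
brc_toy = Birch(n_clusters=2, threshold=0.5).fit(X_toy)

print(len(brc_toy.subcluster_centers_))  # number of compact CF-entry representatives
print(brc_toy.predict(X_toy)[:10])       # final cluster assignments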

Hyper-parameter tuning:

For the models that require it, the most critical parameter to tune for clustering is the number of groups into which we want to divide the data.

Techniques such as the elbow method and the silhouette method are used to explore this parameter and select the best value.

- Elbow method

This method is used to find the best number of clusters for k-means. It looks for the model with the lowest inertia and the lowest number of clusters; however, inertia always decreases as the number of clusters increases. The elbow method therefore aims to find the value at which the decrease in inertia begins to slow.

- Silhouette method

This method aims to understand the separation between the generated clusters. It is a visual technique that shows how close the points of a cluster are to the neighboring clusters; the resulting score ranges from -1 to 1. A toy sketch of both methods is shown after this list.
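A hedged toy sketch (synthetic blobs, arbitrary parameters) that computes both quantities for a range of k: the inertia column is inspected for the elbow, while the silhouette score, bounded between -1 and 1, is simply maximized.

Code
# Minimal elbow/silhouette sketch on synthetic data (illustrative only)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X_toy)
    # elbow: look for the k where inertia stops dropping sharply
    # silhouette: higher (closer to 1) means better-separated clusters
    print(k, round(km.inertia_, 1), round(silhouette_score(X_toy, km.labels_), 3))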

Implementation

This unsupervised learning task focuses on the ENDIREH survey results. As noted in the introduction, the data has more than 3 thousand columns that only an expert could deeply understand; thus, analyzing our data to comprehend the relationships between questions without the “emotional violence” label is essential.

To do so, we will start by performing feature selection with the best-performing method from the Random Forest tab. Since this selection only drops a small number of columns, it should be sufficient to improve the model accuracy while still showing relationships in our data.

Code
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from statistics import mode
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch
import sklearn.cluster as cluster
sns.set_theme(style = "whitegrid")
Code
df = pd.read_csv('data/endireh_ev.csv', encoding='latin1')

# drop constant columns (those with a single unique value)
df = df.loc[:, df.apply(pd.Series.nunique) != 1]
# replace infinities with NaN, then drop fully empty columns and rows
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=0, how='all')
# keep the emotional-violence label aside and remove it from the features
label = df["label"]
df = df.drop(columns=["label","P14_2_14","P14_3_14","P14_1_14","P14_2_10","P14_3_10","P14_1_10"])
print(df.info())
C:\Users\valer\AppData\Local\Temp\ipykernel_3032\361463827.py:1: DtypeWarning: Columns (188) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('data/endireh_ev.csv', encoding='latin1')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73500 entries, 0 to 73499
Columns: 346 entries, P12_1 to EST_DIS_y.1
dtypes: float64(264), int64(72), object(10)
memory usage: 194.0+ MB
None
Code
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2

# drop the remaining categorical (object) columns
categorical = df.select_dtypes(include=['object']).columns.tolist()
df = df.drop(categorical, axis=1)
# fill missing values with a sentinel (one below the minimum of the first column)
df = df.fillna((min(df[min(df)])) - 1)

scaler = StandardScaler()
selector = SelectKBest(chi2, k=330)

# keep the 330 features most associated with the label, then standardize them
df_selected = selector.fit_transform(df, label)
k_f_names = df.iloc[:, selector.get_support(indices=True)]

df_scaled = scaler.fit_transform(df_selected)

K-means

Since we have a binary label for emotional violence, the exploration will test whether the clustering techniques can separate the data the same way our label does, using k equal to two.

Code
KmeanModel = KMeans(n_clusters=2,init = "k-means++", random_state = 42)
predictions = KmeanModel.fit_predict(df_scaled)
Code
outliers = []

unique_cluster, counts_cluster = np.unique(predictions, return_counts=True)
unique_real, counts_real = np.unique(label, return_counts=True)

outliers.append(counts_cluster[0])

f, (ax1, ax2) = plt.subplots(ncols=2, dpi=120,)
sns.barplot(x=unique_cluster,
            y=counts_cluster,
            color = 'violet',
            ax=ax1)
ax1.set_ylabel("Violence estimator", fontsize=12)
ax1.set_xlabel("Mexico States", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

sns.barplot(x=unique_real,
            y=counts_real,
            color = 'violet',
            ax=ax2)
            
ax1.set_ylabel("Data samples", fontsize=12)
ax1.set_xlabel("Cluster labeled data", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

ax2.set_ylabel("", fontsize=16)
ax2.set_xlabel("Labels (emotional violence)", fontsize=12)
ax2.set_title("ENDIREH labeled data", fontsize=12)
f.tight_layout()

K-means was capable of clustering the data with a balance similar to the original label. Our data set is not balanced, which can affect the performance of other models, but it is important to see that the model replicates this unbalanced behavior. Although we cannot assume the clusters were divided by the “emotional violence” metric we have, there might be a relation that can be useful to predict emotional violence.

We will compare the samples’ original label with the cluster-assigned label through a confusion matrix.

Code
def plot_cm(cm):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                )

    fig, ax = plt.subplots(figsize=(10,8))
    ax.tick_params(axis='x', labelrotation = 45)
    disp.plot(ax=ax,xticks_rotation='vertical',)
    plt.show()

# flip the arbitrary cluster ids (0/1) so they line up with the label encoding
predictions_n = np.select([predictions == 1, predictions == 0], [0, 1], predictions)
cm_test = confusion_matrix(predictions_n,label)
plot_cm(cm_test)

The results of the confusion matrix show that the model had issues with data points that should have been labeled as “no emotional violence” but were labeled as “emotional violence.”

Finally, to compare the results, we will use the most critical features identified in the Decision Trees tab: “P4_2,” corresponding to “How frequently does your partner (who doesn’t live with you) visit you?”, and “P4_5_AB,” “Economic income range of your partner.” These two features will be used to visualize how our data points are distributed when labeled as emotional violence or not.

Code
predictions_cluster = pd.DataFrame(data=[predictions,label]).T
predictions_cluster= predictions_cluster.rename(columns={0: "cluster", 1: "endireh"})
predictions_cluster["P4_2"]=df["P4_2"]
predictions_cluster["P4_5_AB"]=df["P4_5_AB"]
print(predictions_cluster.head(3))
   cluster  endireh    P4_2  P4_5_AB
0      1.0      1.0     0.0      0.0
1      1.0      0.0  8500.0      0.0
2      0.0      0.0   400.0      0.0
Code
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "cluster",data=predictions_cluster,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

Code
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "endireh",data=predictions_cluster,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

It is interesting to see that both behave the same. In this case we do not care about the specific value of the cluster-assigned label, since it has no inherent meaning; we care about how the data is partitioned, which is similar for the original labels and the clusters.

Hyper-parameter tuning

For k-means clustering, we will use the elbow method to find the optimal number of clusters. We will use the inertia_ attribute, i.e., the sum of squared distances of samples to their closest cluster center, over a range of 1 to 15 clusters.

Code
distortions = []
inertias = []
max_k = 15

for k in range(1, max_k + 1):
    KmeanModel = KMeans(n_clusters=k, init="k-means++")
    KmeanModel.fit(df_scaled)

    # distortion: mean distance of the samples to their closest centroid
    distortions.append(sum(np.min(cdist(df_scaled, KmeanModel.cluster_centers_, "euclidean"), axis=1)) / df_scaled.shape[0])
    # inertia: sum of squared distances of the samples to their closest centroid
    inertias.append(KmeanModel.inertia_)

evaluation = pd.DataFrame({"Distortion": distortions, "Inertia": inertias, "cluster": np.arange(1, max_k + 1)})

evaluation
Distortion Inertia cluster
0 15.163802 2.425500e+07 1
1 14.587213 2.244653e+07 2
2 14.031899 2.108461e+07 3
3 13.615084 2.041345e+07 4
4 13.466566 1.974018e+07 5
5 13.209368 1.932672e+07 6
6 13.163584 1.896982e+07 7
7 13.007563 1.864908e+07 8
8 12.969019 1.830074e+07 9
9 12.803127 1.798712e+07 10
10 12.768043 1.771935e+07 11
11 12.775683 1.762728e+07 12
12 12.680777 1.743828e+07 13
13 12.537925 1.718932e+07 14
14 12.503489 1.696472e+07 15
Code
evaluation.plot.line(x= "cluster", subplots = True)
array([<AxesSubplot:xlabel='cluster'>, <AxesSubplot:xlabel='cluster'>],
      dtype=object)

We can see that the decrease in distortion and inertia starts to slow at around 8 clusters; thus, we will use that as the best k for k-means.

Code
KmeanModel = KMeans(n_clusters=8,init = "k-means++", random_state = 42)
predictions = KmeanModel.fit_predict(df_scaled)
Code
predictions_cluster["cluster_8"]=predictions
print(predictions_cluster.head(3))
   cluster  endireh    P4_2  P4_5_AB  cluster_8
0      1.0      1.0     0.0      0.0          4
1      1.0      0.0  8500.0      0.0          7
2      0.0      0.0   400.0      0.0          3
Code
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "cluster_8",data=predictions_cluster,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

There might be a pattern different from the one we are looking for (emotional violence). This does not mean there is no clustering into two groups, but rather that the similarity is stronger within these 8 groups than within two big groups.

From this, we can assume there are patterns among the women who have suffered emotional violence.

DBSCAN clustering

For this method we will also look for a separation into two groups, to compare the results against k-means.

Code
from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: neighbors needed to form a dense region
predictions = DBSCAN(eps=2.5, min_samples=2).fit(df_scaled)
predictions_cluster["DBSCAN"] = predictions.labels_

print(predictions_cluster.head(3))
   cluster  endireh    P4_2  P4_5_AB  cluster_8  DBSCAN
0      1.0      1.0     0.0      0.0          4      -1
1      1.0      0.0  8500.0      0.0          7      -1
2      0.0      0.0   400.0      0.0          3      -1
Code
unique_cluster, counts_cluster = np.unique(predictions_cluster["DBSCAN"], return_counts=True)
# sort the cluster sizes (avoid shadowing the built-in sorted)
sorted_counts = list(np.sort(counts_cluster))

print(sorted_counts[-20:])

outliers.append(counts_cluster[0])
[5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 9, 10, 10, 11, 12, 12, 13, 23, 51, 72559]

From the counts of the labels we can see there is one big cluster of 72559 points, while the rest are placed in many tiny clusters. We could interpret the remaining points as outliers or, since we know our data is unbalanced, assume those points correspond to our negative label.

We will repeat the experiment with a larger min_samples value to see if those points can be grouped together.

Code
predictions = DBSCAN(eps=2.5, min_samples=200).fit(df_scaled)
predictions_cluster["DBSCAN"] = predictions.labels_
unique_cluster, counts_cluster = np.unique(predictions_cluster["DBSCAN"], return_counts=True)
sorted_counts = list(np.sort(counts_cluster))
print(sorted_counts[-20:])
[73500]

Increasing min_samples did not produce a meaningful result; it forced the model to generate a single cluster.

Agglomerative Clustering (Hierarchical clustering)

Among the disadvantages of agglomerative clustering is its high memory usage; for this method, the number of samples had to be reduced by 60%.

Code
# Perform Agglomerative Clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
# keep 40% of the samples (X_train_4) so the model fits in memory
X_train_6, X_train_4, label_6, label_4 = train_test_split(df_scaled, label, train_size=.6)
Code
predictions = AgglomerativeClustering().fit(X_train_4)
print(predictions.labels_)
[0 1 0 ... 0 0 0]
Code
unique_cluster, counts_cluster = np.unique(predictions.labels_, return_counts=True)
unique_real, counts_real = np.unique(label_4, return_counts=True)

outliers.append(counts_cluster[0])

f, (ax1, ax2) = plt.subplots(ncols=2, dpi=120,)
sns.barplot(x=unique_cluster,
            y=counts_cluster,
            color = 'violet',
            ax=ax1)
ax1.set_ylabel("Violence estimator", fontsize=12)
ax1.set_xlabel("Mexico States", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

sns.barplot(x=unique_real,
            y=counts_real,
            color = 'violet',
            ax=ax2)
            
ax1.set_ylabel("Data samples", fontsize=12)
ax1.set_xlabel("Cluster labeled data", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

ax2.set_ylabel("", fontsize=16)
ax2.set_xlabel("Labels (emotional violence)", fontsize=12)
ax2.set_title("ENDIREH labeled data", fontsize=12)
f.tight_layout()

Even after reducing the dataset by 60%, the agglomerative technique reproduced the behavior of assigning a small number of data points to one class. Although the behavior is similar, the number of data points in that class is so small that it is more likely related to outliers than to the separation of the target we are looking for.

Code
"""
Commenting due to the high usage of memory to re-compute the dendrogram

Z = linkage(X_train_4, method="ward")
dend = dendrogram(Z)

"""
'\nCommenting due to the high usage of memory to re-compute the dendrogram\n\nZ = linkage(X_train_4, method="ward")\ndend = dendrogram(Z)\n\n'

Dendrogram

Even though the model did not perform as well as k-means when compared against the “emotional violence” labels, the dendrogram generated with the agglomerative technique shows the huge number of leaves, which is why truncating it is important. The model’s weaker performance, i.e., producing one huge cluster, can also be seen in this visualization: the red branch might be the class with fewer data points, while the big branch contains a huge number of clusters inside.
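Since re-computing the full dendrogram was too memory-heavy, the hedged sketch below shows, on synthetic data, the truncation mentioned above; the same truncate_mode option could be applied to the linkage matrix from the commented cell.

Code
# Sketch of a truncated dendrogram on synthetic data (illustrative only)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=400, centers=3, random_state=0)
Z_toy = linkage(X_toy, method="ward")

# truncate_mode="lastp" collapses everything below the last p merges,
# keeping the plot readable when there are many leaves
dendrogram(Z_toy, truncate_mode="lastp", p=20)
plt.title("Truncated toy dendrogram (last 20 merged clusters)")
plt.show()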

The mean-shift technique will be used to continue the exploration of the optimal number of clusters.

Code
from sklearn.cluster import MeanShift, estimate_bandwidth
from itertools import cycle

model = MeanShift(bandwidth=2).fit(X_train_4)
labels = model.labels_
cluster_center = model.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : ", n_clusters_)
number of estimated clusters :  29368

The result of this technique shows a huge number of clusters in our data. Two interpretations can be drawn from the number of estimated clusters, which is close to half the total number of data points: either our data is so sparse that most clusters contain only a couple of data points, or we are seeing the same behavior as with DBSCAN and the agglomerative technique, where one big group is surrounded by smaller outlying points.
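The huge cluster count may partly reflect the fixed bandwidth=2. As a hedged follow-up (not run above), estimate_bandwidth, which was imported but not used, could pick a data-driven bandwidth on the reduced set X_train_4; the quantile and n_samples values here are arbitrary.

Code
# Sketch: data-driven bandwidth for MeanShift on the reduced set (not run above)
from sklearn.cluster import MeanShift, estimate_bandwidth

# a larger quantile gives a wider kernel and therefore fewer clusters;
# n_samples sub-samples the data to keep the estimate cheap
bw = estimate_bandwidth(X_train_4, quantile=0.2, n_samples=2000, random_state=0)

ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X_train_4)
print("estimated bandwidth:", bw)
print("number of estimated clusters:", len(np.unique(ms.labels_)))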

BIRCH

Although the BIRCH method should reduce memory usage, the dataset is still too large to process on a low-memory laptop; thus, we will also use the reduced training set for this exploration.

Code
from sklearn.cluster import Birch

brc = Birch(n_clusters = 2).fit(X_train_4)
labels_brc =  brc.predict(X_train_4)
labels_brc
array([0, 1, 0, ..., 0, 0, 0], dtype=int64)
Code
unique_cluster, counts_cluster = np.unique(labels_brc, return_counts=True)
unique_real, counts_real = np.unique(label, return_counts=True)

outliers.append(counts_cluster[0])

f, (ax1, ax2) = plt.subplots(ncols=2, dpi=120,)
sns.barplot(x=unique_cluster,
            y=counts_cluster,
            color = 'violet',
            ax=ax1)
ax1.set_ylabel("Violence estimator", fontsize=12)
ax1.set_xlabel("Mexico States", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

sns.barplot(x=unique_real,
            y=counts_real,
            color = 'violet',
            ax=ax2)
            
ax1.set_ylabel("Data samples", fontsize=12)
ax1.set_xlabel("Cluster labeled data", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

ax2.set_ylabel("", fontsize=16)
ax2.set_xlabel("Labels (emotional violence)", fontsize=12)
ax2.set_title("ENDIREH labeled data", fontsize=12)
f.tight_layout()

Code
df_4 = pd.DataFrame(data = X_train_4)
df_4["label_4"]=label_4
df_4["labels_brc"]=labels_brc
X_train_4_T=np.transpose(X_train_4)
df_4["P4_5_AB"]=X_train_4_T[(int(df.columns.get_loc("P4_5_AB")))-1]
df_4["P4_2"]=X_train_4_T[(int(df.columns.get_loc("P4_2")))-1]
df_4.head()
0 1 2 3 4 5 6 7 8 9 ... 324 325 326 327 328 329 label_4 labels_brc P4_5_AB P4_2
0 -0.51561 -0.332921 -1.018516 0.603163 0.719775 0.733493 0.527630 1.625763 1.487464 -0.705399 ... -0.092724 0.099570 -0.270179 -1.411049 0.484570 0.495523 1.0 0 -0.703517 0.309286
1 -0.51561 -0.332921 1.447801 -0.670137 -0.452887 -1.497353 -2.146448 -1.371684 -1.184129 -2.658528 ... 1.685468 0.991112 1.249769 0.983707 -0.325979 -0.340321 0.0 1 -0.703517 -0.692631
2 -0.51561 -0.332921 -1.018516 0.603163 0.719775 0.733493 0.527630 -0.247641 -0.516231 0.271166 ... -0.092724 -0.300225 -0.033743 -1.411049 1.076728 1.043340 NaN 0 1.409664 1.311202
3 -0.51561 -0.332921 -0.196411 0.603163 0.719775 -1.497353 0.527630 -0.247641 -0.516231 1.247731 ... -0.092724 -0.504121 0.099377 -1.411049 1.586964 1.715404 NaN 0 0.353074 -0.692631
4 -0.51561 -0.332921 0.625695 0.603163 0.719775 0.733493 0.527630 -0.247641 -0.516231 0.271166 ... -0.092724 -0.048354 -0.343693 0.983707 1.257900 1.246654 0.0 0 1.409664 1.311202

5 rows × 334 columns

Code
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "labels_brc",data=df_4,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

Code
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "label_4",data=df_4,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

Although the model’s results are similar to the previous approaches, the scatterplot of the reduced data shows that most data points labeled “no emotional violence” are those with lower values of P4_5_AB and P4_2.

Finally, we will perform the silhouette method to find the best hyperparameter for BIRCH and compare the results with a larger number of clusters.

Code
df_4 = df_4.drop(["label_4","labels_brc"], axis=1)
df_4
0 1 2 3 4 5 6 7 8 9 ... 322 323 324 325 326 327 328 329 P4_5_AB P4_2
0 -0.515610 -0.332921 -1.018516 0.603163 0.719775 0.733493 0.527630 1.625763 1.487464 -0.705399 ... -0.066881 -0.048825 -0.092724 0.099570 -0.270179 -1.411049 0.484570 0.495523 -0.703517 0.309286
1 -0.515610 -0.332921 1.447801 -0.670137 -0.452887 -1.497353 -2.146448 -1.371684 -1.184129 -2.658528 ... -0.066881 -0.048825 1.685468 0.991112 1.249769 0.983707 -0.325979 -0.340321 -0.703517 -0.692631
2 -0.515610 -0.332921 -1.018516 0.603163 0.719775 0.733493 0.527630 -0.247641 -0.516231 0.271166 ... -0.066881 -0.048825 -0.092724 -0.300225 -0.033743 -1.411049 1.076728 1.043340 1.409664 1.311202
3 -0.515610 -0.332921 -0.196411 0.603163 0.719775 -1.497353 0.527630 -0.247641 -0.516231 1.247731 ... -0.066881 -0.048825 -0.092724 -0.504121 0.099377 -1.411049 1.586964 1.715404 0.353074 -0.692631
4 -0.515610 -0.332921 0.625695 0.603163 0.719775 0.733493 0.527630 -0.247641 -0.516231 0.271166 ... -0.066881 -0.048825 -0.092724 -0.048354 -0.343693 0.983707 1.257900 1.246654 1.409664 1.311202
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29395 1.688878 2.277461 0.625695 0.603163 -1.625549 0.733493 -2.146448 -0.247641 -0.516231 1.247731 ... -0.066881 -0.048825 -0.092724 0.159539 -0.240376 -0.213671 -0.027242 0.004182 0.353074 -0.692631
29396 -0.515610 -0.332921 0.625695 -1.943437 -1.625549 -1.497353 -2.146448 -0.997003 -0.516231 0.271166 ... -0.066881 -0.048825 -0.092724 -0.224264 -0.431115 -0.213671 0.746088 0.727075 2.466254 -0.692631
29397 -0.515610 -0.332921 -1.018516 -1.943437 -1.625549 0.733493 0.527630 1.625763 1.487464 -0.705399 ... -0.066881 -0.048825 -0.092724 -0.032363 0.802725 -0.213671 0.368777 0.365628 0.353074 0.309286
29398 -0.515610 -0.332921 0.625695 0.603163 0.719775 0.733493 0.527630 -0.622322 -0.850180 -1.681964 ... -0.066881 -0.048825 -0.092724 -0.576084 -0.605959 0.983707 0.041486 0.015477 2.466254 -0.692631
29399 -0.515610 -0.332921 0.625695 -1.943437 -1.625549 -0.381930 0.527630 -0.997003 -0.516231 0.271166 ... -0.066881 -0.048825 -0.092724 0.395418 -0.123152 0.983707 -1.095764 -1.006738 0.353074 -0.692631

29400 rows × 332 columns

Code
from sklearn.metrics import silhouette_score

def maximize_silhouette(X,nmax=20,i_plot=False):
    # PARAM
    i_print=False

    #FORCE CONTIGUOUS
    X=np.ascontiguousarray(X) 

    # LOOP OVER HYPER-PARAM
    params=[]; sil_scores=[]
    sil_max=-10
    for param in range(2,nmax+1):
        model = Birch(n_clusters=param).fit(X)
        labels=model.predict(X)
        try:
            sil_scores.append(silhouette_score(X,labels))
            params.append(param)
        except ValueError:
            continue

        if(i_print): print(param,sil_scores[-1])
        
        if(sil_scores[-1]>sil_max):
             opt_param=param
             sil_max=sil_scores[-1]
             opt_labels=labels

    print("OPTIMAL PARAMETER =",opt_param)

    if(i_plot):
        fig, ax = plt.subplots()
        ax.plot(params, sil_scores, "-o")  
        ax.set(xlabel='Hyper-parameter', ylabel='Silhouette')
        plt.show()

    return opt_labels

maximize_silhouette(df_4,nmax=5,i_plot=True)
OPTIMAL PARAMETER = 3

array([0, 1, 2, ..., 2, 2, 2], dtype=int64)

Training with optimal parameter 3

Code
model = Birch(n_clusters=3).fit(df_4)
C:\Users\valer\anaconda3\envs\ANLY501\lib\site-packages\sklearn\utils\validation.py:1858: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  warnings.warn(
Code
#df_4 = df_4.drop(df_4.columns[332], axis=1)
labels_brc =  model.predict(df_4)
C:\Users\valer\anaconda3\envs\ANLY501\lib\site-packages\sklearn\utils\validation.py:1858: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  warnings.warn(
Code
unique_cluster, counts_cluster = np.unique(labels_brc, return_counts=True)
unique_real, counts_real = np.unique(label, return_counts=True)

f, (ax1, ax2) = plt.subplots(ncols=2, dpi=120,)
sns.barplot(x=unique_cluster,
            y=counts_cluster,
            color = 'violet',
            ax=ax1)
ax1.set_ylabel("Violence estimator", fontsize=12)
ax1.set_xlabel("Mexico States", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

sns.barplot(x=unique_real,
            y=counts_real,
            color = 'violet',
            ax=ax2)
            
ax1.set_ylabel("Data samples", fontsize=12)
ax1.set_xlabel("Cluster labeled data", fontsize=12)
ax1.set_title("Cluster", fontsize=12)

ax2.set_ylabel("", fontsize=16)
ax2.set_xlabel("Labels (emotional violence)", fontsize=12)
ax2.set_title("ENDIREH labeled data", fontsize=12)
f.tight_layout()

Code
df_4["labels_brc_3"] = labels_brc
sns.scatterplot(x="P4_5_AB",y="P4_2", hue = "labels_brc_3",data=df_4,palette="deep").set(title='Data sampling based on feature importance')
[Text(0.5, 1.0, 'Data sampling based on feature importance')]

Results

K-means was fast in terms of computational time, and although it followed the same unbalanced labeling as the original, it mislabeled many data points. From the confusion matrix, we can see that the errors were mostly false negatives. From the scatterplot of the two most important features, we can see that the true positives increase as the range of those answers increases, and that behavior was replicated by the k-means labeling.

Although the elbow technique does not help with clustering by “emotional violence” or not, the result was significant in terms of exploration, returning a best value of 8. The results for DBSCAN were not as straightforward to compute and understand; the model could not differentiate between emotional violence and non-violence, but it did detect the outliers, since it grouped 72559 points together and left the remaining points scattered across very small clusters.

The agglomerative technique performed similarly to DBSCAN, returning a small number of labels for one class, leading to an outlier detection rather than separating the two classes. A dendrogram was generated with the agglomerative technique, which helped understand the relations among clusters. It shows how two groups of outliers branch out to smaller groups while a big group contains a more considerable number of clusters; we can assume the big group is the “emotional violence” data points. The mean-shift technique was implemented to support this method’s results, returning almost 30,000 clusters as a result (half of the total data points).

Although the BIRCH method aims to reduce memory usage, it still required more memory than was available to run on the whole dataset. After reducing the samples, the method performed similarly to the previous ones, assigning a minimal number of samples to one category when clustering into two groups. Finally, we used the silhouette method to find the optimal number of clusters between 2 and 5 (given the memory usage of the algorithm). The result was 3 clusters. The way the data is partitioned with 3 clusters also suggests two sub-clusters among the outliers or “non-violent” cases.

Code
techniques = ["Kmeans","DBSCAN","Agglomerative","BIRCH"]
f, ax = plt.subplots(dpi=120)
sns.barplot(x=techniques,
            y=outliers,
            color = 'orange',
)
ax.set_ylabel("number of data points detected as outliers", fontsize=12)
ax.set_xlabel("Techniques that detected the points", fontsize=12)
ax.set_title("Clustering techniques number of outliers detection", fontsize=12)
f.tight_layout()

Conclusion

The results were quite similar across all the implemented clustering techniques. Most of the techniques treated the “non-violent” examples as outliers and found around 5 to 8 clusters inside the “violent” data points.

In terms of errors, most models produced false negatives. In real-life applications to social problems, a model that produces false positives might be a better scenario than one that produces false negatives.

Even though we were looking for the models to separate the data into “emotional violence” and “non-emotional violence,” the elbow method for k-means and the silhouette method for BIRCH explore different patterns in the data: one returned eight as the optimal number of clusters, and the other three. From these, we can explore more deeply the cases where emotional violence occurs, setting aside the non-violent cases to better understand the social problem.

Clustering served as an exploratory analysis of our ENDIREH data. These techniques helped us realize how unbalanced our set is and how important it would be to drop the outliers and “non-emotional violence” data points to find more meaningful relations.

In future steps, the data will be reduced in terms of redundant features to have a smaller set. Outliers will also be removed to re-compute the past methods and understand the clusters that were pointed out by the model’s results.