machine-learning-da - Tumblr blog

machine-learning-da · 5 years ago

k-means Cluster Analysis

A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 11 variables that represent characteristics that could have an impact on school achievement. Clustering variables included two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana, as well as quantitative variables measuring alcohol problems, a scale measuring engaging in deviant behaviours (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school), and scales measuring violence, depression, self-esteem, parental presence, parental activities, family connectedness, and school connectedness.

To build a k-means clustering model we perform the following steps. We import all the necessary libraries, then import the dataset and clean the data. We then create a data set called cluster that includes only our clustering variables, and copy it to clustervar for standardization.

In cluster analysis, variables with large values contribute more to the distance calculations. Variables measured on different scales should therefore be standardized prior to clustering, so that the solution is not driven by variables measured on larger scales. We use the following code to standardize the clustering variables to have a mean of 0 and a standard deviation of 1.

clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))

clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))

clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))

clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))

clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))

clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64'))

clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64'))

clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64'))

clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64'))

clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64'))

clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))
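The eleven near-identical lines above can also be written as a loop over the columns. The following is a minimal sketch on a small made-up data frame (the values and the choice of three columns are assumptions for illustration, not the Add Health data), and it checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the clustering data: three columns on different scales
rng = np.random.default_rng(0)
cluster = pd.DataFrame({
    "ALCEVR1": rng.integers(0, 2, size=200),   # binary 0/1 variable
    "DEP1": rng.normal(10, 5, size=200),       # made-up scale score
    "ESTEEM1": rng.normal(40, 8, size=200),    # made-up scale score
})

# Standardize every column in one loop instead of one line per variable
clustervar = cluster.copy()
for col in clustervar.columns:
    clustervar[col] = preprocessing.scale(clustervar[col].astype("float64"))

# preprocessing.scale divides by the population standard deviation (ddof=0)
col_means = clustervar.mean()
col_sds = clustervar.std(ddof=0)
print(col_means.round(6))
print(col_sds.round(6))
```

After this step no variable dominates the Euclidean distance simply because it was measured in larger units.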

We then use the train_test_split function to randomly split the data into a training set and a test set. Before running the cluster analysis we need to choose a value for k. Because we do not know the right number of clusters in advance, we run the analysis for a range of values of k using the following code. The for k in clusters: line tells Python to run the cluster analysis code for each value of k in the clusters object.

from scipy.spatial.distance import cdist

clusters=range(1,10)

meandist=[]
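The loop body itself is not shown in the excerpt above. A minimal sketch of what it might look like follows, run on synthetic data from make_blobs rather than the real clustering variables. The test_size, random_state and n_init values are assumptions chosen for reproducibility.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11 standardized clustering variables (assumption)
X, _ = make_blobs(n_samples=600, centers=3, n_features=11, random_state=123)
clus_train, clus_test = train_test_split(X, test_size=0.3, random_state=123)

clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=123)
    model.fit(clus_train)
    # Distance from every observation to every centroid, shape (n_obs, k)
    dists = cdist(clus_train, model.cluster_centers_, "euclidean")
    # Keep each observation's distance to its nearest centroid, then average
    meandist.append(np.min(dists, axis=1).mean())

for k, d in zip(clusters, meandist):
    print(k, round(float(d), 3))
```

Each entry of meandist is the average distance between the observations and their closest centroid for one value of k, which is the quantity the elbow curve plots.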

After we have the average distance calculated for each of the 1 to 9 cluster solutions, we can plot the elbow curve using the matplotlib plot function that we imported as plt.

This plot shows the decrease in the average minimum distance of the observations from the cluster centroids for each of the cluster solutions. We can see that the average distance decreases as the number of clusters increases. Since the goal of cluster analysis is to minimize the distance between observations and their assigned clusters, we want to choose the fewest number of clusters that provides a low average distance. What we're looking for in this plot is a bend in the elbow that shows where the average distance might be levelling off, such that adding more clusters doesn't decrease the average distance by much.

Since we can see a bend at 3, we rerun the cluster analysis, this time asking for 3 clusters. We create an object, model3, which will contain the results from the cluster analysis with 3 clusters: KMeans with, in parentheses, n_clusters=3. We then fit the model and create an object called clusassign that holds the cluster assignments based on the 3 cluster model.

To visualize the clusters we use canonical discriminant analysis, a data reduction technique that creates a smaller number of variables that are linear combinations of the clustering variables. The new variables, called canonical variables, are ordered by the proportion of variance in the clustering variables that each one accounts for. In Python, we use the PCA function from the sklearn.decomposition library for this step; strictly speaking, PCA performs principal component analysis, a closely related data reduction technique, rather than canonical discriminant analysis itself. We then plot the first two canonical variables by the cluster assignment values from the 3 cluster solution in a scatter plot using matplotlib.
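As a rough sketch of this step, the code below fits a 3 cluster model and reduces the data to two components with PCA. It runs on synthetic data, and the names pca_2 and plot_columns, along with the n_init and random_state values, are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized training data (assumption)
X, _ = make_blobs(n_samples=300, centers=3, n_features=11, random_state=0)

# Fit the 3 cluster solution and get a cluster label for each observation
model3 = KMeans(n_clusters=3, n_init=10, random_state=0)
model3.fit(X)
clusassign = model3.predict(X)

# Reduce the 11 variables to 2 components for plotting
pca_2 = PCA(n_components=2)
plot_columns = pca_2.fit_transform(X)

# Share of total variance captured by each component, largest first
explained = pca_2.explained_variance_ratio_
print(plot_columns.shape, explained.round(3))
```

The two columns of plot_columns can then be passed to plt.scatter as the x and y values, with c=clusassign to colour the points by cluster.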

Here is the scatter plot. What this shows is that two of the clusters are densely packed, meaning that the observations within these clusters are pretty highly correlated with each other and the within cluster variance is relatively low. The two clusters on the left of the plot appear to have a good deal of overlap, meaning that there is not good separation between them. On the other hand, the remaining cluster shows better separation, but its observations are more spread out, indicating less correlation among the observations and higher within cluster variance. This suggests that the two cluster solution might be better, so it would be especially important to evaluate the two cluster solution as well.

Next, we can take a look at the pattern of means on the clustering variables for each cluster to see whether they are distinct and meaningful. To do this, we have to link the cluster assignment variable back to its corresponding observation in the clus_train dataset that has the clustering variables.

Several steps are needed to merge the cluster assignments with the clustering variables so that we can examine the clustering variable means by cluster. First, create a unique identifier variable from the index of the cluster training data to merge with the cluster assignment variable. Then create a list that holds the new index variable and a list of cluster assignments, and combine the two lists into a dictionary. Convert the newlist dictionary to a dataframe and rename the cluster assignment column. Now do the same for the cluster assignment variable: create a unique identifier variable from the index of the cluster assignment dataframe, then merge the cluster assignment dataframe with the cluster training variable dataframe by the index variable. Finally, calculate the clustering variable means by cluster using groupby.
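The merge described above can be condensed in pandas. The sketch below uses a tiny made-up training frame and hypothetical labels (the values and the names newclus, merged_train and clustergrp are assumptions), and it shortens the list and dictionary steps into a single DataFrame construction.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with a non-contiguous index, as after a random split
clus_train = pd.DataFrame(
    {"ALCEVR1": [0.0, 1.0, 1.0, 0.0], "DEP1": [2.0, 9.0, 7.0, 3.0]},
    index=[10, 3, 7, 21],
)
# Hypothetical cluster labels, in the same row order as clus_train
labels = np.array([0, 1, 1, 0])

# Turn the index into an ordinary column named 'index' to merge on
clus_train = clus_train.reset_index()

# Pair each index value with its cluster label
newclus = pd.DataFrame({"index": clus_train["index"], "cluster": labels})

# Merge the cluster assignments back onto the clustering variables
merged_train = pd.merge(clus_train, newclus, on="index")

# Mean of each clustering variable within each cluster
clustergrp = merged_train.drop(columns="index").groupby("cluster").mean()
print(clustergrp)
```

Merging on an explicit index column, rather than relying on row position, keeps each label attached to the right observation even if the rows are later reordered.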

The means on the clustering variables showed that compared to the other clusters, adolescents in the first cluster, cluster 0, had the highest likelihood of having used alcohol, but otherwise tended to fall somewhere in between the other two clusters on the other variables. On the other hand, the second cluster, cluster 1, clearly includes the most troubled adolescents. Adolescents in this cluster had the highest likelihood of having used alcohol, a very high likelihood of having used marijuana, more alcohol problems, and more engagement in deviant and violent behaviors compared to the other two clusters. They also had higher levels of depression, lower self-esteem, and the lowest levels of school connectedness, parental presence, involvement of parent in activities, and family connectedness. The third cluster, cluster 2, appears to include the least troubled adolescents. Compared to adolescents in the other clusters, they were least likely to have used alcohol and marijuana, and had the lowest number of alcohol problems and deviant and violent behavior. They also had greater school and family connectedness.

We validate the clusters in the training data by examining cluster differences in GPA using ANOVA. We first have to merge GPA with the clustering variables and cluster assignment data. Then we split the GPA data into training and test sets. We then print the mean and standard deviation of GPA for each cluster using the groupby function.
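The validation step above might be sketched as follows. This version uses scipy.stats.f_oneway for the one-way ANOVA, which is a swapped-in alternative and not necessarily the routine used in the full analysis. The data are simulated, and the column name GPA1 and the group means are assumptions. The Tukey post hoc comparison is not covered here.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated merged data: a cluster label and a GPA value per adolescent (assumption)
rng = np.random.default_rng(1)
sub1 = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], 50),
    "GPA1": np.concatenate([
        rng.normal(2.8, 0.5, 50),
        rng.normal(2.4, 0.5, 50),
        rng.normal(3.0, 0.5, 50),
    ]),
})

# Mean and standard deviation of GPA within each cluster
gpa_mean = sub1.groupby("cluster")["GPA1"].mean()
gpa_sd = sub1.groupby("cluster")["GPA1"].std()
print(gpa_mean.round(2))
print(gpa_sd.round(2))

# One-way ANOVA: does mean GPA differ across the clusters?
groups = [g["GPA1"].to_numpy() for _, g in sub1.groupby("cluster")]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```

A small p-value indicates only that at least one cluster mean differs from the others; a post hoc test is still needed to say which pairs differ.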

The analysis of variance summary table indicates that the clusters differed significantly on GPA. When we examine the means, we find that, not surprisingly, adolescents in cluster 1, the most troubled group, had the lowest GPA, and adolescents in cluster 2, the least troubled group, had the highest GPA. The Tukey test shows that the clusters differed significantly in mean GPA, although the difference between cluster 0 and cluster 2 was smaller.

The full code for cluster analysis is as follows:

from pandas import Series, DataFrame

import pandas as pd

import numpy as np

import matplotlib.pylab as plt

from sklearn.model_selection import train_test_split

from sklearn import preprocessing

from sklearn.cluster import KMeans

data = pd.read_csv("tree_addhealth.csv")

data.columns = map(str.upper, data.columns)

data_clean = data.dropna()

# subset clustering variables

cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',

'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]

cluster.describe()

# standardize clustering variables to have mean=0 and sd=1

clustervar=cluster.copy()