Assignment 4 - K-Means Cluster Analysis
For this assignment, I use the same set of variables as in my previous assignment and run a k-means cluster analysis on them. The following variables are included, which may represent characteristics associated with "ALCEVR1" (ever consumed alcohol): 'AGE', 'MAREVER1', 'COCEVER1', 'INHEVER1', 'CIGAVAIL', 'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'FAMCONCT', 'GPA1', 'EXPEL1', 'SCHCONN1'. The definition of each variable is provided in the course. The set combines numerical and categorical variables.

I split the original sample into "train" and "test" subsamples, so that the model is fit on the training subsample and evaluated on the test subsample. The split ratio follows the course: 70% for training and 30% for testing. Before generating these subsamples, observations with missing values are dropped and all clustering variables are standardized to a mean of zero and a standard deviation of one. This makes the variables comparable and unit-free, which matters because outliers and variables with extremely high or low values would otherwise dominate the distances to the centroids.

K-means cluster analyses are then run on the training data for k = 1 to 15 clusters, using Euclidean distance. The first graph shows the average distance from the observations to their cluster centroid, so the elbow method can be used to get a sense of the appropriate number of clusters. The graph does not provide a clear-cut answer; it suggests that a 2- or 6-cluster solution might be worth interpreting. A minimal sketch of this step is shown below.
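The following sketch illustrates the preprocessing and elbow-plot step under a few assumptions: the file name 'addhealth_data.csv' and the random seeds are placeholders, and scikit-learn's standard tools are used for scaling, splitting, and clustering.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

CLUSTER_VARS = ['AGE', 'MAREVER1', 'COCEVER1', 'INHEVER1', 'CIGAVAIL', 'DEP1',
                'ESTEEM1', 'VIOL1', 'PASSIST', 'FAMCONCT', 'GPA1', 'EXPEL1', 'SCHCONN1']

data = pd.read_csv('addhealth_data.csv')   # assumed file name
data_clean = data.dropna(subset=CLUSTER_VARS + ['ALCEVR1'])

# Standardize each clustering variable to mean 0 and standard deviation 1
cluster_data = pd.DataFrame(preprocessing.scale(data_clean[CLUSTER_VARS].astype('float64')),
                            columns=CLUSTER_VARS, index=data_clean.index)

# 70% training / 30% test split
clus_train, clus_test = train_test_split(cluster_data, test_size=0.3, random_state=123)

# Run k-means for k = 1..15 and record the mean Euclidean distance to the nearest centroid
clusters = range(1, 16)
mean_dist = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=123).fit(clus_train)
    dist_to_centroids = cdist(clus_train, model.cluster_centers_, 'euclidean')
    mean_dist.append(dist_to_centroids.min(axis=1).mean())

# Elbow plot: average distance to the nearest centroid versus k
plt.plot(clusters, mean_dist, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Average distance to nearest centroid')
plt.title('Elbow method for selecting k')
plt.show()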
In the next step, I interpret and compare solutions at different clustering levels, as indicated below. To reduce the 13 clustering variables down to a few variables that account for most of their variance, I use canonical discriminant analysis and plot a scatterplot of the first two canonical variables by cluster (Figure 2, shown below). In the 2-cluster case, the two clusters overlap considerably, although the within-cluster variance of the purple cluster is much higher. In the middle sub-graph, showing the 3-cluster solution, the blue cluster has a greater spread than the other two and accounts for most of the scattered points that belonged to the purple cluster in the first sub-figure. The last sub-figure, showing the 6-cluster case, indicates a great deal of overlap among the clusters, so the best solution probably has fewer than 6 clusters. A sketch of how such a plot can be produced follows.
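Canonical discriminant analysis is not a single built-in scikit-learn routine, so the sketch below substitutes a two-component PCA of the standardized clustering variables and colors the points by k-means cluster assignment; it produces the same kind of low-dimensional cluster scatterplot. It reuses clus_train, KMeans, and plt from the sketch above, and the choice of k values and seed is an assumption.

from sklearn.decomposition import PCA

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for k, ax in zip([2, 3, 6], axes):
    # Fit k-means for this k and project the data onto its first two principal components
    labels = KMeans(n_clusters=k, random_state=123).fit_predict(clus_train)
    scores = PCA(n_components=2).fit_transform(clus_train)
    ax.scatter(scores[:, 0], scores[:, 1], c=labels, s=10)
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')
    ax.set_title(f'{k}-cluster solution')
plt.tight_layout()
plt.show()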
I take the 3-cluster solution for further comparison and interpretation.
The means of the clustering variables show that, compared to the other clusters, adolescents in the third cluster (cluster 2) are the most troubled. On average, they are more likely to have used marijuana, cocaine, and inhalants, show more signs of depression and a greater history of violence, and are more likely to have been expelled from school, while having the lowest GPA, self-esteem, family connectedness, and school connectedness. The first cluster (cluster 0), on the other hand, appears to be the least troubled on most of the features: its members are the least likely to have used marijuana, cocaine, or inhalants, show the lowest levels of depression and violence, and are the least likely to have been expelled from school, while showing the highest self-esteem, family connectedness, GPA, and school connectedness. A sketch of how these per-cluster means can be computed is given below.
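A short sketch of the per-cluster means for the 3-cluster solution, reusing clus_train and KMeans from the sketches above (the seed is again a placeholder):

# Fit the 3-cluster solution and attach the cluster labels to the training data
model3 = KMeans(n_clusters=3, random_state=123).fit(clus_train)
clus_train_labeled = clus_train.assign(cluster=model3.labels_)

# Mean of each standardized clustering variable within each cluster
print(clus_train_labeled.groupby('cluster').mean())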
To see how the clusters differ in terms of ALCEVR1, I run a simple OLS regression with ALCEVR1 as the response (dependent) variable and cluster membership as the predictor. The analysis of variance summary table indicates that the regression is statistically significant and that the clusters differ significantly on ALCEVR1. Comparing the means (table 3, below), cluster 2 (the most troubled group) has the highest mean of ALCEVR1, while cluster 1 has the lowest. A sketch of this step follows.
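The sketch below shows one way to run this step with statsmodels, reusing data_clean, clus_train, and model3 from the sketches above; it assumes ALCEVR1 is still available in data_clean and can be matched to the training rows by index. A Tukey HSD post hoc test is added to compare the clusters pairwise.

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Attach ALCEVR1 and the 3-cluster labels for the training observations
alc_train = data_clean.loc[clus_train.index, ['ALCEVR1']].copy()
alc_train['cluster'] = model3.labels_

# OLS with cluster membership as a categorical predictor (one-way ANOVA)
anova_model = smf.ols('ALCEVR1 ~ C(cluster)', data=alc_train).fit()
print(anova_model.summary())

# Means and standard deviations of ALCEVR1 by cluster (table 3)
print(alc_train.groupby('cluster')['ALCEVR1'].agg(['mean', 'std']))

# Tukey HSD post hoc test for pairwise cluster differences
tukey = multi.MultiComparison(alc_train['ALCEVR1'], alc_train['cluster']).tukeyhsd()
print(tukey.summary())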
My code is available here: https://pastecode.xyz/view/4c02e04d
Assignment 3 - Lasso regression and cross-validation
https://pastecode.xyz/view/f57783d9