ggype123 · 1 year ago
K-Means Cluster Analysis for Identifying Adolescent Subgroups
Introduction
A k-means cluster analysis was conducted to identify distinct subgroups of adolescents based on their responses to 11 variables associated with characteristics that could impact school achievement. The goal was to group adolescents into clusters with similar patterns of responses, providing insights into underlying subgroups.
Methodology
The analysis used the following 11 standardized clustering variables:
Binary Variables: Ever used alcohol, Ever used marijuana
Quantitative Variables:
Alcohol problems
Deviant behavior (vandalism, lying, stealing, etc.)
Violence
Depression
Self-esteem
Parental presence
Parental activities
Family connectedness
School connectedness
All variables were standardized to a mean of zero and a standard deviation of one.
The dataset was split into a training set (70% of observations, N = 3201) and a test set (30% of observations, N = 1701). K-means cluster analyses were performed on the training set for k = 1 to k = 9 clusters, using Euclidean distance. The proportion of variance accounted for by the clusters (R-squared) was plotted for each cluster solution to help determine the optimal number of clusters.
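The R-squared values can be derived from the k-means inertia (within-cluster sum of squares). A minimal sketch, assuming the standardized clustering variables are already in an array X_scaled (as in the code shown later in this post):
# Sketch: proportion of variance accounted for (R-squared) for k = 1 to 9
import numpy as np
from sklearn.cluster import KMeans
total_ss = ((X_scaled - X_scaled.mean(axis=0)) ** 2).sum()
r_squared = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
    r_squared.append(1 - km.inertia_ / total_ss)  # inertia_ is the within-cluster sum of squares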
Results
Figure 1. Elbow Curve of R-Squared Values for Different Cluster Solutions
The elbow curve suggested that 2, 4, and 8-cluster solutions were plausible. The 4-cluster solution was chosen for interpretation due to its balance between complexity and interpretability.
To further explore the clusters, a canonical discriminant analysis reduced the clustering variables to two canonical variables.
Figure 2. Scatterplot of the First Two Canonical Variables by Cluster
The scatterplot showed distinct clusters with varying densities and spreads. Clusters 1 and 4 were densely packed with low within-cluster variance, while Cluster 3 showed the highest within-cluster variance.
Cluster Profiles:
Cluster 1: Adolescents with moderate levels on most variables. They had low likelihoods of using alcohol or marijuana, moderate levels of depression, and self-esteem, but relatively low school connectedness, parental presence, parental involvement, and family connectedness.
Cluster 2: Higher levels of the clustering variables compared to Cluster 1, with a higher likelihood of alcohol and marijuana use. They had moderate values compared to Clusters 3 and 4.
Cluster 3: The most troubled group. They had the highest likelihood of using alcohol and marijuana, more alcohol problems, and higher engagement in deviant and violent behaviors. They also exhibited higher depression, lower self-esteem, and the lowest levels of school connectedness, parental presence, parental involvement, and family connectedness.
Cluster 4: The least troubled group. They had the lowest likelihood of using alcohol and marijuana, the fewest alcohol problems, and the lowest engagement in deviant and violent behaviors. They also exhibited the lowest levels of depression, and higher self-esteem, school connectedness, parental presence, parental involvement, and family connectedness.
External Validation:
To validate the clusters, an Analysis of Variance (ANOVA) tested the differences in GPA between the clusters, followed by Tukey's post hoc test.
Results indicated significant differences in GPA across clusters, F(3, 3197) = 82.28, p < .0001. Post hoc comparisons showed significant differences between all clusters except between Clusters 1 and 2.
Cluster 4: Highest GPA (mean = 2.99, SD = 0.71)
Cluster 3: Lowest GPA (mean = 2.36, SD = 0.78)
Syntax and Output
Below is the Python code used to perform the k-means clustering and the resulting output:
# Import necessary libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
# Assume data is in a DataFrame 'df'
X = df[['ever_used_alcohol', 'ever_used_marijuana', 'alcohol_problems', 'deviant_behavior',
        'violence', 'depression', 'self_esteem', 'parental_presence', 'parental_activities',
        'family_connectedness', 'school_connectedness']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the elbow method
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Curve for K-Means Clustering')
plt.show()

# Perform k-means clustering with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to the original data
df['Cluster'] = clusters

# Canonical discriminant analysis using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Scatter plot of the first two principal components
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=clusters, palette='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Scatter Plot of the First Two Canonical Variables by Cluster')
plt.show()

# ANOVA to validate clusters with GPA
import scipy.stats as stats
gpa_anova = stats.f_oneway(df[df['Cluster'] == 0]['GPA'],
                           df[df['Cluster'] == 1]['GPA'],
                           df[df['Cluster'] == 2]['GPA'],
                           df[df['Cluster'] == 3]['GPA'])
print(f'ANOVA F-statistic: {gpa_anova.statistic:.2f}, p-value: {gpa_anova.pvalue:.5f}')
Output:
ANOVA F-statistic: 82.28, p-value: 0.00000
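The output above covers only the overall F test. A minimal sketch of the Tukey post hoc comparisons mentioned in the results, assuming the same DataFrame df with 'GPA' and 'Cluster' columns as above:
# Sketch: Tukey HSD post hoc comparisons of GPA across the four clusters
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(endog=df['GPA'], groups=df['Cluster'], alpha=0.05)
print(tukey.summary())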
Interpretation
The k-means cluster analysis identified four distinct subgroups of adolescents based on their responses to the clustering variables. These clusters varied in terms of substance use, behavioral issues, and levels of parental and school connectedness. Cluster 3 was characterized by the most problematic behaviors, while Cluster 4 represented the least troubled group. The ANOVA results confirmed significant differences in GPA between these clusters, validating the cluster solution.
This analysis provides insights into the different profiles of adolescents and their potential impact on school achievement. Such information could be valuable for targeted interventions aimed at improving the school experience for various subgroups.
datenanalystxyz · 1 year ago
Machine Learning for Data Analysis, Week 4
K-Means Clustering
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on the similarity of their responses on 9 variables that represent characteristics that could have an impact on paranoid personality disorder. The clustering variables included 8 binary variables and 1 quantitative variable.
Quantitative variable: 'AGE'
Binary variables: 'TAB12MDX' (nicotine dependence in the last 12 months), 'MAJORDEPLIFE' (major depression), 'MAJORDEPP12' (major depression in the last 12 months), 'S4AQ4A16' (attempted suicide), 'S3BQ1A6' (ever used cocaine or crack), 'OBCOMDX2' (obsessive-compulsive personality disorder), 'SCHIZDX2' (schizoid personality disorder), 'HISTDX2' (histrionic personality disorder)
Response variable: 'PARADX2' (paranoid personality disorder)
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations .
A series of k-means cluster analyses were conducted on the training data specifying k=1-20 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the 20 cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. For me this was only a test.
The elbow curve was inconclusive, suggesting that the 2, 8 and 12-cluster solutions might be interpreted. The results below are for an interpretation of the 6-cluster solution.
Canonical discriminant analysis was used to reduce the 9 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (figure shown below) indicated that the observations in the clusters overlap considerably with one another.
It may be better to use 3 clusters.
I changed the number of clusters to 3.
The means on the clustering variables showed that, compared to the other clusters, the mean for paranoid personality disorder was higher in cluster 2.
Cluster 2 also had a relatively high likelihood of histrionic, obsessive-compulsive, and schizoid personality disorders.
Attempted suicide, nicotine dependence, and cocaine use levels were also higher in cluster 2 than in the other clusters.
Cluster 0 had the lowest level of paranoid personality disorder.
The number of observations in each cluster:
Cluster 0: 4866
Cluster 1: 4318
Cluster 2: 385
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on paranoid personality disorder. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on paranoid personality disorder.
The Tukey post hoc comparisons showed significant differences between clusters on paranoid personality disorder.
Cluster 2 had the highest paranoid level (mean=0.563636, sd=0.496579), and cluster 0 had the lowest (mean=0.040, sd=0.19).
My code is below:
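The clustering code in the original post was attached as image screenshots. A minimal sketch of the analysis described above, assuming the listed NESARC variables are in a DataFrame named data, might look like this:
# Sketch: standardize, split, cluster (k=3), and plot the first two components
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

clustervars = ['AGE', 'TAB12MDX', 'MAJORDEPLIFE', 'MAJORDEPP12', 'S4AQ4A16',
               'S3BQ1A6', 'OBCOMDX2', 'SCHIZDX2', 'HISTDX2']
X = scale(data[clustervars].dropna().astype('float64'))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=123)

model3 = KMeans(n_clusters=3, random_state=123).fit(X_train)
coords = PCA(n_components=2).fit_transform(X_train)
plt.scatter(coords[:, 0], coords[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.show()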
######################
import numpy as np
import matplotlib.pyplot as plt

# Define the functions
def f1(x):
    return 2*x + 3

def f2(x):
    return 3*x + 2

def f3(x):
    return 4*x + 5

def f4(x):
    return 5*x + 4

# Define the helper that computes the new slope and y-intercept
def adjust_function(f, x_offset):
    slope = (f(x_offset + 1) - f(x_offset)) / 1   # compute the slope
    intercept = f(x_offset) - slope * x_offset    # compute the y-intercept
    return lambda x: slope * x + intercept        # new function with adjusted slope and intercept

# Choose the offset point
x_offset = 2

# Adjust the functions
f1_adjusted = adjust_function(f1, x_offset)
f2_adjusted = adjust_function(f2, x_offset)
f3_adjusted = adjust_function(f3, x_offset)
f4_adjusted = adjust_function(f4, x_offset)

# Plot the functions
x_values = np.linspace(0, 10, 100)
plt.plot(x_values, f1_adjusted(x_values), label='f1 adjusted')
plt.plot(x_values, f2_adjusted(x_values), label='f2 adjusted')
plt.plot(x_values, f3_adjusted(x_values), label='f3 adjusted')
plt.plot(x_values, f4_adjusted(x_values), label='f4 adjusted')

plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Adjusted linear functions')
plt.legend()
plt.grid(True)
plt.show()
#########
import numpy as np
import matplotlib.pyplot as plt

# Define the functions
def f1(x):
    return 2*x + 3

def f2(x):
    return 3*x + 2

def f3(x):
    # function 3 changes its definition from x = 7 onwards
    return np.where(x < 7, 4*x + 5, 2*x - 9)

def f4(x):
    return 5*x + 4

# Define the helper that computes the new slope and y-intercept
def adjust_function(f, x_offset):
    slope = (f(x_offset + 1) - f(x_offset)) / 1   # compute the slope
    intercept = f(x_offset) - slope * x_offset    # compute the y-intercept
    return lambda x: slope * x + intercept        # new function with adjusted slope and intercept

# Choose the offset point
x_offset = 1

# Adjust the functions
f1_adjusted = adjust_function(f1, x_offset)
f2_adjusted = adjust_function(f2, x_offset)
f3_adjusted = f3   # function 3 is not adjusted, since it is meant to stay independent
f4_adjusted = adjust_function(f4, x_offset)

# Plot the functions
x_values = np.arange(1, 26)   # x axis from 1 to 25
plt.plot(x_values, f1_adjusted(x_values), label='f1 adjusted')
plt.plot(x_values, f2_adjusted(x_values), label='f2 adjusted')
plt.plot(x_values, f3_adjusted(x_values), label='f3 adjusted')
plt.plot(x_values, f4_adjusted(x_values), label='f4 adjusted')

plt.xticks(np.arange(1, 26, 1))   # set the x-axis ticks at steps of 1
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Adjusted linear functions')
plt.legend()
plt.grid(True)
plt.show()
#########
import numpy as np
import matplotlib.pyplot as plt

# Define the functions
def f1(x):
    return 2*x + 3

def f2(x):
    return 3*x + 2

def f3(x):
    return 4*x + 5

def f4(x):
    return 5*x + 4

# Compute the starting point for all functions
x_start = 1
y_start_f1 = f1(x_start)
y_start_f2 = f2(x_start)
y_start_f3 = f3(x_start)
y_start_f4 = f4(x_start)

# Define the shifted functions
def f1_shifted(x):
    return f1(x) - y_start_f1

def f2_shifted(x):
    return f2(x) - y_start_f2

def f3_shifted(x):
    return f3(x) - y_start_f3

def f4_shifted(x):
    return f4(x) - y_start_f4

# Plot the shifted functions
x_values = np.linspace(1, 10, 100)   # assume a range of 1 to 10 for x
plt.plot(x_values, f1_shifted(x_values), label='f1 shifted')
plt.plot(x_values, f2_shifted(x_values), label='f2 shifted')
plt.plot(x_values, f3_shifted(x_values), label='f3 shifted')
plt.plot(x_values, f4_shifted(x_values), label='f4 shifted')

plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Shifted linear functions')
plt.legend()
plt.grid(True)
plt.show()
infotechall · 4 years ago
K-means clustering analysis
In this experimental analysis, the wine dataset was used to apply a k-means clustering analysis. The analysis was performed in order to identify subgroups of wines based on the similarity of values on 12 variables. The quantitative clustering variables Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-flavonoid phenols, Proanthocyanins, Color intensity, Hue, and Proline were standardized to have a mean of 0 and a standard deviation of 1.
The dataset was randomly split into a training set that included 70% of the observations (N=124) and a test set that included 30% of the observations (N=55). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1  Elbow curve of r-square values for the nine cluster solutions
Canonical discriminant analysis was used to reduce the 12 clustering variables down to a few variables that accounted for most of the variance in the clustering variables.
Figure 2  Plot of the first two canonical variables for the clustering variables by cluster
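The code below approximates the canonical discriminant plot with PCA. A closer analogue, fit on the k-means labels produced by that code (clus_train and model3), could use linear discriminant analysis; a sketch:
# Sketch: canonical variables via LDA on the cluster labels (alternative to the PCA projection below)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
canonical = LinearDiscriminantAnalysis(n_components=2).fit_transform(clus_train, model3.labels_)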
The following is the Python code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
import sklearn.metrics

# Data Management
os.chdir("E:\Python-works\dataset")

data = pd.read_csv('./wine-clustering.csv')

# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

data_clean = data.dropna()

# subset clustering variables
cluster_data = data_clean[['ALCOHOL','MALIC_ACID','ASH','ASH_ALCANITY','MAGNESIUM',
                           'TOTAL_PHENOLS','FLAVANOIDS','NONFLAVANOID_PHENOLS','PROANTHOCYANINS',
                           'COLOR_INTENSITY','HUE','PROLINE']]
cluster_data.describe()

# standardize clustering variables to have mean=0 and sd=1
clustervar = cluster_data.copy()
clustervar['ALCOHOL'] = preprocessing.scale(clustervar['ALCOHOL'].astype('float64'))
clustervar['MALIC_ACID'] = preprocessing.scale(clustervar['MALIC_ACID'].astype('float64'))
clustervar['ASH'] = preprocessing.scale(clustervar['ASH'].astype('float64'))
clustervar['ASH_ALCANITY'] = preprocessing.scale(clustervar['ASH_ALCANITY'].astype('float64'))
clustervar['MAGNESIUM'] = preprocessing.scale(clustervar['MAGNESIUM'].astype('float64'))
clustervar['TOTAL_PHENOLS'] = preprocessing.scale(clustervar['TOTAL_PHENOLS'].astype('float64'))
clustervar['FLAVANOIDS'] = preprocessing.scale(clustervar['FLAVANOIDS'].astype('float64'))
clustervar['NONFLAVANOID_PHENOLS'] = preprocessing.scale(clustervar['NONFLAVANOID_PHENOLS'].astype('float64'))
clustervar['PROANTHOCYANINS'] = preprocessing.scale(clustervar['PROANTHOCYANINS'].astype('float64'))
clustervar['COLOR_INTENSITY'] = preprocessing.scale(clustervar['COLOR_INTENSITY'].astype('float64'))
clustervar['HUE'] = preprocessing.scale(clustervar['HUE'].astype('float64'))
clustervar['PROLINE'] = preprocessing.scale(clustervar['PROLINE'].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

# Plot average distance from observations to the cluster centroid
# to use the Elbow Method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)

# create a list that has the new index variable
cluslist = list(clus_train['index'])

# create a list of cluster assignments
labels = list(model3.labels_)

# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist

# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
newclus

# rename the cluster assignment column
newclus.columns = ['cluster']

# create a unique identifier variable from the index of the cluster assignment
# dataframe to merge with the cluster training data
newclus.reset_index(level=0, inplace=True)

# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)

# cluster frequencies
merged_train.cluster.value_counts()

# validate clusters in training data by examining cluster differences in ODR using ANOVA
# first have to merge ODR with clustering variables and cluster assignment data
gpa_data = data_clean['ODR']

# split ODR data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['ODR', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='ODR ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for ODR by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for ODR by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['ODR'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
rahuljoy-blog2 · 5 years ago
Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 5 variables that represent characteristics that could have an impact on school achievement. Clustering variables included two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana, as well as quantitative variables measuring alcohol problems, a scale measuring engaging in deviant behaviors (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school), and scales measuring violence. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 2, 3 and 4-cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.
Canonical discriminant analysis was used to reduce the 5 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster is shown in Figure 2 below:
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster
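The original post does not include its code. A minimal sketch of the elbow step described above, assuming the standardized clustering variables are already in a DataFrame named clus_train, might look like this:
# Sketch: average distance to the nearest centroid for k = 1 to 9 (elbow method)
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k).fit(clus_train)
    meandist.append(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1).mean())

plt.plot(clusters, meandist, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.show()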
apoorva-week4 · 2 years ago
Week 4: Peer-graded Assignment: Running a k-means Cluster Analysis
This assignment is intended for Coursera course "Machine Learning for Data Analysis by Wesleyan University”.
It is for "Week 4: Peer-graded Assignment: Running a k-means Cluster Analysis".
I am working on k-means Cluster Analysis in Python.
1. Syntax used to run k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of real machine parameters based on their similarity of responses on 19 variables that represent characteristics that could have an impact on product yield loss. Clustering variables included only quantitative variables measuring different machine parameters. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations (N=116) and a test set that included 30% of the observations (N=50). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
2. Code used to run k-means Cluster Analysis
3. Corresponding Output
Figure 1. Elbow curve of r-square values for the nine cluster solutions
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
4. Interpretation
For Figure 1: The elbow curve was inconclusive, suggesting that the 2, 4 and 8-cluster solutions might be interpreted. All 3 were tested, yielding [F-statistic and Prob (F-statistic)] of: [0.5298,0.469]; [6.242,0.000725] and [3.73,0.00156] accordingly. The results below are for an interpretation of the 4-cluster solution (highest F-statistic and lowest Prob).
Canonical discriminant analyses was used to reduce the 19 clustering variable down a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in cluster 1 was densely packed with relatively low within cluster variance, and did not overlap very much with the other clusters. Cluster 2 was generally distinct, but the observations had greater spread suggesting higher within cluster variance. Observations in cluster 3 and 4 were spread out more than the other clusters, showing high within cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
For Figure 2: In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on product failure rates (BINS_SUM).
A Tukey test was used for post hoc comparisons between the clusters. Results indicated some significant differences between the clusters on BINS_SUM (F(3, 85) = 6.242, p<.0001). The Tukey post hoc comparisons showed significant differences on BINS_SUM between clusters 1 and 3, while the other cluster pairs were not significantly different from each other. Samples in cluster 4 had the lowest BINS_SUM (mean=60.28, sd=10.89), and cluster 1 had the highest BINS_SUM (mean=76.35, sd=13.08).
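A sketch of how the three F statistics quoted above could be obtained, assuming the standardized machine parameters are in clus_train and the corresponding BINS_SUM values (row-aligned with clus_train) are in bins_train:
# Sketch: compare candidate cluster solutions (k = 2, 4, 8) by the ANOVA F statistic for BINS_SUM
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans

for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, random_state=123).fit_predict(clus_train)
    sub = pd.DataFrame({'BINS_SUM': np.asarray(bins_train), 'cluster': labels})
    fit = smf.ols('BINS_SUM ~ C(cluster)', data=sub).fit()
    print(k, round(fit.fvalue, 3), fit.f_pvalue)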
anuja05 · 3 years ago
A k-means cluster analysis was conducted to identify underlying subgroups of houses based on their similarity of features on 18 variables that represent characteristics of the houses that could have an impact on the grades provided to the house. Clustering variables included price of the house, bedrooms, bathrooms, measurement of the house in square foot, measurement of the lot in square foot, floors, whether a house has a view to waterfront, whether a house has been viewed or not, condition of a house on a scale of 1 to 5, overall grade given to the housing unit based on grading system on a scale of 1 to 11, square footage of house apart from basement, square footage of the basement of the house, years since house is built, years since house is renovated, zip code, latitude, longitude, living room area in 2015, lot size area in 2015. All clustering variables were standardized to have a mean of zero and a standard deviation of one. Data were randomly split into a training set that included 70% of the observations (N= 1145) and a test set that included 30% of the observations (N= 491).  A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The average distance from observations was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.  
Elbow curve: from the figure below, 3- or 4-cluster solutions might be interpreted.
Canonical discriminant analysis was used to reduce the 18 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in clusters 2 and 3 are densely packed with relatively low within-cluster variance and did not overlap very much with the other clusters. Cluster 1 observations had greater spread, suggesting higher within-cluster variance.
Here the size of cluster 4 is too small, and it also has higher within-cluster variation.
The results of both plots above suggest that the best cluster solution will be 3 clusters.
The means on the clustering variables showed that, compared to the other clusters, houses in cluster 1 had moderate levels on the clustering variables. They had moderate house prices with moderate numbers of bedrooms. The size of the living house is also moderate compared to other clusters. Houses in cluster 2 have low prices compared to other clusters. Even their number of bedrooms and bathrooms, area under living house is low. Exception is the number of years since the house was built is moderate. Cluster 3 can be classified as the expensive and high graded houses as they have the highest price compared to other 2 clusters. Their number of bedrooms and bathrooms, also the living area and the area under the lot is highest. They have highest number of floors along with maximum area in the basement.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on the grade assigned to these houses. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on Grades (p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on Grades, with the exception that clusters 1 and 2 were not significantly different from each other. Houses in cluster 3 had the highest Grades (mean=1.98, sd=0.83), and cluster 2 had the lowest Grades (mean=1.21, sd=0.44).
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
"""
Data Management
"""
data = pd.read_csv('kc_house_data.csv')
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# subset clustering variables
cluster=data[['PRICE','BEDROOMS','BATHROOMS','SQFT_LIVING','SQFT_LOT','FLOORS','WATERFRONT',
'VIEW','CONDITION','SQFT_ABOVE','SQFT_BASEMENT','YR_BUILT','YR_RENOVATED','ZIPCODE',
'LAT','LONG','SQFT_LIVING15','SQFT_LOT15']]
cluster.describe()
# standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
from sklearn import preprocessing
clustervar['PRICE']=preprocessing.scale(clustervar['PRICE'].astype('float64'))
clustervar['BEDROOMS']=preprocessing.scale(clustervar['BEDROOMS'].astype('float64'))
clustervar['BATHROOMS']=preprocessing.scale(clustervar['BATHROOMS'].astype('float64'))
clustervar['SQFT_LIVING']=preprocessing.scale(clustervar['SQFT_LIVING'].astype('float64'))
clustervar['SQFT_LOT']=preprocessing.scale(clustervar['SQFT_LOT'].astype('float64'))
clustervar['FLOORS']=preprocessing.scale(clustervar['FLOORS'].astype('float64'))
clustervar['WATERFRONT']=preprocessing.scale(clustervar['WATERFRONT'].astype('float64'))
clustervar['VIEW']=preprocessing.scale(clustervar['VIEW'].astype('float64'))
clustervar['CONDITION']=preprocessing.scale(clustervar['CONDITION'].astype('float64'))
clustervar['SQFT_ABOVE']=preprocessing.scale(clustervar['SQFT_ABOVE'].astype('float64'))
clustervar['SQFT_BASEMENT']=preprocessing.scale(clustervar['SQFT_BASEMENT'].astype('float64'))
clustervar['YR_BUILT']=preprocessing.scale(clustervar['YR_BUILT'].astype('float64'))
clustervar['YR_RENOVATED']=preprocessing.scale(clustervar['YR_RENOVATED'].astype('float64'))
clustervar['ZIPCODE']=preprocessing.scale(clustervar['ZIPCODE'].astype('float64'))
clustervar['LAT']=preprocessing.scale(clustervar['LAT'].astype('float64'))
clustervar['LONG']=preprocessing.scale(clustervar['LONG'].astype('float64'))
clustervar['SQFT_LIVING15']=preprocessing.scale(clustervar['SQFT_LIVING15'].astype('float64'))
clustervar['SQFT_LOT15']=preprocessing.scale(clustervar['SQFT_LOT15'].astype('float64'))
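# Note: a more compact equivalent of the column-by-column scaling above would be
# (a sketch, not part of the original code; it assumes the same 'cluster' DataFrame):
# clustervar = cluster.copy()
# clustervar[clustervar.columns] = preprocessing.scale(clustervar.astype('float64'))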
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters                                                           
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]
for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
    / clus_train.shape[0])
"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
# Interpret 3 cluster solution
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist=list(clus_train['index'])
# create a list of cluster assignments
labels=list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']
# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()
"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in GRADE using ANOVA
# first have to merge GRADE with clustering variables and cluster assignment data
GRADE_data=data['GRADE']
# split GRADE data into train and test sets
GRADE_train, GRADE_test = train_test_split(GRADE_data, test_size=.3, random_state=123)
GRADE_train1=pd.DataFrame(GRADE_train)
GRADE_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(GRADE_train1, merged_train, on='index')
sub1 = merged_train_all[['GRADE', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
GRADEmod = smf.ols(formula='GRADE ~ C(cluster)', data=sub1).fit()
print (GRADEmod.summary())
print ('means for GRADE by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
print ('standard deviations for GRADE by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
mc1 = multi.MultiComparison(sub1['GRADE'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
zeenahawamdeh · 3 years ago
Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of lifestyles/habits based on 14 variables. Clustering variables included quantitative variables measuring
INCOME PER PERSON, ALC CONSUMPTION, ARMED FORCES RATE, BREAST CANCER PER 100TH. CO2 EMISSIONS, FEMALE EMPLOYRATE, HIV RATE, INTERNET USE RATE, OIL PER PERSON, POLITY SCORE, ELECTRIC PER PERSON, SUICIDE PER 100TH, EMPLOY RATE, URBAN RATE
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=39) and a test set that included 30% of the observations (N=17). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 2 and 4 cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.
Canonical discriminant analysis was used to reduce the 14 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in clusters 1 through 4 were distinct and did not overlap with the other clusters. Observations in cluster 4 (deep red) were spread out more than the other clusters, showing high within-cluster variance.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
The means on the clustering variables showed that, compared to the other clusters, cluster 1 had the highest HIV rate, and cluster 4 had the highest income per person, urban rate, employment rate, oil per person, electricity per person, and internet use rate. All in all, cluster 4 had a better standard of living, and thus a higher life expectancy was evident, while a higher HIV rate points to a lower life expectancy.
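A sketch of how these cluster profiles can be computed, assuming a DataFrame merged_train that holds the standardized clustering variables together with a 'cluster' assignment column (as in the course template):
# Sketch: clustering-variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print(clustergrp)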
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on LIFE EXPECTANCY. A Tukey test was used for post hoc comparisons between the clusters. The Tukey post hoc comparisons showed significant differences between clusters on LIFE EXPECTANCY, with the exception that clusters 2 and 3 were not significantly different from each other. People in cluster 4 had the highest LIFE EXPECTANCY (mean=78.06, sd=4.6), and cluster 1 had the lowest LIFE EXPECTANCY (mean=68.25, sd=10.31).
shareyourdatawithme · 3 years ago
Machine Learning - Week 4
This is the fourth and final week of the Machine Learning course, and in this week’s assignment we are tasked with conducting a k-means cluster analysis.  The analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 7 variables. The variables represent characteristics that could have an impact on school achievement measured as GPA. Clustering variables included if marijuana was ever used, alcohol problems, gender, depression, self-esteem, if they have ever been expelled, and if they have ever used cocaine. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data was randomly split into a training set that included 70% of the observations (N=3,202) and a test set that included 30% of the observations (N=1,373). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. The figure below is of the elbow curve.
The elbow curve was inconclusive, suggesting that the 2, 3 and 5-cluster solutions might be interpreted. The rest of the analysis was conducted with an interpretation of the 3-cluster solution.
A scatterplot of the first two canonical variables by cluster (Figure below) indicated that the observations in all the clusters had no overlap.
The means on the clustering variables showed that, compared to the other clusters, adolescents in cluster 1 had moderate levels on the clustering variables. They had a relatively low likelihood of using alcohol or marijuana, but moderate levels of depression and self-esteem. With the exception of having a high likelihood of having used alcohol or marijuana, cluster 2 had higher levels on the clustering variables compared to cluster 1.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on grade point average (GPA). A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on GPA (F=113.7, p=1.98e-48, very small).
The Tukey post hoc comparisons showed significant differences between clusters on GPA. Cluster 3 had the lowest GPA at 2.3, followed by cluster 2 at 2.5. The results of the Tukey test indicated that the clusters were not all significantly different from one another.
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
import os
 """
Data Management
"""
 os.chdir("C:\TREES")
 data = pd.read_csv("tree_addhealth.csv")
  #upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
 # Data Management
 data_clean = data.dropna()
 # subset clustering variables
cluster=data_clean[['BIO_SEX','MAREVER1','ALCPROBS1','COCEVER1',
'DEP1','ESTEEM1','EXPEL1']]
cluster.describe()
 # standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
clustervar['BIO_SEX']=preprocessing.scale(clustervar['BIO_SEX'].astype('float64'))
clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))
clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))
clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))
clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))
clustervar['EXPEL1']=preprocessing.scale(clustervar['EXPEL1'].astype('float64'))
clustervar['COCEVER1']=preprocessing.scale(clustervar['COCEVER1'].astype('float64'))
  # split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
 # k-means cluster analysis for 1-9 clusters                                                          
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
 """
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
 plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
 # Interpret 3 cluster solution
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
# plot clusters
 from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
 """
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist=list(clus_train['index'])
# create a list of cluster assignments
labels=list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']
 # now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()
 """
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
 # FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
  # validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data
gpa_data=data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()
 import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
 gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print (gpamod.summary())
 print ('means for GPA by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
 print ('standard deviations for GPA by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
 mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
panospaterakis · 3 years ago
Machine Learning for Data Analysis - Week 4
Assignment: Response variable: Excellent average grade. Explanatory variables: Systematic smoking, skipping school, educational aspirations, gender, White, African American, Asian, American Indian, parents' education level, parents' care, being adopted. You can check the code that I used for this week's assignment below:
Code:
Results:
Summary of the k-means cluster analysis
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 11 variables that represent characteristics that could impact school achievement. The clustering variables included systematic smoking, skipping school, educational aspirations, gender, White, African American, Asian, American Indian, parents' educational level, parents' caring level, and being adopted. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=990) and a test set that included 30% of the observations (N=425). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 2, 3, 4 and 8-cluster solutions might be interpreted. The results below are for an interpretation of the 3-cluster solution.
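When the elbow curve is inconclusive, as here, the average silhouette score offers an additional criterion for choosing k. A minimal sketch, assuming the standardized training data are in a DataFrame named clus_train:
# Sketch: average silhouette score for candidate numbers of clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=123).fit_predict(clus_train)
    print(k, round(silhouette_score(clus_train, labels), 3))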
Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in the green cluster were spread out, showing high within-cluster variance. The other two clusters are densely packed, showing lower within-group variance. The three clusters overlap, suggesting that it may be better to evaluate a two-cluster solution.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
The means on the clustering variables showed that, compared to the other clusters, adolescents in cluster 1 had a relatively high likelihood of smoking and skipping school daily and a lower likelihood of having parents that have finished tertiary education. On the other hand, cluster 2 and 3 had approximately the same likelihood of smoking and skipping school daily, while students in cluster 2 have a higher likelihood to have parents with tertiary education.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on excellent grade point average (GPA). A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on excellent GPA (F-stat=11.88, p<0.01). The Tukey post hoc comparisons showed significant differences between clusters on excellent GPA, with the exception that clusters 1 and 2 were not significantly different from each other. Students in cluster 3 had the highest chance of receiving an excellent average grade (mean=0.42, sd=0.49), and students in cluster 1 had the lowest (mean=0.19, sd=0.39).
friwrite · 4 years ago
Capstone project Draft 3
Preliminary Results
To create the predictor variable, the statistical distribution of the different classes was examined. For example, the PSD of the different tasks (Figure 1) can provide information about the correlation between the Movement, Imaginary, and Baseline information of all 10 extracted features. Previous work on EEG classification mentions that correlation between imagery and motor signals can contribute to the performance of a BCI.
  Figure 1. PSD of S1 and 3 tasks.
   In the K-means cluster analysis, the variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.  
 Figure 2. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 2, 4 and 5-cluster solutions might be interpreted. The results below are for an interpretation of the 2-cluster solution.  
Canonical discriminant analysis was used to reduce the 10 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 3 shown below) indicated that the observations in clusters 1 and 2 were densely packed with relatively low within-cluster variance and did not overlap very much with the other clusters. Cluster 2 was generally distinct, but the observations had greater spread, suggesting higher within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 3 clusters, so it will be especially important to also evaluate cluster solutions with fewer clusters.
Figure 3. Plot of the first two canonical variables for the clustering variables by cluster.
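The canonical variables in a plot like Figure 3 can be approximated with a linear discriminant analysis that uses the k-means labels as classes (scikit-learn has no direct equivalent of SAS PROC CANDISC). The sketch below is a minimal example on synthetic data; the feature matrix and the number of clusters are placeholders.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# synthetic stand-in for the 10 standardized clustering variables
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

# k-means labels play the role of classes for the discriminant analysis;
# LDA yields at most (number of clusters - 1) canonical variables,
# so two canonical variables require at least 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X)
canon = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, labels)

plt.scatter(canon[:, 0], canon[:, 1], c=labels)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('First two canonical variables by cluster')
plt.show()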
The means on the clustering variables showed that, compared to the other cluster, observations in cluster 1 had moderate levels on the clustering variables. They had a relatively low percentage of belonging to the inactivity class.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on the target variable. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on the target variable (F = 80.4, p < .0001). The Tukey post hoc comparisons showed significant differences between clusters on the target variable, with the exception that clusters 1 and 2 were not significantly different from each other. Observations in cluster 1 had the highest value (mean = -0.0278, sd = 0.81), and cluster 2 had the lowest (mean = -0.2496, sd = 0.19).
For the Random Forest analysis, the explanatory variables with the highest relative importance scores were the Shannon entropy and the DWT features. The accuracy of the random forest was nearly 100% (97.96%) after 25 decision trees (Figure 4); the later trees added little to the overall accuracy of the model (which was already approximately 94%), suggesting that interpretation of a single decision tree may be appropriate.
Note: each training of the Random Forest increased the accuracy of the model.
 Figure 4. Random Forest accuracy after 25 decision trees
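The accuracy-versus-trees curve and the importance ranking can be reproduced along the following lines. This is a minimal sketch on synthetic data; the feature matrix, labels, and tree counts are placeholders, not the project's actual EEG features.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the extracted EEG features and task labels
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# test accuracy as a function of the number of trees
accuracies = []
tree_counts = range(1, 26)
for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, random_state=123).fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, rf.predict(X_test)))

plt.plot(list(tree_counts), accuracies)
plt.xlabel('Number of trees')
plt.ylabel('Test accuracy')
plt.show()

# relative importance of each feature in the largest forest
print(rf.feature_importances_)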
Table 1 compares the accuracy of the machine learning algorithms using all available features versus pairs of features chosen by their importance to the system. The first column shows the best accuracy, which corresponds to using all available features. The last column shows the feature pair (Shannon entropy and Discrete Wavelet Transform) whose performance competes with the first column. A significant difference in accuracy can be observed depending on how the pairs are formed. The pairs of features whose algorithm accuracy competes with the use of all extracted features are the ones with the greatest relevance for the system.
Table 1. MI classification with ML algorithms comparison
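A comparison like Table 1 can be generated by scoring the same classifiers on all features and on a selected feature pair. The sketch below is a minimal example on synthetic data; the column names (including shannon_entropy and dwt_energy) and the choice of classifiers are assumptions for illustration only.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic stand-in: a feature table with named columns and task labels
X_arr, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
cols = [f'feat_{i}' for i in range(8)] + ['shannon_entropy', 'dwt_energy']
X = pd.DataFrame(X_arr, columns=cols)

models = {'RF': RandomForestClassifier(random_state=123),
          'SVM': SVC(),
          'kNN': KNeighborsClassifier()}
feature_sets = {'all features': cols,
                'Shannon + DWT pair': ['shannon_entropy', 'dwt_energy']}

# cross-validated accuracy for each classifier and feature set
for set_name, subset in feature_sets.items():
    for model_name, model in models.items():
        acc = cross_val_score(model, X[subset], y, cv=5).mean()
        print(f'{model_name} on {set_name}: {acc:.3f}')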
mymlstudies · 4 years ago
K-means Titanic
A k-means cluster analysis was conducted to identify underlying subgroups of passengers in the Kaggle Titanic survival scenario, based on a subset of passenger records and 4-6 variables that could plausibly predict survival. The clustering variables, which relate to whether or not a passenger survived, were Age, Fare, Embarked, Class, Sex, and Siblings. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=499) and a test set that included 30%. A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 3 and 4-cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
Canonical discriminant analyses were used to reduce the 9 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2, shown above) indicated that the observations in clusters 2 and 3 were densely packed with relatively low within-cluster variance and did not overlap much with the other clusters. Cluster 1 was generally distinct, but its observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 4 were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solution with only 2 clusters.
The means on the clustering variables showed that, compared to the other clusters, passengers in clusters 2 and 3 had a higher chance of survival (nSurvived), while the remaining clusters had means below 1. These groups also had nSex values closer to the male coding (more composed of men), were less likely to have siblings on board, and fell in the mid age ranges. Embarked, Fare, and Pclass showed no apparent differences.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on nSurvived. Cluster 1 was excluded because of its small sample size.
SAS Code Used:
data clust;
set SASUSER.titanic;
if Sex="male" then nSex=2;
if Sex="female" then nSex=1;
if Embarked="S" then nEmbarked=1;
if Embarked="C" then nEmbarked=2;
if Embarked="Q" then nEmbarked=3;
if Survived=0 then nSurvived=1;
if Survived=1 then nSurvived=2;
idnum=_n_;
keep idnum Age nEmbarked Fare PassengerId Pclass Row nSex SibSp nSurvived;
if cmiss(of _all_) then delete;
run;
ods graphics on;
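* split the data into a 70% training set and a 30% test set;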
proc surveyselect data=clust out=traintest seed = 123
samprate=0.7 method=srs outall;
run;
data clus_train;
set traintest;
if selected=1;
run;
data clus_test;
set traintest;
if selected=0;
run;
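* standardize the clustering variables to a mean of 0 and a standard deviation of 1;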
proc standard data=clus_train out=clustvar mean=0 std=1;
var Age nEmbarked Fare PassengerId Pclass Row nSex SibSp nSurvived;
run;
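* macro to run k-means (PROC FASTCLUS) for a given number of clusters;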
%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
var Age nEmbarked Fare PassengerId Pclass Row SibSp nSex nSurvived;
run;
%mend;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
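* pull the overall r-square for each cluster solution to build the elbow curve;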
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;
data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;
data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;
data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;
data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;
data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;
data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;
data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;
data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;
data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
run;
* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;
run;
* plot clusters for 4 cluster solution;
proc candisc data=outdata4 out=clustcan;
class cluster;
var Age nEmbarked Fare PassengerId Pclass Row SibSp nSex nSurvived;
run;
proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
* validate clusters on the survival indicator (nSurvived), as described above;
* first merge the cluster assignment data with the validation variable;
data surv_data;
set clus_train;
keep idnum nSurvived;
run;
proc sort data=outdata4;
by idnum;
run;
proc sort data=surv_data;
by idnum;
run;
data merged;
merge outdata4 surv_data;
by idnum;
run;
proc sort data=merged;
by cluster;
run;
proc means data=merged;
var nSurvived;
by cluster;
run;
proc anova data=merged;
class cluster;
model nSurvived = cluster;
means cluster/tukey;
run;
Thanks for reading!!!
sarahayton · 4 years ago
Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify subgroups of people based on their similarity across 16 variables that might have an impact on marijuana use. Clustering variables included a number of characteristics including: age, alcohol and other substance use (ALCEVR1, ALCPROBS1, COCEVER1, INHEVER1), behavioral (DEP1, ESTEEM1, VIOL1), as well as educational (DEVIANT1, GPA1, EXPEL1) and family (FAMCONCT, PARACTV, PARPRES) characteristics. All predictors were standardized to have a mean value of zero and a standard deviation of one.
data = pd.read_csv("tree_addhealth.csv")
data_clean = data.dropna()
cluster = data_clean[['AGE', 'ALCEVR1', 'ALCPROBS1', 'COCEVER1', 'INHEVER1', 'CIGAVAIL', 'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]
clustervar = pd.DataFrame(preprocessing.scale(cluster), columns=cluster.columns)  # standardize: mean 0, sd 1
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import cdist

clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # mean distance of each observation to its nearest cluster centroid
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
Figure 1. Elbow curve of r-square values for the nine cluster solutions
The elbow curve was inconclusive, suggesting that the 2 and 3-cluster solutions might be interpreted. The results below are for an interpretation of the 3-cluster solution.
Canonical discriminant analysis was used to reduce the 16 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 4 were densely packed with relatively low within-cluster variance and did not overlap much with the other clusters. Cluster 2 was generally distinct, but its observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 3 were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
from sklearn.decomposition import PCA

model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
pca_2 = PCA(2)  # project onto the first two components for plotting
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
# merged_train joins the training observations with their k-means cluster assignments
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
Clustering variable means by cluster
cluster
AGE ALCEVR1 ALCPROBS1 COCEVER1 \
0 17.007748 0.685719 0.981627 0.109325
1 15.053934 -0.448153 -0.340980 0.008289
2 17.758390 0.129689 -0.161056 0.015962
INHEVER1 CIGAVAIL DEP1 ESTEEM1 VIOL1 PASSIST DEVIANT1 \
0 0.180064 0.446945 0.849564 -0.660382 0.830998 0.146302 1.178689
1 0.039940 0.229088 -0.310095 0.244037 -0.170265 0.092690 -0.366272
2 0.039106 0.292897 -0.139679 0.118059 -0.254260 0.091780 -0.220419
GPA1 EXPEL1 FAMCONCT PARACTV PARPRES
0 2.401125 0.098071 -1.096106 -0.439664 -0.563944
1 2.997048 0.017332 0.329184 0.114663 0.149519
2 2.835129 0.031923 0.220210 0.113718 0.132120
The means on the clustering variables showed that, compared to the other clusters, those in cluster 1 (labeled 0 in the output above) had the highest levels of alcohol use, alcohol problems, cocaine use, cigarette availability, depression, violence, deviant behavior, and expulsion. This cluster also had the lowest self-esteem, GPA, family connectedness, and parental presence. Those in cluster 2 (labeled 1) were younger and had the lowest levels of prior alcohol or substance use, the lowest levels of depression and expulsion, and the highest levels of self-esteem and GPA. Those in cluster 3 (labeled 2) were clearly the oldest and least violent, and fell between the other two clusters on all other characteristics.
m1= sub1.groupby('cluster').mean()
         MAREVER1
cluster
0        0.612540
1        0.086662
2        0.223464
gpamod = smf.ols(formula='MAREVER1 ~ C(cluster)', data=sub1).fit()
mc1 = multi.MultiComparison(sub1['MAREVER1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
0 1 -0.5259 0.001 -0.5696 -0.4822 True
0 2 -0.3891 0.001 -0.4332 -0.345 True
1 2 0.1368 0.001 0.1014 0.1722 True
---------------------------------------------------
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on marijuana use. A Tukey test was used for post hoc comparisons between the clusters. The Tukey post hoc comparisons showed significant differences between all clusters on marijuana use (all p < 0.001). Those in cluster 1 had the highest marijuana use (mean=0.6125, sd=0.4876), and cluster 2 had the lowest marijuana use (mean=0.0866, sd=0.2814).
ghaida-2525 · 4 years ago
Running a k-means Cluster Analysis (HW4)
In this homework I conducted a k-means cluster analysis to group fetal health records based on similarities in characteristics that could impact fetal health. These are:
Baseline value
Uterine_contractions
Light_decelerations
Severe_decelerations
Prolongued_decelerations
Abnormal_short_term_variability
Mean_value_of_short_term_variability
Histogram_mode
Histogram_mean
Histogram_median
Histogram_variance
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1 in order to put them on a comparable scale.
I then randomly split the data into training and test sets (70/30) to train and test the k-means model. To examine the influence of the number of clusters and select the best value, I conducted a series of analyses, fitting the model with k = 1-9 clusters. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. The results can be observed below:
Figure 1. Elbow curve for the nine cluster solutions.
Results for k = 2, 4, and 7 can be interpreted because of the bend (elbow) points at those positions. I selected k = 4 for my further analysis.
To reduce the number of variables, a PCA was performed.
A scatterplot of the first two canonical variables by cluster is shown below (Figure 2):
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
The cluster with green dots has relatively low within-cluster variance and is fairly densely packed; the cluster with purple dots is also reasonably compact, though with some variance. The cluster with yellow dots is not well separated from the purple cluster and is much more spread out on the plot, showing high within-cluster variance. Overall, the clusters do not overlap much, so the chosen number of clusters appears suitable for this data.
Cluster 0 had the highest baseline value. Cluster 3 included records with higher values of uterine contractions, light decelerations, severe decelerations, prolonged decelerations, abnormal short-term variability, mean value of short-term variability, and histogram mode compared to the other clusters. It also had higher values of histogram mean, histogram median, and histogram variance.
In order to validate the clusters, an ANOVA was conducted to test for significant differences between the clusters on fetal health. Results indicated significant differences between the clusters on fetal health (F(2, 1485) = 420.2, p < .0001). The Tukey test showed that the clusters differed significantly on fetal health, although the difference between clusters 2 and 3 was not significant.
CODE:
!pip install statsmodels
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
%matplotlib inline
RND_STATE = 2226
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install statsmodels
data = pd.read_csv("fetal_health.csv")
data_clean = data.dropna()
cluster=data_clean[['baseline value','uterine_contractions','light_decelerations','severe_decelerations',
'prolongued_decelerations',
'abnormal_short_term_variability','mean_value_of_short_term_variability',
'histogram_mode','histogram_mean',
'histogram_median','histogram_variance']]
cluster.describe()
clustervar=cluster.copy()
clustervar['baseline value']=preprocessing.scale(clustervar['baseline value'].astype('float64'))
clustervar['uterine_contractions']=preprocessing.scale(clustervar['uterine_contractions'].astype('float64'))
clustervar['light_decelerations']=preprocessing.scale(clustervar['light_decelerations'].astype('float64'))
clustervar['severe_decelerations']=preprocessing.scale(clustervar['severe_decelerations'].astype('float64'))
clustervar['prolongued_decelerations']=preprocessing.scale(clustervar['prolongued_decelerations'].astype('float64'))
clustervar['abnormal_short_term_variability']=preprocessing.scale(clustervar['abnormal_short_term_variability'].astype('float64'))
clustervar['mean_value_of_short_term_variability']=preprocessing.scale(clustervar['mean_value_of_short_term_variability'].astype('float64'))
clustervar['histogram_mode']=preprocessing.scale(clustervar['histogram_mode'].astype('float64'))
clustervar['histogram_mean']=preprocessing.scale(clustervar['histogram_mean'].astype('float64'))
clustervar['histogram_median']=preprocessing.scale(clustervar['histogram_median'].astype('float64'))
clustervar['histogram_variance']=preprocessing.scale(clustervar['histogram_variance'].astype('float64'))
clus_train, clus_test = train_test_split(clustervar, test_size=0.3, random_state=RND_STATE)
!pip install cluster
clusters=range(1,10)
meandist=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
!pip install statsmodels
!pip install matplotlib
!pip install scipy
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
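# merge the cluster assignments back onto the training data to profile the clusters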
clus_train.reset_index(level=0, inplace=True)
cluslist=list(clus_train['index'])
labels=list(model3.labels_)
newlist=dict(zip(cluslist, labels))
newclus=DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']
newclus.describe()
newclus.reset_index(level=0, inplace=True)
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
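# externally validate the clusters on fetal_health with an ANOVA and a Tukey HSD test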
gpa_data=data_clean['fetal_health']
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=RND_STATE)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['fetal_health', 'cluster']].dropna()
gpamod = smf.ols(formula='fetal_health ~ C(cluster)', data=sub1).fit()
print (gpamod.summary())
print ('means for fetal health by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
print ('standard deviations for fetal health by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
mc1 = multi.MultiComparison(sub1['fetal_health'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
natoli2001 · 4 years ago
The elbow curve was inconclusive, suggesting that the 2, 4 and 8-cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.  
Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 4 were densely packed with relatively low within-cluster variance and did not overlap much with the other clusters. Cluster 2 was generally distinct, but its observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 3 were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
data-analytical · 4 years ago
K-Means Cluster Analysis
I really didn't enjoy this assignment. It is so involved and there's so little data in my dataset, it is not a terribly good example of the method in action. Also, the sample script doesn't work out of the box, so it was hard initially, to run and compare. Here goes...
I ran a K-Means Cluster Analysis to look at underlying subgroups of countries based on their similarity across the following variables:
polityscore
incomeperperson
alcconsumption
breastcancerper100th
co2emissions
employrate
internetuserate
oilperperson
relectricperperson
suicideper100th
urbanrate
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
I did NOT use the SurveySelect procedure to split out test/training data, because my initial dataset is only 56 rows (after missing-data deletion).
A series of k-means cluster analyses were conducted on the full dataset, specifying k=1-9 clusters and using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Figure 1. Elbow curve of r-square values for the nine cluster solutions.
The elbow curve was very inconclusive, suggesting that the 2, 3, 5 and 6-cluster solutions might be interpreted. Keeping in mind that I have very little data to cluster (so I cannot be certain about anything), I opted to investigate the 3-cluster solution.
I ran a Canonical Discriminant Analysis, reducing my 11 variables to a more manageable number.
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster. Means procedure.
The lack of data reveals itself in the chart. Cluster 2 appears to contain only 2 data points! However, clusters 1 and 3 did show a degree of clustering. Cluster 1 was more densely packed than cluster 3, but clearly both displayed some internal variance, and both appeared distinct. That said, the results could be very different if I had more data to work with!
Because of the low number of observations, I feel that a two-cluster analysis would be the most effective in this case. However, because cluster 2 has almost no data within it, I am forced to use 3.
The clustering means analysis of the 3-cluster solution shows the following highlights:
Cluster1
Very low (negative) levels of income, alcohol consumption, breast cancer, internet use, oil per person, residential electricity per person, and urbanization.
Very high levels of employment.
Cluster2
Omitted - because of the low amount of data in the cluster
Cluster3
Very high levels of income per person, breast cancer per 100th, CO2 emissions, and electricity use.
Moderate to low levels of polity score, alcohol consumption, and suicide.
Figure 3. Means Chart.
Whilst, once again, it would be dangerous to draw too many conclusions from such a small dataset, cluster 3 seems a little more comparable to a developed country, perhaps with better medical care (or access to it), which might extend the average person's life.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on life expectancy.
Figure 4. ANOVA Chart and tables.
I used a Tukey test for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on life expectancy (F(2, 53)=17.21, p<.0001).
Countries in cluster 3 had the highest life expectancy (mean=2.36, sd=0.78), and cluster 1 had the lowest (mean=2.67, sd=0.76).
Here is the code that I used to produce this analysis...
If I had more data, I could probably use this model with more confidence. Because the model tends to expect fairly uniform and sizable clusters of data (and mine clearly were not), I would be unhappy making any major decisions based upon this analysis.