psheetal-blog
psheetal-blog · 5 years ago
K-Means Cluster Analysis
Code
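import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface
import seaborn as sns
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Jupyter/IPython magic: render plots inline in the notebook
%matplotlib inline

# Fixed seed so the split and the clustering are reproducible
rnd_state = 3927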
Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
target               150 non-null float64
dtypes: float64(5)
memory usage: 5.9 KB
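The post does not show how this frame was built. A minimal sketch that should give a summary of the same shape, assuming the data came from scikit-learn's bundled Iris set (the name `df` and the float cast of `target` are assumptions made to match the dtypes above):

```python
import pandas as pd
from sklearn import datasets

# Load the bundled Iris data set (150 rows, 4 feature columns).
iris = datasets.load_iris()

# Build a DataFrame with the four measurements plus a float 'target' column,
# matching the dtypes shown in the summary above.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target.astype(float)

# Print the column summary (row count, non-null counts, dtypes).
df.info()
```

The exact layout of the `info()` output varies between pandas versions, so it may not match the listing above character for character.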
[Figure: Iris Data]
Simple Classifier
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.scatter(list(map(lambda tup: tup[0], pca_transformed)),
            list(map(lambda tup: tup[1], pca_transformed)),
            c=list(map(lambda col: "#9b59b6" if col==0 else "#e74c3c" if col==1 else "#2ecc71", target_test)))
plt.title('PCA on Iris data, real classes');
plt.subplot(122)
plt.scatter(list(map(lambda tup: tup[0], pca_transformed)),
            list(map(lambda tup: tup[1], pca_transformed)),
            c=list(map(lambda col: "#9b59b6" if col==0 else "#e74c3c" if col==1 else "#2ecc71", prediction)))
plt.title('PCA on Iris data, predicted classes');
[Figure: Simple Classifier Graph. Two panels: "PCA on Iris data, real classes" and "PCA on Iris data, predicted classes"]
Result
A k-means cluster analysis was conducted to identify classes of iris plants based on the similarity of four variables that describe each plant's flower. The clustering variables were four quantitative measurements: sepal length, sepal width, petal length, and petal width.
Data were randomly split into a training set containing 70% of the observations and a test set containing the remaining 30%. A k-means cluster analysis was then conducted on the training data with k=3 clusters (representing the three classes Iris Setosa, Iris Versicolour, and Iris Virginica), using Euclidean distance.
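The split and clustering steps are not shown in the post. A rough sketch of how they might look, reusing the names `pca_transformed`, `target_test`, `prediction` and `rnd_state` from the code above; the other names (`predictors_train`, `classifier`, and so on) and the `n_init=10` setting are assumptions:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rnd_state = 3927  # seed taken from the post's setup code

iris = datasets.load_iris()

# 70% training / 30% test split of the 150 observations.
(predictors_train, predictors_test,
 target_train, target_test) = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=rnd_state)

# k-means with k=3; scikit-learn's KMeans minimises squared Euclidean distance.
classifier = KMeans(n_clusters=3, n_init=10, random_state=rnd_state)
classifier.fit(predictors_train)

# Assign each test observation to its nearest cluster centre.
prediction = classifier.predict(predictors_test)

# Project the test features onto two principal components for plotting.
pca_transformed = PCA(n_components=2).fit_transform(predictors_test)
```

Note that k-means never sees `target_train`: the cluster ids it returns (0, 1, 2) are arbitrary and need not line up with the class ids.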
To describe the classifier's performance and see what types of errors it makes, a confusion matrix was created. The accuracy score is 0.82, which is quite good given the small number of observations (n=150).
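The evaluation code is not included in the post. One way to compute a confusion matrix and accuracy for a k-means model is sketched here. Because cluster ids are arbitrary, this sketch first relabels each cluster with the majority class of its training points; that relabelling step is an assumption, and it is not known whether the post did the same, so the printed accuracy may differ from 0.82.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rnd_state = 3927  # seed taken from the post's setup code

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=rnd_state)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=rnd_state).fit(X_train)

# Assumption: relabel each cluster with the most common true class among
# the training points assigned to it (cluster ids are otherwise arbitrary).
cluster_to_class = {
    cluster: np.bincount(y_train[kmeans.labels_ == cluster]).argmax()
    for cluster in range(3)
}
prediction = np.array([cluster_to_class[c] for c in kmeans.predict(X_test)])

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, prediction, labels=[0, 1, 2])
print(matrix)
print("accuracy:", accuracy_score(y_test, prediction))
```

Off-diagonal entries of the matrix show which classes are confused with each other. The accuracy is the sum of the diagonal divided by the number of test observations.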