kmcwithzohaib
Running a k-means Cluster Analysis
kmcwithzohaib · 4 days ago
For this assignment, I'll perform a k-means cluster analysis on the Iris dataset, which is a classic dataset in machine learning and statistics. The Iris dataset contains measurements for 150 iris flowers from three different species, making it ideal for clustering exercises.
Why I'm Not Splitting the Data
The Iris dataset has only 150 observations, which is relatively small. Splitting this into training and test sets would leave us with very few observations for meaningful clustering in each set. Therefore, I'll perform the analysis on the entire dataset.
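The size constraint is easy to confirm before clustering: loading the dataset shows just 150 observations across four features, which a typical 70/30 split would shrink to roughly 105 and 45 rows per set. A minimal check:

```python
from sklearn import datasets

# Load the Iris dataset and inspect its size
iris = datasets.load_iris()
n_samples, n_features = iris.data.shape
print(n_samples, n_features)  # 150 observations, 4 features
```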
Python Implementation
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # We'll use all four features for clustering
y = iris.target  # Actual species labels (for comparison only)
# Standardize the features (important for k-means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
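Standardization matters here because k-means assigns points by Euclidean distance, so a feature measured on a larger scale would dominate the clustering. A quick sketch verifying that each scaled feature ends up with (approximately) zero mean and unit variance:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```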
# Determine the optimal number of clusters using the Elbow Method
wcss = []  # Within-cluster sum of squares
silhouette_scores = []
cluster_range = range(2, 11)

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    # Calculate silhouette score
    silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(silhouette_avg)
# Plot the Elbow Method graph
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(cluster_range, wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # Within-cluster sum of squares

plt.subplot(1, 2, 2)
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()
# Perform k-means clustering with the optimal number of clusters (k=3)
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', random_state=42)
kmeans.fit(X_scaled)
clusters = kmeans.predict(X_scaled)
# Add cluster assignments to the original data
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['Species'] = iris.target_names[y]
iris_df['Cluster'] = clusters
# Visualize the clusters (using the first two features for simplicity)
plt.figure(figsize=(10, 6))
colors = ['red', 'green', 'blue']
for i in range(optimal_clusters):
    plt.scatter(X_scaled[clusters == i, 0], X_scaled[clusters == i, 1],
                s=50, c=colors[i], label=f'Cluster {i}')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', marker='*', label='Centroids')
plt.title('K-means Clustering of Iris Dataset')
plt.xlabel('Scaled Sepal Length')
plt.ylabel('Scaled Sepal Width')
plt.legend()
plt.show()
# Cluster analysis results
print("\nCluster Analysis Results:")
print("------------------------")
print(f"Optimal number of clusters: {optimal_clusters}")
print(f"Cluster centers (original scale):\n{scaler.inverse_transform(kmeans.cluster_centers_)}")
print("\nCluster distribution:")
print(iris_df['Cluster'].value_counts().sort_index())
# Compare clusters with actual species
print("\nCluster vs. Species Crosstab:")
print(pd.crosstab(iris_df['Species'], iris_df['Cluster']))
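Beyond eyeballing a crosstab, the agreement between cluster assignments and the true species labels can be summarized in a single number with the adjusted Rand index (1.0 means perfect agreement, values near 0 mean chance-level agreement). A sketch of that check:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, init='k-means++', n_init=10,
                random_state=42).fit_predict(X_scaled)

# Adjusted Rand index: label-permutation-invariant agreement
# between cluster assignments and the true species
ari = adjusted_rand_score(iris.target, labels)
print(round(ari, 3))
```

Because the index is invariant to how clusters are numbered, it sidesteps the need to manually match cluster IDs to species names.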
1. Elbow Method and Silhouette Score Plots
Explanation:
Left plot (Elbow Method): Shows the within-cluster sum of squares (WCSS) for different numbers of clusters (k=2 to k=10). The "elbow" appears at k=3, suggesting this is the optimal number of clusters.
Right plot (Silhouette Scores): Shows the average silhouette score for each number of clusters. On the standardized features the silhouette score actually peaks at k=2, with k=3 (≈0.46) close behind; taken together with the elbow at k=3 and the known three-species structure of the data, k=3 remains the sensible choice.
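As a sanity check, the average silhouette scores for small k can be recomputed directly rather than read off the plot (a sketch recomputing only k=2 and k=3):

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

# Recompute the average silhouette score for k=2 and k=3
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(k, round(scores[k], 3))
```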
2. Cluster Visualization Plot
Explanation:
Shows the data points colored by their cluster assignment (red, green, blue)
Yellow stars represent the cluster centroids
The x-axis shows scaled sepal length and y-axis shows scaled sepal width
We can see three fairly distinct clusters with some overlap between two of them
Cluster Analysis Results
Cluster Analysis Results:
------------------------
Optimal number of clusters: 3
Cluster centers (original scale):
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]

Cluster distribution:
0    62
1    50
2    38
dtype: int64

Cluster vs. Species Crosstab:
Cluster      0   1   2
Species
setosa       0  50   0
versicolor  48   0   2
virginica   14   0  36
Interpretation
The k-means algorithm successfully identified three distinct clusters in the Iris dataset
Cluster 1 perfectly matches the setosa species (50 observations)
Clusters 0 and 2 primarily contain versicolor and virginica species respectively, with some overlap
The overlap between versicolor and virginica in the clusters reflects the natural similarity between these two species
This analysis demonstrates that k-means clustering can effectively identify natural groupings in the Iris dataset that largely correspond to the actual species classifications. The algorithm performed particularly well at distinguishing setosa from the other two species, while showing some expected overlap between versicolor and virginica.