kmcwithzohaib
Running a k-means Cluster Analysis
kmcwithzohaib · 4 days ago
For this assignment, I'll perform a k-means cluster analysis on the Iris dataset, which is a classic dataset in machine learning and statistics. The Iris dataset contains measurements for 150 iris flowers from three different species, making it ideal for clustering exercises.
Why I'm Not Splitting the Data
The Iris dataset has only 150 observations, which is relatively small. Splitting this into training and test sets would leave us with very few observations for meaningful clustering in each set. Therefore, I'll perform the analysis on the entire dataset.
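The size constraint is easy to confirm before clustering: loading the dataset shows just 150 observations across four features, which a typical 70/30 split would shrink to roughly 105 and 45 rows per set. A minimal check:

```python
from sklearn import datasets

# Load the Iris dataset and inspect its size
iris = datasets.load_iris()
n_samples, n_features = iris.data.shape
print(n_samples, n_features)  # 150 observations, 4 features
```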
Python Implementation
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # We'll use all four features for clustering
y = iris.target  # Actual species labels (for comparison only)
# Standardize the features (important for k-means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
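Standardization matters here because k-means assigns points by Euclidean distance, so a feature measured on a larger scale would dominate the clustering. A quick sketch verifying that each scaled feature ends up with (approximately) zero mean and unit variance:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```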
# Determine the optimal number of clusters using the Elbow Method
wcss = []  # Within-cluster sum of squares
silhouette_scores = []
cluster_range = range(2, 11)

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    # Calculate silhouette score
    silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(silhouette_avg)
# Plot the Elbow Method graph
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(cluster_range, wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # Within-cluster sum of squares

plt.subplot(1, 2, 2)
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()
# Perform k-means clustering with the optimal number of clusters (k=3)
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', random_state=42)
kmeans.fit(X_scaled)
clusters = kmeans.predict(X_scaled)
# Add cluster assignments to the original data
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['Species'] = iris.target_names[y]
iris_df['Cluster'] = clusters
# Visualize the clusters (using the first two features for simplicity)
plt.figure(figsize=(10, 6))
colors = ['red', 'green', 'blue']
for i in range(optimal_clusters):
    plt.scatter(X_scaled[clusters == i, 0], X_scaled[clusters == i, 1],
                s=50, c=colors[i], label=f'Cluster {i}')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', marker='*', label='Centroids')
plt.title('K-means Clustering of Iris Dataset')
plt.xlabel('Scaled Sepal Length')
plt.ylabel('Scaled Sepal Width')
plt.legend()
plt.show()
# Cluster analysis results
print("\nCluster Analysis Results:")
print("------------------------")
print(f"Optimal number of clusters: {optimal_clusters}")
print(f"Cluster centers (original scale):\n{scaler.inverse_transform(kmeans.cluster_centers_)}")
print("\nCluster distribution:")
print(iris_df['Cluster'].value_counts().sort_index())
# Compare clusters with actual species
print("\nCluster vs. Species Crosstab:")
print(pd.crosstab(iris_df['Species'], iris_df['Cluster']))
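Beyond eyeballing a crosstab, the agreement between cluster assignments and the true species labels can be summarized in a single number with the adjusted Rand index (1.0 means perfect agreement, values near 0 mean chance-level agreement). A sketch of that check:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, init='k-means++', n_init=10,
                random_state=42).fit_predict(X_scaled)

# Adjusted Rand index: label-permutation-invariant agreement
# between cluster assignments and the true species
ari = adjusted_rand_score(iris.target, labels)
print(round(ari, 3))
```

Because the index is invariant to how clusters are numbered, it sidesteps the need to manually match cluster IDs to species names.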
1. Elbow Method and Silhouette Score Plots
Explanation:
Left plot (Elbow Method): Shows the within-cluster sum of squares (WCSS) for different numbers of clusters (k=2 to k=10). The "elbow" appears at k=3, suggesting this is the optimal number of clusters.
Right plot (Silhouette Scores): Shows the average silhouette score for each number of clusters. On the standardized features the silhouette score actually peaks at k=2, with k=3 (≈0.46) close behind; taken together with the elbow at k=3 and the known three-species structure of the data, k=3 remains the sensible choice.
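As a sanity check, the average silhouette scores for small k can be recomputed directly rather than read off the plot (a sketch recomputing only k=2 and k=3):

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

# Recompute the average silhouette score for k=2 and k=3
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(k, round(scores[k], 3))
```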
2. Cluster Visualization Plot
Explanation:
Shows the data points colored by their cluster assignment (red, green, blue)
Yellow stars represent the cluster centroids
The x-axis shows scaled sepal length and y-axis shows scaled sepal width
We can see three fairly distinct clusters with some overlap between two of them
Cluster Analysis Results
Cluster Analysis Results:
------------------------
Optimal number of clusters: 3
Cluster centers (original scale):
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]

Cluster distribution:
0    62
1    50
2    38
dtype: int64

Cluster vs. Species Crosstab:
Cluster      0   1   2
Species
setosa       0  50   0
versicolor  48   0   2
virginica   14   0  36
Interpretation
The k-means algorithm successfully identified three distinct clusters in the Iris dataset
Cluster 1 perfectly matches the setosa species (50 observations)
Clusters 0 and 2 primarily contain versicolor and virginica species respectively, with some overlap
The overlap between versicolor and virginica in the clusters reflects the natural similarity between these two species
This analysis demonstrates that k-means clustering can effectively identify natural groupings in the Iris dataset that largely correspond to the actual species classifications. The algorithm performed particularly well at distinguishing setosa from the other two species, while showing some expected overlap between versicolor and virginica.