
unnatikoppikar · 4 years ago


Building a K-Means Clustering Pipeline

What Is Clustering?

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data:

Meaningfulness

Usefulness

There are three popular categories of clustering algorithms:

Partitional clustering

Hierarchical clustering

Density-based clustering
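To make the three categories concrete, here is a minimal sketch that runs one scikit-learn estimator from each family on the same toy data. The synthetic blobs, the `eps` and `min_samples` values, and the variable names are assumptions chosen for illustration; they are not part of the TCGA example that follows.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (illustrative assumption).
features, _ = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=42,
)

# Partitional: every point is assigned to exactly one of k clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)

# Hierarchical: clusters are built by repeatedly merging the closest groups.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# Density-based: dense regions become clusters; the label -1 marks noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(features)

for name, labels in [
    ("partitional (KMeans)", kmeans_labels),
    ("hierarchical (AgglomerativeClustering)", hier_labels),
    ("density-based (DBSCAN)", dbscan_labels),
]:
    # Exclude the DBSCAN noise label when counting clusters.
    n_found = len(set(labels) - {-1})
    print(f"{name}: {n_found} clusters")
```

Note that only DBSCAN works without being told the number of clusters up front; it infers them from the density parameters instead.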

In [1]: import tarfile
   ...: import urllib.parse
   ...: import urllib.request
   ...:
   ...: import numpy as np
   ...: import matplotlib.pyplot as plt
   ...: import pandas as pd
   ...: import seaborn as sns
   ...:
   ...: from sklearn.cluster import KMeans
   ...: from sklearn.decomposition import PCA
   ...: from sklearn.metrics import silhouette_score, adjusted_rand_score
   ...: from sklearn.pipeline import Pipeline
   ...: from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [2]: uci_tcga_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00401/"
   ...: archive_name = "TCGA-PANCAN-HiSeq-801x20531.tar.gz"
   ...:
   ...: # Build the url
   ...: full_download_url = urllib.parse.urljoin(uci_tcga_url, archive_name)
   ...:
   ...: # Download the file
   ...: r = urllib.request.urlretrieve(full_download_url, archive_name)
   ...:
   ...: # Extract the data from the archive
   ...: tar = tarfile.open(archive_name, "r:gz")
   ...: tar.extractall()
   ...: tar.close()

In [3]: datafile = "TCGA-PANCAN-HiSeq-801x20531/data.csv"
   ...: labels_file = "TCGA-PANCAN-HiSeq-801x20531/labels.csv"
   ...:
   ...: data = np.genfromtxt(
   ...:     datafile,
   ...:     delimiter=",",
   ...:     usecols=range(1, 20532),
   ...:     skip_header=1
   ...: )
   ...:
   ...: true_label_names = np.genfromtxt(
   ...:     labels_file,
   ...:     delimiter=",",
   ...:     usecols=(1,),
   ...:     skip_header=1,
   ...:     dtype="str"
   ...: )

In [4]: data[:5, :3]
Out[4]:
array([[0.        , 2.01720929, 3.26552691],
       [0.        , 0.59273209, 1.58842082],
       [0.        , 3.51175898, 4.32719872],
       [0.        , 3.66361787, 4.50764878],
       [0.        , 2.65574107, 2.82154696]])

In [5]: true_label_names[:5]
Out[5]: array(['PRAD', 'LUAD', 'PRAD', 'PRAD', 'BRCA'], dtype='<U4')

In [6]: label_encoder = LabelEncoder()

In [7]: true_labels = label_encoder.fit_transform(true_label_names)

In [8]: true_labels[:5]
Out[8]: array([4, 3, 4, 4, 0])

In [9]: label_encoder.classes_
Out[9]: array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype='<U4')

In [10]: n_clusters = len(label_encoder.classes_)

In [11]: preprocessor = Pipeline(
    ...:     [
    ...:         ("scaler", MinMaxScaler()),
    ...:         ("pca", PCA(n_components=2, random_state=42)),
    ...:     ]
    ...: )

In [12]: clusterer = Pipeline(
    ...:     [
    ...:         (
    ...:             "kmeans",
    ...:             KMeans(
    ...:                 n_clusters=n_clusters,
    ...:                 init="k-means++",
    ...:                 n_init=50,
    ...:                 max_iter=500,
    ...:                 random_state=42,
    ...:             ),
    ...:         ),
    ...:     ]
    ...: )

In [13]: pipe = Pipeline(
    ...:     [
    ...:         ("preprocessor", preprocessor),
    ...:         ("clusterer", clusterer)
    ...:     ]
    ...: )

In [14]: pipe.fit(data)
Out[14]:
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('scaler', MinMaxScaler()),
                                 ('pca',
                                  PCA(n_components=2, random_state=42))])),
                ('clusterer',
                 Pipeline(steps=[('kmeans',
                                  KMeans(max_iter=500, n_clusters=5, n_init=50,
                                         random_state=42))]))])

In [15]: preprocessed_data = pipe["preprocessor"].transform(data)

In [16]: predicted_labels = pipe["clusterer"]["kmeans"].labels_

In [17]: silhouette_score(preprocessed_data, predicted_labels)
Out[17]: 0.5118775528450304

In [18]: adjusted_rand_score(true_labels, predicted_labels)
Out[18]: 0.722276752060253

In [19]: pcadf = pd.DataFrame(
    ...:     pipe["preprocessor"].transform(data),
    ...:     columns=["component_1", "component_2"],
    ...: )
    ...:
    ...: pcadf["predicted_cluster"] = pipe["clusterer"]["kmeans"].labels_
    ...: pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

In [20]: plt.style.use("fivethirtyeight")
    ...: plt.figure(figsize=(8, 8))
    ...:
    ...: # Pass x and y by keyword; recent seaborn releases reject them as positional arguments
    ...: scat = sns.scatterplot(
    ...:     x="component_1",
    ...:     y="component_2",
    ...:     s=50,
    ...:     data=pcadf,
    ...:     hue="predicted_cluster",
    ...:     style="true_label",
    ...:     palette="Set2",
    ...: )
    ...:
    ...: scat.set_title(
    ...:     "Clustering results from TCGA Pan-Cancer\nGene Expression Data"
    ...: )
    ...: plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
    ...:
    ...: plt.show()

Here is what the output graph looks like:
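The two scores computed in the session measure different things, and a toy example may help when reading them. This is a minimal sketch on hand-made labels and points (all values here are illustrative assumptions, not taken from the TCGA data): the adjusted Rand index compares a predicted grouping with known labels and ignores how the clusters are numbered, while the silhouette coefficient uses only the features and the predicted labels.

```python
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hand-made ground truth with three groups of two samples each.
true_groups = [0, 0, 1, 1, 2, 2]

# Same grouping, different cluster numbers: ARI treats this as a perfect match.
renamed_groups = [2, 2, 0, 0, 1, 1]
ari_renamed = adjusted_rand_score(true_groups, renamed_groups)

# Everything in one cluster: no agreement beyond chance.
single_group = [0, 0, 0, 0, 0, 0]
ari_single = adjusted_rand_score(true_groups, single_group)

# Silhouette needs no ground truth, only points and predicted labels.
# Two tight, far-apart pairs should score close to 1.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
point_labels = [0, 0, 1, 1]
sil = silhouette_score(points, point_labels)

print(f"ARI (renamed labels): {ari_renamed:.2f}")
print(f"ARI (single cluster): {ari_single:.2f}")
print(f"Silhouette: {sil:.2f}")
```

This is why ARI can be computed against the encoded cancer types even though k-means assigns its cluster numbers arbitrarily.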


thinzarsaw-blog · 5 years ago


Assignment 4: k-means clustering

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Load the dataset
data = pd.read_csv('C:\\Users\\saw\\Documents\\Breast Cancer Wisconsin_Diagnostic.csv')
data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()

datafile = "C:\\Users\\saw\\Documents\\Breast Cancer Wisconsin_Diagnostic.csv"
labels_file = "C:\\Users\\saw\\Documents\\Breast Cancer Wisconsin_Diagnostic_label.csv"

data = np.genfromtxt(
    datafile,
    delimiter=",",
    usecols=range(1, 30),
    skip_header=1
)

true_label_names = np.genfromtxt(
    labels_file,
    delimiter=",",
    usecols=(1,),
    skip_header=1,
    dtype=str
)

data[:5, :3]
true_label_names[:5]

# Encode target label
label_encoder = LabelEncoder()
true_labels = label_encoder.fit_transform(true_label_names)
true_labels[:5]
label_encoder.classes_
n_clusters = len(label_encoder.classes_)

# Scale the features to [0, 1], then reduce them to two principal components
preprocessor = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        ("pca", PCA(n_components=2, random_state=42)),
    ]
)

clusterer = Pipeline(
    [
        (
            "kmeans",
            KMeans(
                n_clusters=n_clusters,
                init="k-means++",
                n_init=50,
                max_iter=500,
                random_state=42,
            ),
        ),
    ]
)

pipe = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("clusterer", clusterer)
    ]
)

pipe.fit(data)

# Predict the data
preprocessed_data = pipe["preprocessor"].transform(data)
predicted_labels = pipe["clusterer"]["kmeans"].labels_

silhouette_score(preprocessed_data, predicted_labels)
adjusted_rand_score(true_labels, predicted_labels)

pcadf = pd.DataFrame(
    pipe["preprocessor"].transform(data),
    columns=["component_1", "component_2"],
)
pcadf["predicted_cluster"] = pipe["clusterer"]["kmeans"].labels_
pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

# Plot (x and y are passed by keyword; recent seaborn releases reject them as positional arguments)
plt.style.use("fivethirtyeight")
plt.figure(figsize=(8, 8))

scat = sns.scatterplot(
    x="component_1",
    y="component_2",
    s=50,
    data=pcadf,
    hue="predicted_cluster",
    style="true_label",
    palette="Set2",
)

scat.set_title(
    "Clustering results from Breast cancer Data"
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

plt.show()

Tuning a K-Means Clustering Pipeline

# Empty lists to hold evaluation metrics
silhouette_scores = []
ari_scores = []

for n in range(2, 11):
    # This sets the number of components for pca,
    # but leaves other steps unchanged
    pipe["preprocessor"]["pca"].n_components = n
    pipe.fit(data)

    silhouette_coef = silhouette_score(
        pipe["preprocessor"].transform(data),
        pipe["clusterer"]["kmeans"].labels_,
    )
    ari = adjusted_rand_score(
        true_labels,
        pipe["clusterer"]["kmeans"].labels_,
    )

    # Add metrics to their lists
    silhouette_scores.append(silhouette_coef)
    ari_scores.append(ari)

plt.style.use("fivethirtyeight")
plt.figure(figsize=(6, 6))
plt.plot(
    range(2, 11),
    silhouette_scores,
    c="#008fd5",
    label="Silhouette Coefficient",
)
plt.plot(range(2, 11), ari_scores, c="#fc4f30", label="ARI")

plt.xlabel("n_components")
plt.legend()
plt.title("Clustering Performance\nas a Function of n_components")
plt.tight_layout()
plt.show()
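The tuning loop changes `n_components` by assigning to the attribute of the nested PCA step. A possible alternative is scikit-learn's `set_params` with double-underscore names, which addresses the same nested step. The sketch below is self-contained and runs on synthetic data; `demo_data`, `demo_pipe`, and the smaller range of components are assumptions for illustration, not the breast cancer data used in this post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 120 samples, 8 features, 3 groups (assumption).
demo_data, _ = make_blobs(n_samples=120, n_features=8, centers=3, random_state=42)

# Same nested layout as the pipeline in the post, with smaller settings.
demo_pipe = Pipeline(
    [
        (
            "preprocessor",
            Pipeline(
                [
                    ("scaler", MinMaxScaler()),
                    ("pca", PCA(n_components=2, random_state=42)),
                ]
            ),
        ),
        (
            "clusterer",
            Pipeline([("kmeans", KMeans(n_clusters=3, n_init=10, random_state=42))]),
        ),
    ]
)

scores = {}
for n in range(2, 6):
    # "<step>__<substep>__<parameter>" reaches into the nested pipeline.
    demo_pipe.set_params(preprocessor__pca__n_components=n)
    demo_pipe.fit(demo_data)

    reduced = demo_pipe["preprocessor"].transform(demo_data)
    labels = demo_pipe["clusterer"]["kmeans"].labels_
    scores[n] = silhouette_score(reduced, labels)

for n, score in scores.items():
    print(f"n_components={n}: silhouette={score:.3f}")
```

Both styles should give the same result; `set_params` has the advantage of raising an error if the step or parameter name is misspelled.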

A k-means cluster analysis was conducted to identify underlying subgroups based on their similarity of responses on 30 variables that represent characteristics that could have an impact on achievement. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. A series of k-means cluster analyses was conducted on the training data, specifying k = 1 to 9 clusters and using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
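The procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the data are synthetic (`make_blobs` with 30 features standing in for the 30 clustering variables), the scaler is fitted on the training set, and r-square is taken as one minus the within-cluster sum of squares (`inertia_`) divided by the total sum of squares, which is one common definition and may differ from the one used for the reported analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 30 clustering variables (assumption).
features, _ = make_blobs(n_samples=300, n_features=30, centers=3, random_state=42)

# 70% training set, 30% test set.
train, test = train_test_split(features, test_size=0.3, random_state=42)

# Standardize to mean 0 and standard deviation 1 using the training data.
scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)

# Total sum of squares around the overall mean of the training data.
total_ss = ((train_std - train_std.mean(axis=0)) ** 2).sum()

# Fit k-means (Euclidean distance) for k = 1 to 9 and record r-square.
r_square = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(train_std)
    r_square[k] = 1 - model.inertia_ / total_ss

for k, value in r_square.items():
    print(f"k={k}: r-square={value:.3f}")
```

Plotting `r_square` against k gives the elbow curve; the point where the gain from adding another cluster flattens out is a candidate for the number of clusters to interpret.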
