samrats - Tumblr blog

samrats · 3 years ago


Task

This week’s assignment involves running a k-means cluster analysis. Cluster analysis is an unsupervised machine learning method that partitions the observations in a data set into a smaller set of clusters where each observation belongs to only one cluster. The goal of cluster analysis is to group, or cluster, observations into subsets based on their similarity of responses on multiple variables. Clustering variables should be primarily quantitative variables, but binary variables may also be included.

Your assignment is to run a k-means cluster analysis to identify subgroups of observations in your data set that have similar patterns of response on a set of clustering variables.

Data

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

Attribute Information:

sepal length in cm

sepal width in cm

petal length in cm

petal width in cm

class:

Iris Setosa

Iris Versicolour

Iris Virginica

Results

A k-means cluster analysis was conducted to identify classes of iris plants based on their similarity on 4 variables that represent characteristics of each plant's flower. The clustering variables were all quantitative: sepal length, sepal width, petal length, and petal width.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. Then a k-means cluster analysis was conducted on the training data, specifying k=3 clusters (representing three classes: Iris Setosa, Iris Versicolour, Iris Virginica) and using Euclidean distance.
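To make the procedure concrete, here is a minimal, dependency-free sketch of the k-means loop (assign each point to its nearest centroid by Euclidean distance, then recompute the centroids). It is not scikit-learn's implementation: the function names, the toy points and the hand-picked starting centroids are assumptions made for illustration, whereas `KMeans` chooses its own starting centroids.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means loop: returns (labels, centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data (made up): two well-separated groups in two dimensions
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The cluster numbers that come out depend only on the order of the starting centroids, which is why they need not match any external class encoding.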

To describe the performance of the classifier and see what types of errors it makes, a confusion matrix was created. The accuracy score is 0.82, which is quite good given the small number of observations (n=150).

In [73]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import seaborn as sns
%matplotlib inline

rnd_state = 3927

In [2]:
iris = datasets.load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])
data.head()

Out[2]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0

In [66]:
data.info()

RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
target               150 non-null float64
dtypes: float64(5)
memory usage: 5.9 KB

In [3]:data.describe()

Out[3]:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)      target
count         150.000000        150.000000         150.000000        150.000000  150.000000
mean            5.843333          3.054000           3.758667          1.198667    1.000000
std             0.828066          0.433594           1.764420          0.763161    0.819232
min             4.300000          2.000000           1.000000          0.100000    0.000000
25%             5.100000          2.800000           1.600000          0.300000    0.000000
50%             5.800000          3.000000           4.350000          1.300000    1.000000
75%             6.400000          3.300000           5.100000          1.800000    2.000000
max             7.900000          4.400000           6.900000          2.500000    2.000000

In [4]:pca_transformed = PCA(n_components=2).fit_transform(data.iloc[:, :4])

In [7]:
colors = ["#9b59b6", "#e74c3c", "#2ecc71"]
# Map each class label (0, 1, 2) to its colour
point_colors = [colors[int(label)] for label in data.target]

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.scatter(pca_transformed[:, 0], pca_transformed[:, 1], c=point_colors)
plt.title('PCA on Iris data')
plt.subplot(122)
# Pass the column as x=...; newer seaborn releases expect a keyword here
sns.countplot(x=data.target, palette=sns.color_palette(colors))
plt.title('Countplot Iris classes');

For visualization purposes, the number of dimensions was reduced to two by applying PCA. The plot illustrates that classes 1 and 2 are not clearly separated. The count plot shows that each class contains the same number of observations (n=50), so the classes are balanced.

In [85]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(data.iloc[:, :4], data.target, test_size = .3, random_state = rnd_state)

In [86]:
classifier = KMeans(n_clusters=3).fit(predictors_train)
prediction = classifier.predict(predictors_test)

In [87]:pca_transformed = PCA(n_components=2).fit_transform(predictors_test)

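K-means numbers its clusters arbitrarily, so the cluster numbers do not necessarily line up with the target encoding. In this run, predicted clusters 1 and 2 are swapped relative to the real classes, so the code block below exchanges the two labels.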

In [88]:
# Swap labels 1 and 2, using 3 as a temporary placeholder
prediction = np.where(prediction==1, 3, prediction)
prediction = np.where(prediction==2, 1, prediction)
prediction = np.where(prediction==3, 2, prediction)
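The swap above is specific to one run. As a hedged alternative, the sketch below derives the cluster-to-class mapping from data by majority vote: each cluster is assigned the class that occurs most often among its members. The helper name and the toy label lists are hypothetical and are not taken from the notebook.

```python
from collections import Counter


def map_clusters_to_classes(cluster_ids, true_labels):
    """Return {cluster id: most frequent true class inside that cluster}."""
    votes = {}
    for cluster, label in zip(cluster_ids, true_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Toy example (made up): cluster 2 corresponds to class 1 and cluster 1 to class 2
cluster_ids = [0, 2, 2, 1, 1, 0, 2, 1]
true_labels = [0, 1, 1, 2, 2, 0, 1, 2]

mapping = map_clusters_to_classes(cluster_ids, true_labels)
relabelled = [mapping[c] for c in cluster_ids]
print(mapping)      # {0: 0, 2: 1, 1: 2}
print(relabelled)   # [0, 1, 1, 2, 2, 0, 1, 2]
```

A majority vote can send two clusters to the same class when the clusters are poorly separated, so the resulting mapping is worth checking before it is used to score predictions.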

In [91]:
# Same colour per class label in both panels
real_colors = [colors[int(label)] for label in target_test]
predicted_colors = [colors[int(label)] for label in prediction]

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.scatter(pca_transformed[:, 0], pca_transformed[:, 1], c=real_colors)
plt.title('PCA on Iris data, real classes');
plt.subplot(122)
plt.scatter(pca_transformed[:, 0], pca_transformed[:, 1], c=predicted_colors)
plt.title('PCA on Iris data, predicted classes');

The figure shows that our simple classifier did a good job of identifying the classes, despite a few mistakes.

In [78]:
clust_df = predictors_train.reset_index(level=[0])
clust_df.drop('index', axis=1, inplace=True)
clust_df['cluster'] = classifier.labels_

In [79]:clust_df.head()

Out[79]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  cluster
0                5.7               2.8                4.5               1.3        0
1                5.6               2.7                4.2               1.3        0
2                7.1               3.0                5.9               2.1        2
3                6.5               3.0                5.8               2.2        2
4                5.9               3.0                4.2               1.5        0

In [80]:
print('Clustering variable means by cluster')
clust_df.groupby('cluster').mean()

Clustering variable means by cluster

Out[80]:
         sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
cluster
0                 5.859091          2.790909           4.343182          1.415909
1                 4.989744          3.425641           1.471795          0.248718
2                 6.886364          3.090909           5.854545          2.077273

In [92]:
# Rows hold the true classes (first argument), columns the predicted ones
print('Confusion matrix:\n', pd.crosstab(target_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True))
print('\nAccuracy: ', accuracy_score(target_test, prediction))

Confusion matrix:
 Predicted   0   1   2  All
Actual
0.0        11   0   0   11
1.0         0  11   1   12
2.0         0   7  15   22
All        11  18  16   45

Accuracy:  0.8222222222222222
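As a quick check on how the accuracy follows from the confusion matrix, the sketch below recomputes it from the counts printed above: correct predictions lie on the diagonal, and accuracy is their share of all test observations. The per-class figure is the share of each true class that was recovered; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix above (rows: true class, columns: predicted)
confusion = [
    [11, 0, 0],
    [0, 11, 1],
    [0, 7, 15],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Share of each true class that was assigned to the right cluster
per_class = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

print(correct, total)        # 37 45
print(round(accuracy, 4))    # 0.8222
print([round(r, 3) for r in per_class])
```

Most of the errors come from the third row: 7 of the 22 observations of class 2 were placed in the cluster mapped to class 1.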


samrats · 3 years ago


Task

This week’s assignment involves running a lasso regression analysis. Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients are most strongly associated with the response variable. Explanatory variables can be either quantitative, categorical or both.
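The shrinkage can be illustrated with the soft-thresholding rule. In the special case of orthonormal predictors, the lasso estimate is the least-squares coefficient moved toward zero by the penalty and set to exactly zero when it is smaller than the penalty. The sketch below assumes that special case; the coefficient values and the penalty are made up, and this is not the least angle regression procedure used later in the post.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it does not exceed the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical least-squares coefficients and penalty
ols_coefficients = [3.0, -0.4, 1.5, 0.2]
penalty = 0.5

lasso_coefficients = [soft_threshold(b, penalty) for b in ols_coefficients]
print(lasso_coefficients)   # [2.5, 0.0, 1.0, 0.0]
```

The two small coefficients are removed from the model entirely, while the two large ones are kept at a reduced size.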

Your assignment is to run a lasso regression analysis using k-fold cross validation to identify a subset of predictors from a larger pool of predictor variables that best predicts a quantitative response variable.

Data

Dataset description: bike rental data spanning two years. The file used here holds one row per day (731 observations).

Dataset can be found at Kaggle

Features:

yr - year

mnth - month

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

weekday - day of the week, coded 0 to 6

workingday - whether the day is neither a weekend nor holiday

weathersit - 1: Clear, Few clouds, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp - temperature in Celsius

atemp - "feels like" temperature in Celsius

hum - relative humidity

windspeed (mph) - wind speed, miles per hour

windspeed (ms) - wind speed, metres per second

Target:

cnt - number of total rentals

Results

A lasso regression analysis was conducted to identify the subset of a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable, the total number of bike rentals. Categorical predictors included weather condition and two binary variables, holiday and workingday. Quantitative predictor variables included year, month, temperature, humidity and wind speed.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
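To show what k=10 fold cross validation does with the training rows, the sketch below builds the folds by hand: every row is used for validation exactly once and for fitting in the other nine rounds. The helper is hypothetical and does not shuffle; the figure of 511 training rows is an assumption (it is what a 70/30 split of 731 rows gives if the test set is rounded up to 220).

```python
def k_fold_indices(n_samples, k):
    """Return k (training indices, validation indices) pairs over range(n_samples)."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        folds.append((training, validation))
        start += size
    return folds


folds = k_fold_indices(n_samples=511, k=10)   # 511 is an assumed training-set size
validation_sizes = [len(validation) for _, validation in folds]
print(validation_sizes)   # [52, 51, 51, 51, 51, 51, 51, 51, 51, 51]
```

The mean squared error is computed on each validation fold for every candidate penalty, and the penalty with the lowest average error across the folds is kept.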

Of the 12 predictor variables, 10 were retained in the selected model:

atemp: 63.56915200306693

holiday: -282.431748735072

hum: -12.815264427009353

mnth: 0.0

season: 381.77762475080044

temp: 58.035647703871234

weathersit: -514.6381162101678

weekday: 69.84812053893549

windspeed(mph): 0.0

windspeed(ms): -95.71090321577515

workingday: 36.15135752613271

yr: 2091.5182927517903

Train data R-square 0.7899877818517489

Test data R-square 0.8131871527614188

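In the selected model, year and season had the largest positive coefficients for the total number of bike rentals, followed by weekday and temperature. Holiday, humidity, weather condition and wind speed (m/s) were negatively associated with the total number of bike rentals. The coefficients are reported on the original scale of each variable, so their magnitudes are not directly comparable across predictors measured in different units.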

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
import seaborn as sns
%matplotlib inline

rnd_state = 983

In [2]:
data = pd.read_csv("data/bikes_rent.csv")
data.info()

RangeIndex: 731 entries, 0 to 730
Data columns (total 13 columns):
season            731 non-null int64
yr                731 non-null int64
mnth              731 non-null int64
holiday           731 non-null int64
weekday           731 non-null int64
workingday        731 non-null int64
weathersit        731 non-null int64
temp              731 non-null float64
atemp             731 non-null float64
hum               731 non-null float64
windspeed(mph)    731 non-null float64
windspeed(ms)     731 non-null float64
cnt               731 non-null int64
dtypes: float64(5), int64(8)
memory usage: 74.3 KB

In [3]:data.describe()

Out[3]:
           season          yr        mnth     holiday     weekday  workingday  weathersit
count  731.000000  731.000000  731.000000  731.000000  731.000000  731.000000  731.000000
mean     2.496580    0.500684    6.519836    0.028728    2.997264    0.683995    1.395349
std      1.110807    0.500342    3.451913    0.167155    2.004787    0.465233    0.544894
min      1.000000    0.000000    1.000000    0.000000    0.000000    0.000000    1.000000
25%      2.000000    0.000000    4.000000    0.000000    1.000000    0.000000    1.000000
50%      3.000000    1.000000    7.000000    0.000000    3.000000    1.000000    1.000000
75%      3.000000    1.000000   10.000000    0.000000    5.000000    1.000000    2.000000
max      4.000000    1.000000   12.000000    1.000000    6.000000    1.000000    3.000000

             temp       atemp         hum  windspeed(mph)  windspeed(ms)          cnt
count  731.000000  731.000000  731.000000      731.000000     731.000000   731.000000
mean    20.310776   23.717699   62.789406       12.762576       5.705220  4504.348837
std      7.505091    8.148059   14.242910        5.192357       2.321125  1937.211452
min      2.424346    3.953480    0.000000        1.500244       0.670650    22.000000
25%     13.820424   16.892125   52.000000        9.041650       4.041864  3152.000000
50%     20.431653   24.336650   62.666700       12.125325       5.420351  4548.000000
75%     26.872077   30.430100   73.020850       15.625371       6.984967  5956.000000
max     35.328347   42.044800   97.250000       34.000021      15.198937  8714.000000

In [4]:data.head()

Out[4]:
   season  yr  mnth  holiday  weekday  workingday  weathersit       temp     atemp      hum  windspeed(mph)  windspeed(ms)   cnt
0       1   0     1        0        6           0           2  14.110847  18.18125  80.5833       10.749882       4.805490   985
1       1   0     1        0        0           0           2  14.902598  17.68695  69.6087       16.652113       7.443949   801
2       1   0     1        0        1           1           1   8.050924   9.47025  43.7273       16.636703       7.437060  1349
3       1   0     1        0        2           1           1   8.200000  10.60610  59.0435       10.739832       4.800998  1562
4       1   0     1        0        3           1           1   9.305237  11.46350  43.6957       12.522300       5.597810  1600

In [5]:data.dropna(inplace=True)

In [17]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 10))
# One scatter plot of cnt against each of the 12 predictors, laid out on a 3 x 4 grid
for idx, feature in enumerate(data.columns.values[:-1]):
    data.plot(feature, 'cnt', subplots=True, kind='scatter', ax=axes[idx // 4, idx % 4], c='#87486e');

The plot above shows that there is a linear dependence between temp, atemp and cnt features. The correlations below confirm that observation.

In [7]:data.iloc[:, :12].corrwith(data['cnt'])

Out[7]:
season            0.406100
yr                0.566710
mnth              0.279977
holiday          -0.068348
weekday           0.067443
workingday        0.061156
weathersit       -0.297391
temp              0.627494
atemp             0.631066
hum              -0.100659
windspeed(mph)   -0.234545
windspeed(ms)    -0.234545
dtype: float64

In [8]:plt.figure(figsize=(15, 5)) sns.heatmap(data[['temp', 'atemp', 'hum', 'windspeed(mph)', 'windspeed(ms)', 'cnt']].corr(), annot=True, fmt='1.4f');

There is a strong correlation between temp and atemp, as well as between windspeed(mph) and windspeed(ms), because each pair represents the same quantity measured in different ways. In further analysis, one feature from each pair should be dropped, or a regularization penalty (L2 or lasso) should be applied.
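The wind speed pair is the extreme case: one column is the other multiplied by a constant, so the Pearson correlation is essentially 1 and the second column adds no information. The sketch below checks this on the five wind speed values shown by data.head(); the `pearson` helper is written out by hand for illustration and is not part of the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# First five rows of the two wind speed columns, copied from data.head()
windspeed_mph = [10.749882, 16.652113, 16.636703, 10.739832, 12.522300]
windspeed_ms = [4.805490, 7.443949, 7.437060, 4.800998, 5.597810]

r = pearson(windspeed_mph, windspeed_ms)
ratios = [mph / ms for mph, ms in zip(windspeed_mph, windspeed_ms)]
print(round(r, 4))
print([round(ratio, 3) for ratio in ratios])   # the same conversion factor in every row
```

This matches the identical correlations with cnt reported above for the two wind speed columns (-0.234545 each).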

In [9]:
predictors = data.iloc[:, :12]
target = data['cnt']

In [10]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)

In [11]:model = LassoLarsCV(cv=10, precompute=False).fit(predictors_train, target_train)

In [12]:dict(zip(predictors.columns, model.coef_))

Out[12]:
{'atemp': 63.56915200306693,
 'holiday': -282.431748735072,
 'hum': -12.815264427009353,
 'mnth': 0.0,
 'season': 381.77762475080044,
 'temp': 58.035647703871234,
 'weathersit': -514.6381162101678,
 'weekday': 69.84812053893549,
 'windspeed(mph)': 0.0,
 'windspeed(ms)': -95.71090321577515,
 'workingday': 36.15135752613271,
 'yr': 2091.5182927517903}

In [13]:
log_alphas = -np.log10(model.alphas_)
plt.figure(figsize=(10, 5))
# One line per feature: its coefficient along the regularization path
for idx, feature in enumerate(predictors.columns):
    plt.plot(log_alphas, list(map(lambda r: r[idx], model.coef_path_.T)), label=feature)
plt.legend(loc="upper right", bbox_to_anchor=(1.4, 0.95))
plt.xlabel("-log10(alpha)")
plt.ylabel("Feature weight")
plt.title("Lasso");

In [14]:
log_cv_alphas = -np.log10(model.cv_alphas_)
plt.figure(figsize=(10, 5))
plt.plot(log_cv_alphas, model.mse_path_, ':')
plt.plot(log_cv_alphas, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log10(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold');

In [16]:
rsquared_train = model.score(predictors_train, target_train)
rsquared_test = model.score(predictors_test, target_test)
print('Train data R-square', rsquared_train)
print('Test data R-square', rsquared_test)

Train data R-square 0.7899877818517489
Test data R-square 0.8131871527614188
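For reference, the R-square reported by score() is the share of the variance in the response that the predictions account for: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up numbers (the function name and the values are illustrative only):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical observed and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(actual, predicted)
print(r2)   # 0.95
```

A value of about 0.81 on the test set therefore means the model leaves roughly 19% of the variance in daily rentals unexplained.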


samrats · 3 years ago


Task

The second assignment deals with Random Forests. Random forests are predictive models that allow for a data driven exploration of many explanatory variables in predicting a response or target variable. Random forests provide importance scores for each explanatory variable and also allow you to evaluate any increases in correct classification with the growing of smaller and larger number of trees.

Run a Random Forest.

You will need to perform a random forest analysis to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable.

Data

The dataset is related to red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

Dataset can be found at UCI Machine Learning Repository

Attribute Information (For more information, read [Cortez et al., 2009]): Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Results

Random forest and ExtraTrees classifier were deployed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable - red wine quality (score between 0 and 10). The following explanatory variables were included: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.

The explanatory variables with the highest importance scores (according to both classifiers) are alcohol, volatile acidity and sulphates. The accuracy of both the Random forest and the ExtraTrees classifier is about 67%, which is quite good for classes that are highly unbalanced and hard to distinguish from each other. Growing multiple trees rather than a single tree adds a lot to the overall score of the model. The number of estimators is 20 for Random forest and 12 for the ExtraTrees classifier, because the accuracy of the second classifier rises much faster as trees are added.

Code

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import accuracy_score
import seaborn as sns
%matplotlib inline

rnd_state = 4536

In [2]:
# Forward slash avoids the invalid "\w" escape and also works outside Windows
data = pd.read_csv('Data/winequality-red.csv', sep=';')
data.info()

RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

In [3]:data.head()

Out[3]:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide
0            7.4              0.70         0.00             1.9      0.076                 11.0
1            7.8              0.88         0.00             2.6      0.098                 25.0
2            7.8              0.76         0.04             2.3      0.092                 15.0
3           11.2              0.28         0.56             1.9      0.075                 17.0
4            7.4              0.70         0.00             1.9      0.076                 11.0

   total sulfur dioxide  density    pH  sulphates  alcohol  quality
0                  34.0   0.9978  3.51       0.56      9.4        5
1                  67.0   0.9968  3.20       0.68      9.8        5
2                  54.0   0.9970  3.26       0.65      9.8        5
3                  60.0   0.9980  3.16       0.58      9.8        6
4                  34.0   0.9978  3.51       0.56      9.4        5

In [4]:data.describe()

Out[4]:
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000

       total sulfur dioxide      density           pH    sulphates      alcohol      quality
count           1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean              46.467792     0.996747     3.311113     0.658149    10.422983     5.636023
std               32.895324     0.001887     0.154386     0.169507     1.065668     0.807569
min                6.000000     0.990070     2.740000     0.330000     8.400000     3.000000
25%               22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%               38.000000     0.996750     3.310000     0.620000    10.200000     6.000000
75%               62.000000     0.997835     3.400000     0.730000    11.100000     6.000000
max              289.000000     1.003690     4.010000     2.000000    14.900000     8.000000

Plots

For visualization purposes, the number of dimensions was reduced to two by applying MDS method with cosine distance. The plot illustrates that our classes are not clearly divided into parts.

In [5]:
model = MDS(random_state=rnd_state, n_components=2, dissimilarity='precomputed')
%time representation = model.fit_transform(pairwise_distances(data.iloc[:, :11], metric='cosine'))

Wall time: 38.7 s

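In [6]:
colors = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
# Colour each point by its quality score (3 to 8) so the scatter matches the count plot palette;
# passing the six-colour list directly would not give one colour per observation
quality_levels = sorted(data.quality.unique())
color_by_quality = dict(zip(quality_levels, colors))

plt.figure(figsize=(12, 4))
plt.subplot(121)
plt.scatter(representation[:, 0], representation[:, 1], c=data.quality.map(color_by_quality).tolist())
plt.subplot(122)
sns.countplot(x='quality', data=data, palette=sns.color_palette(colors));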

Moreover, our classes are highly unbalanced, so in our classifier we should add parameter class_weight='balanced'.
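To give a feel for what class_weight='balanced' does, the sketch below applies the weighting rule documented for scikit-learn, n_samples / (n_classes * class count), to a set of class counts. The counts are the test-set row totals from the confusion matrices in this post, used purely as an illustration; the classifier itself derives its weights from the training labels, so the actual values differ.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Quality score -> number of wines (illustrative: test-set totals, 480 wines)
class_counts = {3: 3, 4: 16, 5: 213, 6: 191, 7: 53, 8: 4}

weights = balanced_class_weights(class_counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

Rare scores such as 3 and 8 receive weights in the tens, while the common scores 5 and 6 receive weights below 1, so a mistake on a rare wine costs the model far more than a mistake on a common one.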

In [7]:
predictors = data.iloc[:, :11]
target = data.quality

In [8]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)

RandomForest classifier

In [9]:
# Cross-validated accuracy for forests of 1, 6, 11, ..., 46 trees
list_estimators = list(range(1, 50, 5))
rf_scoring = []
for n_estimators in list_estimators:
    classifier = RandomForestClassifier(random_state=rnd_state, n_jobs=-1, class_weight='balanced', n_estimators=n_estimators)
    score = cross_val_score(classifier, predictors_train, target_train, cv=5, n_jobs=-1, scoring='accuracy')
    rf_scoring.append(score.mean())

In [10]:
plt.plot(list_estimators, rf_scoring)
plt.title('Accuracy VS trees number');

In [11]:
classifier = RandomForestClassifier(random_state=rnd_state, n_jobs=-1, class_weight='balanced', n_estimators=20)
classifier.fit(predictors_train, target_train)

Out[11]:RandomForestClassifier(bootstrap=True, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1, oob_score=False, random_state=4536, verbose=0, warm_start=False)

In [12]:prediction = classifier.predict(predictors_test)

In [13]:
print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Predicted'], rownames=['Actual'], margins=True))
print('\nAccuracy: ', accuracy_score(target_test, prediction))

Confusion matrix:
 Predicted  3  4    5    6   7  All
Actual
3          0  0    3    0   0    3
4          0  1    9    6   0   16
5          2  1  166   41   3  213
6          0  0   46  131  14  191
7          0  0    5   25  23   53
8          0  0    0    3   1    4
All        2  2  229  206  41  480

Accuracy:  0.66875

In [14]:
feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[:11]).sort_values(ascending=False)
feature_importance

Out[14]:
volatile acidity        0.133023
alcohol                 0.130114
sulphates               0.129498
citric acid             0.106427
total sulfur dioxide    0.094647
chlorides               0.086298
density                 0.079843
pH                      0.066566
residual sugar          0.061344
fixed acidity           0.058251
free sulfur dioxide     0.053990
dtype: float64

In [15]:
# Same sweep over the number of trees, this time for the ExtraTrees classifier
et_scoring = []
for n_estimators in list_estimators:
    classifier = ExtraTreesClassifier(random_state=rnd_state, n_jobs=-1, class_weight='balanced', n_estimators=n_estimators)
    score = cross_val_score(classifier, predictors_train, target_train, cv=5, n_jobs=-1, scoring='accuracy')
    et_scoring.append(score.mean())

In [16]:
plt.plot(list_estimators, et_scoring)
plt.title('Accuracy VS trees number');

In [17]:
classifier = ExtraTreesClassifier(random_state=rnd_state, n_jobs=-1, class_weight='balanced', n_estimators=12)
classifier.fit(predictors_train, target_train)

Out[17]:ExtraTreesClassifier(bootstrap=False, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=12, n_jobs=-1, oob_score=False, random_state=4536, verbose=0, warm_start=False)

In [18]:prediction = classifier.predict(predictors_test)

In [19]:
print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Predicted'], rownames=['Actual'], margins=True))
print('\nAccuracy: ', accuracy_score(target_test, prediction))

Confusion matrix:
 Predicted  3  4    5    6   7  8  All
Actual
3          0  1    2    0   0  0    3
4          0  0    9    7   0  0   16
5          2  2  168   39   2  0  213
6          0  0   49  130  11  1  191
7          0  0    2   27  24  0   53
8          0  0    0    3   1  0    4
All        2  3  230  206  38  1  480

Accuracy:  0.6708333333333333

In [20]:
feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[:11]).sort_values(ascending=False)
feature_importance

Out[20]:
alcohol                 0.157267
volatile acidity        0.132768
sulphates               0.100874
citric acid             0.095077
density                 0.082334
chlorides               0.079283
total sulfur dioxide    0.076803
pH                      0.074638
fixed acidity           0.069826
residual sugar          0.066551
free sulfur dioxide     0.064579
dtype: float64


samrats · 3 years ago


Task

This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be either quantitative, categorical or both. Decision trees create segmentations or subgroups in the data by repeatedly applying a series of simple rules or criteria that choose the variable constellations that best predict the response (i.e. target) variable.

Run a Classification Tree.

You will need to perform a decision tree analysis to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable.

Data

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

Dataset can be found at UCI Machine Learning Repository

In this Assignment the Decision tree has been applied to classification of breast cancer detection.

Attribute Information:

id - ID number

diagnosis (M = malignant, B = benign)

3-32 extra features

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

Results

Generated decision tree can be found below:

In [17]:
img

Out[17]: [image: rendered decision tree]

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable (breast cancer diagnosis: malignant or benign).

The dataset was split into train and test samples in a 70/30 ratio.

After fitting the classifier the key metrics were calculated - confusion matrix and accuracy = 0.924. This is a good result for a model trained on a small dataset.

From decision tree we can observe:

Malignant tumors tend to have much higher values of area, texture and concave points, while the values for benign tumors are significantly lower.

The most important features are:

concave points_worst = 0.707688

area_worst = 0.114771

concave points_mean = 0.034234

fractal_dimension_se = 0.026301

texture_worst = 0.026300

area_se = 0.025201

concavity_se = 0.024540

texture_mean = 0.023671

perimeter_mean = 0.010415

concavity_mean = 0.006880

Code

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import *
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
%matplotlib inline

rnd_state = 23468

Load data

In [2]:
data = pd.read_csv('Data/breast_cancer.csv')
data.info()

RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

In the output above there is an empty column 'Unnamed: 32', so next it should be dropped.

In [3]:
data.drop('Unnamed: 32', axis=1, inplace=True)
data.diagnosis = np.where(data.diagnosis=='M', 1, 0)  # Encode diagnosis as binary: 1 = malignant, 0 = benign
data.describe()

Out[3]:
                 id   diagnosis  radius_mean  texture_mean  perimeter_mean
count  5.690000e+02  569.000000   569.000000    569.000000      569.000000
mean   3.037183e+07    0.372583    14.127292     19.289649       91.969033
std    1.250206e+08    0.483918     3.524049      4.301036       24.298981
min    8.670000e+03    0.000000     6.981000      9.710000       43.790000
25%    8.692180e+05    0.000000    11.700000     16.170000       75.170000
50%    9.060240e+05    0.000000    13.370000     18.840000       86.240000
75%    8.813129e+06    1.000000    15.780000     21.800000      104.100000
max    9.113205e+08    1.000000    28.110000     39.280000      188.500000

         area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean
count   569.000000       569.000000        569.000000      569.000000           569.000000
mean    654.889104         0.096360          0.104341        0.088799             0.048919
std     351.914129         0.014064          0.052813        0.079720             0.038803
min     143.500000         0.052630          0.019380        0.000000             0.000000
25%     420.300000         0.086370          0.064920        0.029560             0.020310
50%     551.100000         0.095870          0.092630        0.061540             0.033500
75%     782.700000         0.105300          0.130400        0.130700             0.074000
max    2501.000000         0.163400          0.345400        0.426800             0.201200

...

       radius_worst  texture_worst  perimeter_worst   area_worst  smoothness_worst
count    569.000000     569.000000       569.000000   569.000000        569.000000
mean      16.269190      25.677223       107.261213   880.583128          0.132369
std        4.833242       6.146258        33.602542   569.356993          0.022832
min        7.930000      12.020000        50.410000   185.200000          0.071170
25%       13.010000      21.080000        84.110000   515.300000          0.116600
50%       14.970000      25.410000        97.660000   686.500000          0.131300
75%       18.790000      29.720000       125.400000  1084.000000          0.146000
max       36.040000      49.540000       251.200000  4254.000000          0.222600

       compactness_worst  concavity_worst  concave points_worst  symmetry_worst  fractal_dimension_worst
count         569.000000       569.000000            569.000000      569.000000               569.000000
mean            0.254265         0.272188              0.114606        0.290076                 0.083946
std             0.157336         0.208624              0.065732        0.061867                 0.018061
min             0.027290         0.000000              0.000000        0.156500                 0.055040
25%             0.147200         0.114500              0.064930        0.250400                 0.071460
50%             0.211900         0.226700              0.099930        0.282200                 0.080040
75%             0.339100         0.382900              0.161400        0.317900                 0.092080
max             1.058000         1.252000              0.291000        0.663800                 0.207500

8 rows × 32 columns

In [4]:data.head()

Out[4]:
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| 0 | 842302 | 1 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 842517 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 84300903 | 1 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 84348301 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 84358402 | 1 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |

5 rows × 32 columns

Plots

For visualization, the 30 features were reduced to two dimensions with t-SNE. The plot shows that the two classes are not cleanly separated into two groups, so a nonlinear method such as a decision tree may handle this problem well.

In [15]:model = TSNE(random_state=rnd_state, n_components=2)
representation = model.fit_transform(data.iloc[:, 2:])

In [16]:plt.scatter(representation[:, 0], representation[:, 1], c=data.diagnosis, alpha=0.5, cmap=plt.cm.get_cmap('Set1', 2))
plt.colorbar(ticks=range(2));

Decision tree

In [6]:predictors = data.iloc[:, 2:]
target = data.diagnosis

To train the decision tree, the dataset was split into training and test samples in a 70/30 proportion.

In [7]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)

In [8]:print('predictors_train:', predictors_train.shape)
print('predictors_test:', predictors_test.shape)
print('target_train:', target_train.shape)
print('target_test:', target_test.shape)

predictors_train: (398, 30)
predictors_test: (171, 30)
target_train: (398,)
target_test: (171,)
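The 398/171 shapes follow from simple arithmetic on the 569 rows. A minimal sketch, assuming the test part is rounded up and the training part takes the remainder (this reproduces the sizes printed above; the helper name `split_sizes` is hypothetical and not part of the notebook):

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a fractional hold-out split.

    Assumption: the test size is rounded up and the training set
    receives whatever rows are left.
    """
    n_test = math.ceil(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(569, 0.3)
print(n_train, n_test)  # 398 171
```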

In [9]:print(np.sum(target_train==0))
print(np.sum(target_train==1))

253
145

The training sample is reasonably balanced (253 samples of class 0 versus 145 of class 1), so it does not need rebalancing.

In [10]:classifier = DecisionTreeClassifier(random_state = rnd_state).fit(predictors_train, target_train)
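No `criterion` argument is passed above, so the tree presumably uses scikit-learn's default Gini criterion. Under that assumption, the impurity of the root node (before any split) can be computed by hand from the training class counts 253 and 145 printed earlier. The helper `gini_impurity` is a hypothetical name used only for this sketch:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Class counts of the training sample: 253 of class 0, 145 of class 1.
root_impurity = gini_impurity([253, 145])
print(round(root_impurity, 4))  # 0.4632
```

A perfectly pure node has impurity 0 and an evenly mixed two-class node has 0.5, so the tree starts close to the maximum and each split is chosen to reduce this value.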

In [11]:prediction = classifier.predict(predictors_test)

In [12]:# Rows hold the true classes (target_test) and columns the predictions, so label them accordingly
print('Confusion matrix:\n', pd.crosstab(target_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True))
print('\nAccuracy: ', accuracy_score(target_test, prediction))

Confusion matrix:
 Predicted    0   1  All
Actual
0            96   8  104
1             5  62   67
All         101  70  171

Accuracy:  0.9239766081871345
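The accuracy can be checked by hand, and the same four counts give a few other metrics. Since `pd.crosstab` puts its first argument (`target_test`) on the rows, each row holds a true class and each column a predicted class. A minimal sketch with the counts copied from the matrix; the variable names are illustrative only:

```python
# Counts copied from the confusion matrix
# (rows: true class of the test samples, columns: predicted class).
tn, fp = 96, 8   # true class 0: predicted 0, predicted 1
fn, tp = 5, 62   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of class-1 samples that were found
precision = tp / (tp + fp)     # share of class-1 predictions that were right
specificity = tn / (tn + fp)   # share of class-0 samples that were found

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} specificity={specificity:.4f}")
```

The accuracy of 158/171 matches the value reported by `accuracy_score`.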

In [13]:out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=predictors_train.columns.values, proportion=True, filled=True)
graph = pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
with open('output.png', 'wb') as f:
    f.write(img.data)

In [14]:feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[2:]).sort_values(ascending=False)
feature_importance

Out[14]:
concave points_worst       0.707688
area_worst                 0.114771
concave points_mean        0.034234
fractal_dimension_se       0.026301
texture_worst              0.026300
area_se                    0.025201
concavity_se               0.024540
texture_mean               0.023671
perimeter_mean             0.010415
concavity_mean             0.006880
fractal_dimension_worst    0.000000
fractal_dimension_mean     0.000000
symmetry_mean              0.000000
compactness_mean           0.000000
texture_se                 0.000000
smoothness_mean            0.000000
area_mean                  0.000000
radius_se                  0.000000
smoothness_se              0.000000
perimeter_se               0.000000
symmetry_worst             0.000000
compactness_se             0.000000
concave points_se          0.000000
symmetry_se                0.000000
radius_worst               0.000000
perimeter_worst            0.000000
smoothness_worst           0.000000
compactness_worst          0.000000
concavity_worst            0.000000
radius_mean                0.000000
dtype: float64
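A short summary of these importances, as a sketch that uses only the non-zero values copied from Out[14] (the dictionary and variable names are illustrative, not part of the notebook). It checks that the values sum to roughly 1, as expected for normalized importances, and shows how much of the total the top two features carry:

```python
# Non-zero importances copied from Out[14]; the remaining 20 features are 0.
importances = {
    "concave points_worst": 0.707688,
    "area_worst": 0.114771,
    "concave points_mean": 0.034234,
    "fractal_dimension_se": 0.026301,
    "texture_worst": 0.026300,
    "area_se": 0.025201,
    "concavity_se": 0.024540,
    "texture_mean": 0.023671,
    "perimeter_mean": 0.010415,
    "concavity_mean": 0.006880,
}

n_used = len(importances)                      # features the tree relies on
total = sum(importances.values())              # close to 1 up to rounding
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
top_two_share = ranked[0][1] + ranked[1][1]    # share of the two top features

print(n_used, round(total, 4), round(top_two_share, 4))
```

Only 10 of the 30 features are used by the tree, and 'concave points_worst' together with 'area_worst' accounts for about 82% of the total importance.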
