samrats
samrats
Untitled
4 posts
Don't wanna be here? Send us removal request.
samrats · 3 years ago
Text
Task
This week’s assignment involves running a k-means cluster analysis. Cluster analysis is an unsupervised machine learning method that partitions the observations in a data set into a smaller set of clusters where each observation belongs to only one cluster. The goal of cluster analysis is to group, or cluster, observations into subsets based on their similarity of responses on multiple variables. Clustering variables should be primarily quantitative variables, but binary variables may also be included.
Your assignment is to run a k-means cluster analysis to identify subgroups of observations in your data set that have similar patterns of response on a set of clustering variables.
Data
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Predicted attribute: class of iris plant.
Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
Iris Setosa
Iris Versicolour
Iris Virginica
Results
A k-means cluster analysis was conducted to identify classes of iris plants based on their similarity of responses on 4 variables that represent characteristics of the each plant bud. Clustering variables included 4 quantitative variables such as: sepal length, sepal width, petal length, and petal width.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. Then k-means cluster analyses was conducted on the training data specifying k=3 clusters (representing three classes: Iris Setosa, Iris Versicolour, Iris Virginica), using Euclidean distance.
To describe the performance of a classifier and see what types of errors our classifier is making a confusion matrix was created. The accuracy score is 0.82, which is quite good due to the small number of observation (n=150).
In [73]:import numpy as np import pandas as pd import matplotlib.pylab as plt from sklearn.model_selection import train_test_split from sklearn import datasets from sklearn.cluster import KMeans from sklearn.metrics import accuracy_score from sklearn.decomposition import PCA import seaborn as sns %matplotlib inline rnd_state = 3927
In [2]:iris = datasets.load_iris() data = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target']) data.head()
Out[2]:sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)target05.13.51.40.20.014.93.01.40.20.024.73.21.30.20.034.63.11.50.20.045.03.61.40.20.0
In [66]:data.info() RangeIndex: 150 entries, 0 to 149 Data columns (total 5 columns): sepal length (cm) 150 non-null float64 sepal width (cm) 150 non-null float64 petal length (cm) 150 non-null float64 petal width (cm) 150 non-null float64 target 150 non-null float64 dtypes: float64(5) memory usage: 5.9 KB
In [3]:data.describe()
Out[3]:sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)targetcount150.000000150.000000150.000000150.000000150.000000mean5.8433333.0540003.7586671.1986671.000000std0.8280660.4335941.7644200.7631610.819232min4.3000002.0000001.0000000.1000000.00000025%5.1000002.8000001.6000000.3000000.00000050%5.8000003.0000004.3500001.3000001.00000075%6.4000003.3000005.1000001.8000002.000000max7.9000004.4000006.9000002.5000002.000000
In [4]:pca_transformed = PCA(n_components=2).fit_transform(data.iloc[:, :4])
In [7]:colors=["#9b59b6", "#e74c3c", "#2ecc71"] plt.figure(figsize=(12,5)) plt.subplot(121) plt.scatter(list(map(lambda tup: tup[0], pca_transformed)), list(map(lambda tup: tup[1], pca_transformed)), c=list(map(lambda col: "#9b59b6" if col==0 else "#e74c3c" if col==1 else "#2ecc71", data.target))) plt.title('PCA on Iris data') plt.subplot(122) sns.countplot(data.target, palette=sns.color_palette(colors)) plt.title('Countplot Iris classes');
For visualization purposes, the number of dimensions was reduced to two by applying PCA analysis. The plot illustrates that classes 1 and 2 are not clearly divided. Countplot illustrates that our classes contain the same number of observations (n=50), so they are balanced.
In [85]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(data.iloc[:, :4], data.target, test_size = .3, random_state = rnd_state)
In [86]:classifier = KMeans(n_clusters=3).fit(predictors_train) prediction = classifier.predict(predictors_test)
In [87]:pca_transformed = PCA(n_components=2).fit_transform(predictors_test)
Predicted classes 1 and 2 mismatch the real ones, so the code block below fixes that problem.
In [88]:prediction = np.where(prediction==1, 3, prediction) prediction = np.where(prediction==2, 1, prediction) prediction = np.where(prediction==3, 2, prediction)
In [91]:plt.figure(figsize=(12,5)) plt.subplot(121) plt.scatter(list(map(lambda tup: tup[0], pca_transformed)), list(map(lambda tup: tup[1], pca_transformed)), c=list(map(lambda col: "#9b59b6" if col==0 else "#e74c3c" if col==1 else "#2ecc71", target_test))) plt.title('PCA on Iris data, real classes'); plt.subplot(122) plt.scatter(list(map(lambda tup: tup[0], pca_transformed)), list(map(lambda tup: tup[1], pca_transformed)), c=list(map(lambda col: "#9b59b6" if col==0 else "#e74c3c" if col==1 else "#2ecc71", prediction))) plt.title('PCA on Iris data, predicted classes');
The figure shows that our simple classifier did a good job in identifing the classes, despite the few mistakes.
In [78]:clust_df = predictors_train.reset_index(level=[0]) clust_df.drop('index', axis=1, inplace=True) clust_df['cluster'] = classifier.labels_
In [79]:clust_df.head()
Out[79]:sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)cluster05.72.84.51.3015.62.74.21.3027.13.05.92.1236.53.05.82.2245.93.04.21.50
In [80]:print ('Clustering variable means by cluster') clust_df.groupby('cluster').mean() Clustering variable means by cluster
Out[80]:sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)cluster05.8590912.7909094.3431821.41590914.9897443.4256411.4717950.24871826.8863643.0909095.8545452.077273
In [92]:print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Actual'], rownames=['Predicted'], margins=True)) print('\nAccuracy: ', accuracy_score(target_test, prediction)) Confusion matrix: Actual 0 1 2 All Predicted 0.0 11 0 0 11 1.0 0 11 1 12 2.0 0 7 15 22 All 11 18 16 45 Accuracy: 0.8222222222222222
0 notes
samrats · 3 years ago
Text
Task
This week’s assignment involves running a lasso regression analysis. Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Explanatory variables can be either quantitative, categorical or both.
Your assignment is to run a lasso regression analysis using k-fold cross validation to identify a subset of predictors from a larger pool of predictor variables that best predicts a quantitative response variable.
Data
Dataset description: hourly rental data spanning two years.
Dataset can be found at Kaggle
Features:
yr - year
mnth - month
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weathersit - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
hum - relative humidity
windspeed (mph) - wind speed, miles per hour
windspeed (ms) - wind speed, metre per second
Target:
cnt - number of total rentals
Results
A lasso regression analysis was conducted to predict a number of total bikes rentals from a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable. Categorical predictors included weather condition and a series of 2 binary categorical variables for holiday and workingday to improve interpretability of the selected model with fewer predictors. Quantitative predictor variables include year, month, temperature, humidity and wind speed.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Of the 12 predictor variables, 10 were retained in the selected model:
atemp: 63.56915200306693
holiday: -282.431748735072
hum: -12.815264427009353
mnth: 0.0
season: 381.77762475080044
temp: 58.035647703871234
weathersit: -514.6381162101678
weekday: 69.84812053893549
windspeed(mph): 0.0
windspeed(ms): -95.71090321577515
workingday: 36.15135752613271
yr: 2091.5182927517903
Train data R-square 0.7899877818517489 Test data R-square 0.8131871527614188
During the estimation process, year and season were most strongly associated with the number of total bikes rentals, followed by temperature and weekday. Holiday, humidity, weather condition and wind speed (ms) were negatively associated with the number of total bikes rentals.
In [1]:import pandas as pd import numpy as np import matplotlib.pylab as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LassoLarsCV from sklearn import preprocessing from sklearn.metrics import mean_squared_error import seaborn as sns %matplotlib inline rnd_state = 983
In [2]:data = pd.read_csv("data/bikes_rent.csv") data.info() RangeIndex: 731 entries, 0 to 730 Data columns (total 13 columns): season 731 non-null int64 yr 731 non-null int64 mnth 731 non-null int64 holiday 731 non-null int64 weekday 731 non-null int64 workingday 731 non-null int64 weathersit 731 non-null int64 temp 731 non-null float64 atemp 731 non-null float64 hum 731 non-null float64 windspeed(mph) 731 non-null float64 windspeed(ms) 731 non-null float64 cnt 731 non-null int64 dtypes: float64(5), int64(8) memory usage: 74.3 KB
In [3]:data.describe()
Out[3]:seasonyrmnthholidayweekdayworkingdayweathersittempatemphumwindspeed(mph)windspeed(ms)cntcount731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000mean2.4965800.5006846.5198360.0287282.9972640.6839951.39534920.31077623.71769962.78940612.7625765.7052204504.348837std1.1108070.5003423.4519130.1671552.0047870.4652330.5448947.5050918.14805914.2429105.1923572.3211251937.211452min1.0000000.0000001.0000000.0000000.0000000.0000001.0000002.4243463.9534800.0000001.5002440.67065022.00000025%2.0000000.0000004.0000000.0000001.0000000.0000001.00000013.82042416.89212552.0000009.0416504.0418643152.00000050%3.0000001.0000007.0000000.0000003.0000001.0000001.00000020.43165324.33665062.66670012.1253255.4203514548.00000075%3.0000001.00000010.0000000.0000005.0000001.0000002.00000026.87207730.43010073.02085015.6253716.9849675956.000000max4.0000001.00000012.0000001.0000006.0000001.0000003.00000035.32834742.04480097.25000034.00002115.1989378714.000000
In [4]:data.head()
Out[4]:seasonyrmnthholidayweekdayworkingdayweathersittempatemphumwindspeed(mph)windspeed(ms)cnt0101060214.11084718.1812580.583310.7498824.8054909851101000214.90259817.6869569.608716.6521137.443949801210101118.0509249.4702543.727316.6367037.4370601349310102118.20000010.6061059.043510.7398324.8009981562410103119.30523711.4635043.695712.5223005.5978101600
In [5]:data.dropna(inplace=True)
In [17]:fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 10)) for idx, feature in enumerate(data.columns.values[:-1]): data.plot(feature, 'cnt', subplots=True, kind='scatter', ax=axes[int(idx / 4), idx % 4], c='#87486e');
The plot above shows that there is a linear dependence between temp, atemp and cnt features. The correlations below confirm that observation.
In [7]:data.iloc[:, :12].corrwith(data['cnt'])
Out[7]:season 0.406100 yr 0.566710 mnth 0.279977 holiday -0.068348 weekday 0.067443 workingday 0.061156 weathersit -0.297391 temp 0.627494 atemp 0.631066 hum -0.100659 windspeed(mph) -0.234545 windspeed(ms) -0.234545 dtype: float64
In [8]:plt.figure(figsize=(15, 5)) sns.heatmap(data[['temp', 'atemp', 'hum', 'windspeed(mph)', 'windspeed(ms)', 'cnt']].corr(), annot=True, fmt='1.4f');
There is a strong correlation between temp and atemp, as well as windspeed(mph) and windspeed(ms) features, due to the fact that they represent similar metrics in different measures. In further analysis two of those features must be dropped or applyed with penalty (L2 or Lasso regression).
In [9]:predictors = data.iloc[:, :12] target = data['cnt']
In [10]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)
In [11]:model = LassoLarsCV(cv=10, precompute=False).fit(predictors_train, target_train)
In [12]:dict(zip(predictors.columns, model.coef_))
Out[12]:{'atemp': 63.56915200306693, 'holiday': -282.431748735072, 'hum': -12.815264427009353, 'mnth': 0.0, 'season': 381.77762475080044, 'temp': 58.035647703871234, 'weathersit': -514.6381162101678, 'weekday': 69.84812053893549, 'windspeed(mph)': 0.0, 'windspeed(ms)': -95.71090321577515, 'workingday': 36.15135752613271, 'yr': 2091.5182927517903}
In [13]:log_alphas =-np.log10(model.alphas_) plt.figure(figsize=(10, 5)) for idx, feature in enumerate(predictors.columns): plt.plot(log_alphas, list(map(lambda r: r[idx], model.coef_path_.T)), label=feature) plt.legend(loc="upper right", bbox_to_anchor=(1.4, 0.95)) plt.xlabel("-log10(alpha)") plt.ylabel("Feature weight") plt.title("Lasso");
In [14]:log_cv_alphas =-np.log10(model.cv_alphas_) plt.figure(figsize=(10, 5)) plt.plot(log_cv_alphas, model.mse_path_, ':') plt.plot(log_cv_alphas, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.legend() plt.xlabel('-log10(alpha)') plt.ylabel('Mean squared error') plt.title('Mean squared error on each fold');
In [16]:rsquared_train = model.score(predictors_train, target_train) rsquared_test = model.score(predictors_test, target_test) print('Train data R-square', rsquared_train) print('Test data R-square', rsquared_test) Train data R-square 0.7899877818517489 Test data R-square 0.8131871527614188
0 notes
samrats · 3 years ago
Text
Task
The second assignment deals with Random Forests. Random forests are predictive models that allow for a data driven exploration of many explanatory variables in predicting a response or target variable. Random forests provide importance scores for each explanatory variable and also allow you to evaluate any increases in correct classification with the growing of smaller and larger number of trees.
Run a Random Forest.
You will need to perform a random forest analysis to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable.
Data
The dataset is related to red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Dataset can be found at UCI Machine Learning Repository
Attribute Information (For more information, read [Cortez et al., 2009]): Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Results
Random forest and ExtraTrees classifier were deployed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable - red wine quality (score between 0 and 10). The following explanatory variables were included: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
The explanatory variables with the highest importance score (evaluated by both classifiers) are alcohol, volatile acidity, sulphates. The accuracy of the Random forest and ExtraTrees clasifier is about 67%, which is quite good for highly unbalanced and hardly distinguished from each other classes. The subsequent growing of multiple trees rather than a single tree, adding a lot to the overall score of the model. For Random forest the number of estimators is 20, while for ExtraTrees classifier - 12, because the second classifier grows up much faster.
Code
In [1]:import pandas as pd import numpy as np import matplotlib.pylab as plt from sklearn.model_selection import train_test_split, cross_val_score from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.manifold import MDS from sklearn.metrics.pairwise import pairwise_distances from sklearn.metrics import accuracy_score import seaborn as sns %matplotlib inline rnd_state = 4536
In [2]:data = pd.read_csv('Data\winequality-red.csv', sep=';') data.info() RangeIndex: 1599 entries, 0 to 1598 Data columns (total 12 columns): fixed acidity 1599 non-null float64 volatile acidity 1599 non-null float64 citric acid 1599 non-null float64 residual sugar 1599 non-null float64 chlorides 1599 non-null float64 free sulfur dioxide 1599 non-null float64 total sulfur dioxide 1599 non-null float64 density 1599 non-null float64 pH 1599 non-null float64 sulphates 1599 non-null float64 alcohol 1599 non-null float64 quality 1599 non-null int64 dtypes: float64(11), int64(1) memory usage: 150.0 KB
In [3]:data.head()
Out[3]:fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholquality07.40.700.001.90.07611.034.00.99783.510.569.4517.80.880.002.60.09825.067.00.99683.200.689.8527.80.760.042.30.09215.054.00.99703.260.659.85311.20.280.561.90.07517.060.00.99803.160.589.8647.40.700.001.90.07611.034.00.99783.510.569.45
In [4]:data.describe()
Out[4]:fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholqualitycount1599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.000000mean8.3196370.5278210.2709762.5388060.08746715.87492246.4677920.9967473.3111130.65814910.4229835.636023std1.7410960.1790600.1948011.4099280.04706510.46015732.8953240.0018870.1543860.1695071.0656680.807569min4.6000000.1200000.0000000.9000000.0120001.0000006.0000000.9900702.7400000.3300008.4000003.00000025%7.1000000.3900000.0900001.9000000.0700007.00000022.0000000.9956003.2100000.5500009.5000005.00000050%7.9000000.5200000.2600002.2000000.07900014.00000038.0000000.9967503.3100000.62000010.2000006.00000075%9.2000000.6400000.4200002.6000000.09000021.00000062.0000000.9978353.4000000.73000011.1000006.000000max15.9000001.5800001.00000015.5000000.61100072.000000289.0000001.0036904.0100002.00000014.9000008.000000
Plots
For visualization purposes, the number of dimensions was reduced to two by applying MDS method with cosine distance. The plot illustrates that our classes are not clearly divided into parts.
In [5]:model = MDS(random_state=rnd_state, n_components=2, dissimilarity='precomputed') %time representation = model.fit_transform(pairwise_distances(data.iloc[:, :11], metric='cosine')) Wall time: 38.7 s
In [6]:colors = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"] plt.figure(figsize=(12, 4)) plt.subplot(121) plt.scatter(representation[:, 0], representation[:, 1], c=colors) plt.subplot(122) sns.countplot(x='quality', data=data, palette=sns.color_palette(colors));
Moreover, our classes are highly unbalanced, so in our classifier we should add parameter class_weight='balanced'.
In [7]:predictors = data.iloc[:, :11] target = data.quality
In [8]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)
RandomForest classifier
In [9]:list_estimators = list(range(1, 50, 5)) rf_scoring = [] for n_estimators in list_estimators: classifier = RandomForestClassifier(random_state = rnd_state, n_jobs =-1, class_weight='balanced', n_estimators=n_estimators) score = cross_val_score(classifier, predictors_train, target_train, cv=5, n_jobs=-1, scoring = 'accuracy') rf_scoring.append(score.mean())
In [10]:plt.plot(list_estimators, rf_scoring) plt.title('Accuracy VS trees number');
In [11]:classifier = RandomForestClassifier(random_state = rnd_state, n_jobs =-1, class_weight='balanced', n_estimators=20) classifier.fit(predictors_train, target_train)
Out[11]:RandomForestClassifier(bootstrap=True, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1, oob_score=False, random_state=4536, verbose=0, warm_start=False)
In [12]:prediction = classifier.predict(predictors_test)
In [13]:print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Predicted'], rownames=['Actual'], margins=True)) print('\nAccuracy: ', accuracy_score(target_test, prediction)) Confusion matrix: Predicted 3 4 5 6 7 All Actual 3 0 0 3 0 0 3 4 0 1 9 6 0 16 5 2 1 166 41 3 213 6 0 0 46 131 14 191 7 0 0 5 25 23 53 8 0 0 0 3 1 4 All 2 2 229 206 41 480 Accuracy: 0.66875
In [14]:feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[:11]).sort_values(ascending=False) feature_importance
Out[14]:volatile acidity 0.133023 alcohol 0.130114 sulphates 0.129498 citric acid 0.106427 total sulfur dioxide 0.094647 chlorides 0.086298 density 0.079843 pH 0.066566 residual sugar 0.061344 fixed acidity 0.058251 free sulfur dioxide 0.053990 dtype: float64
In [15]:et_scoring = [] for n_estimators in list_estimators: classifier = ExtraTreesClassifier(random_state = rnd_state, n_jobs =-1, class_weight='balanced', n_estimators=n_estimators) score = cross_val_score(classifier, predictors_train, target_train, cv=5, n_jobs=-1, scoring = 'accuracy') et_scoring.append(score.mean())
In [16]:plt.plot(list_estimators, et_scoring) plt.title('Accuracy VS trees number');
In [17]:classifier = ExtraTreesClassifier(random_state = rnd_state, n_jobs =-1, class_weight='balanced', n_estimators=12) classifier.fit(predictors_train, target_train)
Out[17]:ExtraTreesClassifier(bootstrap=False, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=12, n_jobs=-1, oob_score=False, random_state=4536, verbose=0, warm_start=False)
In [18]:prediction = classifier.predict(predictors_test)
In [19]:print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Predicted'], rownames=['Actual'], margins=True)) print('\nAccuracy: ', accuracy_score(target_test, prediction)) Confusion matrix: Predicted 3 4 5 6 7 8 All Actual 3 0 1 2 0 0 0 3 4 0 0 9 7 0 0 16 5 2 2 168 39 2 0 213 6 0 0 49 130 11 1 191 7 0 0 2 27 24 0 53 8 0 0 0 3 1 0 4 All 2 3 230 206 38 1 480 Accuracy: 0.6708333333333333
In [20]:feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[:11]).sort_values(ascending=False) feature_importance
Out[20]:alcohol 0.157267 volatile acidity 0.132768 sulphates 0.100874 citric acid 0.095077 density 0.082334 chlorides 0.079283 total sulfur dioxide 0.076803 pH 0.074638 fixed acidity 0.069826 residual sugar 0.066551 free sulfur dioxide 0.064579 dtype: float64
0 notes
samrats · 3 years ago
Text
Task
This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is a called a classification tree. Explanatory variables can be either quantitative, categorical or both. Decision trees create segmentations or subgroups in the data, by applying a series of simple rules or criteria over and over again which choose variable constellations that best predict the response (i.e. target) variable.
Run a Classification Tree.
You will need to perform a decision tree analysis to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable.
Data
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
Dataset can be found at UCI Machine Learning Repository
In this Assignment the Decision tree has been applied to classification of breast cancer detection.
Attribute Information:
id - ID number
diagnosis (M = malignant, B = benign)
3-32 extra features
Ten real-valued features are computed for each cell nucleus: a) radius (mean of distances from center to points on the perimeter) b) texture (standard deviation of gray-scale values) c) perimeter d) area e) smoothness (local variation in radius lengths) f) compactness (perimeter^2 / area - 1.0) g) concavity (severity of concave portions of the contour) h) concave points (number of concave portions of the contour) i) symmetry j) fractal dimension ("coastline approximation" - 1)
All feature values are recoded with four significant digits. Missing attribute values: none Class distribution: 357 benign, 212 malignant
Results
Generated decision tree can be found below:
In [17]:img
Out[17]:
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable (breast cancer diagnosis: malignant or benign).
The dataset was splitted into train and test samples in ratio 70\30.
After fitting the classifier the key metrics were calculated - confusion matrix and accuracy = 0.924. This is a good result for a model trained on a small dataset.
From decision tree we can observe:
The malignant tumor is tend to have much more visible affected areas, texture and concave points, while the benign's characteristics are significantly lower.
The most important features are:
concave points_worst = 0.707688
area_worst = 0.114771
concave points_mean = 0.034234
fractal_dimension_se = 0.026301
texture_worst = 0.026300
area_se = 0.025201
concavity_se = 0.024540
texture_mean = 0.023671
perimeter_mean = 0.010415
concavity_mean = 0.006880
Code
In [1]:import pandas as pd import numpy as np from sklearn.metrics import*from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn import tree from io import StringIO from IPython.display import Image import pydotplus from sklearn.manifold import TSNE from matplotlib import pyplot as plt %matplotlib inline rnd_state = 23468
Load data
In [2]:data = pd.read_csv('Data/breast_cancer.csv') data.info() RangeIndex: 569 entries, 0 to 568 Data columns (total 33 columns): id 569 non-null int64 diagnosis 569 non-null object radius_mean 569 non-null float64 texture_mean 569 non-null float64 perimeter_mean 569 non-null float64 area_mean 569 non-null float64 smoothness_mean 569 non-null float64 compactness_mean 569 non-null float64 concavity_mean 569 non-null float64 concave points_mean 569 non-null float64 symmetry_mean 569 non-null float64 fractal_dimension_mean 569 non-null float64 radius_se 569 non-null float64 texture_se 569 non-null float64 perimeter_se 569 non-null float64 area_se 569 non-null float64 smoothness_se 569 non-null float64 compactness_se 569 non-null float64 concavity_se 569 non-null float64 concave points_se 569 non-null float64 symmetry_se 569 non-null float64 fractal_dimension_se 569 non-null float64 radius_worst 569 non-null float64 texture_worst 569 non-null float64 perimeter_worst 569 non-null float64 area_worst 569 non-null float64 smoothness_worst 569 non-null float64 compactness_worst 569 non-null float64 concavity_worst 569 non-null float64 concave points_worst 569 non-null float64 symmetry_worst 569 non-null float64 fractal_dimension_worst 569 non-null float64 Unnamed: 32 0 non-null float64 dtypes: float64(31), int64(1), object(1) memory usage: 146.8+ KB
In the output above there is an empty column 'Unnamed: 32', so next it should be dropped.
In [3]:data.drop('Unnamed: 32', axis=1, inplace=True) data.diagnosis = np.where(data.diagnosis=='M', 1, 0) # Decode diagnosis into binary data.describe()
Out[3]:iddiagnosisradius_meantexture_meanperimeter_meanarea_meansmoothness_meancompactness_meanconcavity_meanconcave points_mean...radius_worsttexture_worstperimeter_worstarea_worstsmoothness_worstcompactness_worstconcavity_worstconcave points_worstsymmetry_worstfractal_dimension_worstcount5.690000e+02569.000000569.000000569.000000569.000000569.000000569.000000569.000000569.000000569.000000...569.000000569.000000569.000000569.000000569.000000569.000000569.000000569.000000569.000000569.000000mean3.037183e+070.37258314.12729219.28964991.969033654.8891040.0963600.1043410.0887990.048919...16.26919025.677223107.261213880.5831280.1323690.2542650.2721880.1146060.2900760.083946std1.250206e+080.4839183.5240494.30103624.298981351.9141290.0140640.0528130.0797200.038803...4.8332426.14625833.602542569.3569930.0228320.1573360.2086240.0657320.0618670.018061min8.670000e+030.0000006.9810009.71000043.790000143.5000000.0526300.0193800.0000000.000000...7.93000012.02000050.410000185.2000000.0711700.0272900.0000000.0000000.1565000.05504025%8.692180e+050.00000011.70000016.17000075.170000420.3000000.0863700.0649200.0295600.020310...13.01000021.08000084.110000515.3000000.1166000.1472000.1145000.0649300.2504000.07146050%9.060240e+050.00000013.37000018.84000086.240000551.1000000.0958700.0926300.0615400.033500...14.97000025.41000097.660000686.5000000.1313000.2119000.2267000.0999300.2822000.08004075%8.813129e+061.00000015.78000021.800000104.100000782.7000000.1053000.1304000.1307000.074000...18.79000029.720000125.4000001084.0000000.1460000.3391000.3829000.1614000.3179000.092080max9.113205e+081.00000028.11000039.280000188.5000002501.0000000.1634000.3454000.4268000.201200...36.04000049.540000251.2000004254.0000000.2226001.0580001.2520000.2910000.6638000.207500
8 rows × 32 columns
In [4]:data.head()
Out[4]:iddiagnosisradius_meantexture_meanperimeter_meanarea_meansmoothness_meancompactness_meanconcavity_meanconcave points_mean...radius_worsttexture_worstperimeter_worstarea_worstsmoothness_worstcompactness_worstconcavity_worstconcave points_worstsymmetry_worstfractal_dimension_worst0842302117.9910.38122.801001.00.118400.277600.30010.14710...25.3817.33184.602019.00.16220.66560.71190.26540.46010.118901842517120.5717.77132.901326.00.084740.078640.08690.07017...24.9923.41158.801956.00.12380.18660.24160.18600.27500.08902284300903119.6921.25130.001203.00.109600.159900.19740.12790...23.5725.53152.501709.00.14440.42450.45040.24300.36130.08758384348301111.4220.3877.58386.10.142500.283900.24140.10520...14.9126.5098.87567.70.20980.86630.68690.25750.66380.17300484358402120.2914.34135.101297.00.100300.132800.19800.10430...22.5416.67152.201575.00.13740.20500.40000.16250.23640.07678
5 rows × 32 columns
Plots
For visualization purposes, the number of dimensions was reduced to two by applying t-SNE method. The plot illustrates that our classes are not clearly divided into two parts, so the nonlinear methods (like Decision tree) may solve this problem.
In [15]:model = TSNE(random_state=rnd_state, n_components=2) representation = model.fit_transform(data.iloc[:, 2:])
In [16]:plt.scatter(representation[:, 0], representation[:, 1], c=data.diagnosis, alpha=0.5, cmap=plt.cm.get_cmap('Set1', 2)) plt.colorbar(ticks=range(2));
Decision tree
In [6]:predictors = data.iloc[:, 2:] target = data.diagnosis
To train a Decision tree the dataset was splitted into train and test samples in proportion 70/30.
In [7]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)
In [8]:print('predictors_train:', predictors_train.shape) print('predictors_test:', predictors_test.shape) print('target_train:', target_train.shape) print('target_test:', target_test.shape) predictors_train: (398, 30) predictors_test: (171, 30) target_train: (398,) target_test: (171,)
In [9]:print(np.sum(target_train==0)) print(np.sum(target_train==1)) 253 145
Our train sample is quite balanced, so there is no need in balancing it.
In [10]:classifier = DecisionTreeClassifier(random_state = rnd_state).fit(predictors_train, target_train)
In [11]:prediction = classifier.predict(predictors_test)
In [12]:print('Confusion matrix:\n', pd.crosstab(target_test, prediction, colnames=['Actual'], rownames=['Predicted'], margins=True)) print('\nAccuracy: ', accuracy_score(target_test, prediction)) Confusion matrix: Actual 0 1 All Predicted 0 96 8 104 1 5 62 67 All 101 70 171 Accuracy: 0.9239766081871345
In [13]:out = StringIO() tree.export_graphviz(classifier, out_file = out, feature_names = predictors_train.columns.values, proportion =True, filled =True) graph = pydotplus.graph_from_dot_data(out.getvalue()) img = Image(data = graph.create_png()) with open('output.png', 'wb') as f: f.write(img.data)
In [14]:feature_importance = pd.Series(classifier.feature_importances_, index=data.columns.values[2:]).sort_values(ascending=False) feature_importance
Out[14]:concave points_worst 0.707688 area_worst 0.114771 concave points_mean 0.034234 fractal_dimension_se 0.026301 texture_worst 0.026300 area_se 0.025201 concavity_se 0.024540 texture_mean 0.023671 perimeter_mean 0.010415 concavity_mean 0.006880 fractal_dimension_worst 0.000000 fractal_dimension_mean 0.000000 symmetry_mean 0.000000 compactness_mean 0.000000 texture_se 0.000000 smoothness_mean 0.000000 area_mean 0.000000 radius_se 0.000000 smoothness_se 0.000000 perimeter_se 0.000000 symmetry_worst 0.000000 compactness_se 0.000000 concave points_se 0.000000 symmetry_se 0.000000 radius_worst 0.000000 perimeter_worst 0.000000 smoothness_worst 0.000000 compactness_worst 0.000000 concavity_worst 0.000000 radius_mean 0.000000 dtype: float64
1 note · View note