k-means clustering
The chart below shows the dataset for 4,000 drivers, with the distance feature on the x-axis and speeding feature on the y-axis.
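The code below assumes the driver data is already loaded in a pandas dataframe `df`. A minimal loading sketch (the file name driver_data.csv is hypothetical):

import pandas as pd

# Hypothetical file name; the post assumes `df` already contains the
# Distance_Feature and Speeding_Feature columns for the 4,000 drivers.
df = pd.read_csv("driver_data.csv")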
import numpy as np
from sklearn.cluster import KMeans
### For the purposes of this example, we store feature data from our
### dataframe `df` in the `f1` and `f2` arrays. We combine these into
### a feature matrix `X` before passing it to the algorithm.
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X = np.column_stack((f1, f2))  # np.matrix(zip(...)) is Python 2 era and fails in Python 3
kmeans = KMeans(n_clusters=2).fit(X)
The chart below shows the results. Visually, you can see that the K-means algorithm splits the two groups based on the distance feature. Each cluster centroid is marked with a star.

Group 1 centroid = (50, 5.2)
Group 2 centroid = (180.3, 10.5)

Using domain knowledge of the dataset, we can infer that Group 1 is urban drivers and Group 2 is rural drivers.
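The centroids quoted above can be read directly off the fitted model; a minimal sketch (note that the ordering of the clusters is arbitrary):

# Each row of cluster_centers_ is one centroid: (distance, speeding)
print(kmeans.cluster_centers_)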
kmeans = KMeans(n_clusters=4).fit(X)
The chart below shows the resulting clusters. We see that the algorithm has identified four distinct groups; speeding drivers are now separated from those who follow speed limits, in addition to the rural vs. urban divide. The speeding threshold is lower for the urban driver group than for the rural drivers, likely because urban drivers spend more time in intersections and stop-and-go traffic.
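The charts themselves are not reproduced here. A minimal plotting sketch that colors drivers by cluster and marks each centroid with a star, assuming matplotlib is available:

import matplotlib.pyplot as plt

# Color each driver by its assigned cluster; overlay centroids as stars
plt.scatter(f1, f2, c=kmeans.labels_, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='*', s=200, c='black')
plt.xlabel('Distance_Feature')
plt.ylabel('Speeding_Feature')
plt.show()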
lasso regression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import load_breast_cancer
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
# print(cancer.keys())
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# print(cancer_df.head(3))

X = cancer.data
Y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=31)

# Default alpha=1: strong regularization, most coefficients driven to zero
lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)
print("training score:", train_score)
print("test score:", test_score)
print("number of features used:", coeff_used)

# alpha=0.01: weaker regularization, more features survive
lasso001 = Lasso(alpha=0.01, max_iter=1000000)
lasso001.fit(X_train, y_train)
train_score001 = lasso001.score(X_train, y_train)
test_score001 = lasso001.score(X_test, y_test)
coeff_used001 = np.sum(lasso001.coef_ != 0)
print("training score for alpha=0.01:", train_score001)
print("test score for alpha=0.01:", test_score001)
print("number of features used for alpha=0.01:", coeff_used001)

# alpha=0.0001: close to unregularized least squares, nearly all features kept
lasso00001 = Lasso(alpha=0.0001, max_iter=1000000)
lasso00001.fit(X_train, y_train)
train_score00001 = lasso00001.score(X_train, y_train)
test_score00001 = lasso00001.score(X_test, y_test)
coeff_used00001 = np.sum(lasso00001.coef_ != 0)
print("training score for alpha=0.0001:", train_score00001)
print("test score for alpha=0.0001:", test_score00001)
print("number of features used for alpha=0.0001:", coeff_used00001)

# Plot coefficient magnitudes for the three alphas
# (the `alpha` keyword in plt.plot is marker transparency, not the Lasso penalty)
plt.subplot(1, 2, 1)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5,
         color='red', label=r'Lasso; $\alpha = 1$', zorder=7)
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6,
         color='blue', label=r'Lasso; $\alpha = 0.01$')
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)

plt.subplot(1, 2, 2)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5,
         color='red', label=r'Lasso; $\alpha = 1$', zorder=7)
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6,
         color='blue', label=r'Lasso; $\alpha = 0.01$')
plt.plot(lasso00001.coef_, alpha=0.8, linestyle='none', marker='v', markersize=6,
         color='black', label=r'Lasso; $\alpha = 0.0001$')
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)
plt.tight_layout()
plt.show()
OUTPUT
training score: 0.5600974529893081
test score: 0.5832244618818156
number of features used: 4
training score for alpha=0.01: 0.7037865778498826
test score for alpha=0.01: 0.6641831577726227
number of features used for alpha=0.01: 10
training score for alpha=0.0001: 0.7754092006936699
test score for alpha=0.0001: 0.7318608210757909
number of features used for alpha=0.0001: 22
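For comparison, here is a minimal sketch of an unregularized least-squares baseline on the same split (not part of the original output); with no penalty, all 30 features are used:

from sklearn.linear_model import LinearRegression

# Plain least squares: no L1 penalty, no coefficients shrunk to zero
lr = LinearRegression().fit(X_train, y_train)
print("LR training score:", lr.score(X_train, y_train))
print("LR test score:", lr.score(X_test, y_test))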
random forest
Pima Indians diabetes data set
import pandas as pd

# list of column headers
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# open file with pd.read_csv
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)
print(df.shape)

# print head of data set
print(df.head())

X = df.drop("class", axis=1)
y = df["class"]

from sklearn.model_selection import train_test_split

# implementing train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

from sklearn.ensemble import RandomForestClassifier

# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# predictions
rfc_predict = rfc.predict(X_test)

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# 10-fold cross-validated AUC on the full data set
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring="roc_auc")

print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())
OUTPUTS
(768, 9)
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
=== Confusion Matrix ===
[[154  22]
 [ 34  44]]
=== Classification Report ===
             precision    recall  f1-score   support

          0       0.82      0.88      0.85       176
          1       0.67      0.56      0.61        78

avg / total       0.77      0.78      0.77       254
=== All AUC Scores ===
[0.73666667 0.85814815 0.82925926 0.73407407 0.71296296 0.78185185 0.81888889 0.85 0.75884615 0.82846154]
=== Mean AUC Score ===
Mean AUC Score - Random Forest: 0.7909159544159543
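Beyond the aggregate scores, the fitted forest also exposes per-feature importances. A minimal sketch of ranking them (not part of the original script):

# Rank features by the forest's impurity-based importance
for name, score in sorted(zip(X.columns, rfc.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")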
decision tree
This is a classification tree fit on the breast cancer data, predicting whether a tumor is malignant or benign on the basis of its size and shape features.
Decision tree diagram
#importing modules and libraries
'''
ABOUT DATA
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
radius (mean of distances from the center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter² / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
'''
import sys
import pandas as pd
import matplotlib
import numpy as np
import scipy as sp
import pydotplus
import IPython
import sklearn
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
# load datasets
cancer = load_breast_cancer()
#splitting data into test and train
'''
The data we use is usually split into training data and test data. The training set contains known outputs, and the model learns on this data so that it can generalize to other data later on. The test subset is held back so we can evaluate the model's predictions on data it has not seen.
We will do this with scikit-learn's train_test_split method.
'''
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# decision tree method
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
display(graphviz.Source(dot_graph))
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(np.arange(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
plot_feature_importances_cancer(tree)
OUTPUT:
Accuracy on training set: 1.000
Accuracy on test set: 0.937
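A training accuracy of 1.000 means the unpruned tree has memorized the training set. A common follow-up (a minimal sketch, not part of the original output) is to pre-prune by capping the depth and compare:

# Pre-pruned tree: limiting depth trades training fit for generalization
tree4 = DecisionTreeClassifier(max_depth=4, random_state=0)
tree4.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree4.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree4.score(X_test, y_test)))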
DOT DATA:
digraph Tree {
node [shape=box, style="filled", color="black"] ;
0 [label="worst radius <= 16.795\nsamples = 426\nvalue = [159, 267]\nclass = benign", fillcolor="#399de567"] ;
1 [label="worst concave points <= 0.136\nsamples = 284\nvalue = [25, 259]\nclass = benign", fillcolor="#399de5e6"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="radius error <= 1.048\nsamples = 252\nvalue = [4, 248]\nclass = benign", fillcolor="#399de5fb"] ;
1 -> 2 ;
3 [label="smoothness error <= 0.003\nsamples = 251\nvalue = [3, 248]\nclass = benign", fillcolor="#399de5fc"] ;
2 -> 3 ;
4 [label="mean texture <= 19.9\nsamples = 4\nvalue = [1, 3]\nclass = benign", fillcolor="#399de5aa"] ;
3 -> 4 ;
5 [label="samples = 3\nvalue = [0, 3]\nclass = benign", fillcolor="#399de5ff"] ;
4 -> 5 ;
6 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
4 -> 6 ;
7 [label="area error <= 48.7\nsamples = 247\nvalue = [2, 245]\nclass = benign", fillcolor="#399de5fd"] ;
3 -> 7 ;
8 [label="worst texture <= 33.35\nsamples = 243\nvalue = [1, 242]\nclass = benign", fillcolor="#399de5fe"] ;
7 -> 8 ;
9 [label="samples = 225\nvalue = [0, 225]\nclass = benign", fillcolor="#399de5ff"] ;
8 -> 9 ;
10 [label="worst texture <= 33.8\nsamples = 18\nvalue = [1, 17]\nclass = benign", fillcolor="#399de5f0"] ;
8 -> 10 ;
11 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
10 -> 11 ;
12 [label="samples = 17\nvalue = [0, 17]\nclass = benign", fillcolor="#399de5ff"] ;
10 -> 12 ;
13 [label="mean concavity <= 0.029\nsamples = 4\nvalue = [1, 3]\nclass = benign", fillcolor="#399de5aa"] ;
7 -> 13 ;
14 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
13 -> 14 ;
15 [label="samples = 3\nvalue = [0, 3]\nclass = benign", fillcolor="#399de5ff"] ;
13 -> 15 ;
16 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
2 -> 16 ;
17 [label="worst texture <= 25.62\nsamples = 32\nvalue = [21, 11]\nclass = malignant", fillcolor="#e5813979"] ;
1 -> 17 ;
18 [label="worst area <= 817.1\nsamples = 12\nvalue = [3, 9]\nclass = benign", fillcolor="#399de5aa"] ;
17 -> 18 ;
19 [label="mean smoothness <= 0.123\nsamples = 10\nvalue = [1, 9]\nclass = benign", fillcolor="#399de5e3"] ;
18 -> 19 ;
20 [label="samples = 9\nvalue = [0, 9]\nclass = benign", fillcolor="#399de5ff"] ;
19 -> 20 ;
21 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
19 -> 21 ;
22 [label="samples = 2\nvalue = [2, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
18 -> 22 ;
23 [label="worst symmetry <= 0.268\nsamples = 20\nvalue = [18, 2]\nclass = malignant", fillcolor="#e58139e3"] ;
17 -> 23 ;
24 [label="fractal dimension error <= 0.002\nsamples = 3\nvalue = [1, 2]\nclass = benign", fillcolor="#399de57f"] ;
23 -> 24 ;
25 [label="samples = 1\nvalue = [1, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
24 -> 25 ;
26 [label="samples = 2\nvalue = [0, 2]\nclass = benign", fillcolor="#399de5ff"] ;
24 -> 26 ;
27 [label="samples = 17\nvalue = [17, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
23 -> 27 ;
28 [label="texture error <= 0.473\nsamples = 142\nvalue = [134, 8]\nclass = malignant", fillcolor="#e58139f0"] ;
0 -> 28 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
29 [label="samples = 5\nvalue = [0, 5]\nclass = benign", fillcolor="#399de5ff"] ;
28 -> 29 ;
30 [label="worst concavity <= 0.191\nsamples = 137\nvalue = [134, 3]\nclass = malignant", fillcolor="#e58139f9"] ;
28 -> 30 ;
31 [label="worst texture <= 30.975\nsamples = 5\nvalue = [2, 3]\nclass = benign", fillcolor="#399de555"] ;
30 -> 31 ;
32 [label="samples = 3\nvalue = [0, 3]\nclass = benign", fillcolor="#399de5ff"] ;
31 -> 32 ;
33 [label="samples = 2\nvalue = [2, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
31 -> 33 ;
34 [label="samples = 132\nvalue = [132, 0]\nclass = malignant", fillcolor="#e58139ff"] ;
30 -> 34 ;
}