Running a Classification Tree
Building a Decision Tree Model
In this project assignment I will explore non-linear relationships between a series of explanatory variables and a binary categorical response variable.
Decision Tree Classification
First, set up the environment by importing all the libraries needed for the analysis.
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
!pip install graphviz
!pip install pydotplus
==============
Load the Data and Perform Data Engineering
Sample_Data = pd.read_csv("Tree.csv")
# Convert all column names in the dataset to upper case
Sample_Data.columns = map(str.upper, Sample_Data.columns)
Data = Sample_Data.rename(columns={"BIO_SEX": "MALE"})
Data.head()
[Output: first rows of the dataset]
# Examine data types and summary statistics
Data_Clean = Data.dropna()
print("Display Data Type:\n", Data_Clean.dtypes)
print("===============================================")
[Output: column data types]
Data_Clean.describe()
[Output: summary statistics of the cleaned data]
# Set predictor and target variables
# Note: VIOL1 appears both here and as the target below; keeping it among the
# predictors leaks the target and will inflate the accuracy scores reported later
predictors = Data_Clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                         'AGE','ALCEVR1','ALCPROBS1','TREG1','DEP1','ESTEEM1','VIOL1',
                         'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV']]
# We will use violence (VIOL1) as the target variable
targets = Data_Clean.VIOL1
# Splitting Data
To understand model performance, it is good practice to divide the dataset into a training set and a test set. We split the data with the train_test_split() function, passing three parameters: the predictors, the target, and the test-set size. Here the size ratio is 60% for the training sample and 40% for the test sample.
Python Code:
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=1)
# Check the shapes of the training and test samples
print("pred_train:",pred_train.shape)
print("pred_test:",pred_test.shape)
print ("tar_train:", tar_train.shape)
print ("tar_test:",tar_test.shape)
[Output: shapes of the training and test samples]
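As an aside, when the target classes are imbalanced, train_test_split can also preserve the class proportions in both splits through its stratify argument. A minimal sketch of that variant (it would replace the split above if run, and it assumes every class of VIOL1 occurs at least twice):
Python Code:
# Optional variant: keep the class balance of the target in both splits
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=1, stratify=targets)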
Building the Decision Tree Model
Let's create a decision tree model using scikit-learn.
Python Code :
# Create a decision tree classifier
clf = DecisionTreeClassifier()
# Train the decision tree on the training data
clf = clf.fit(pred_train, tar_train)
# Predict the response for the test dataset (this is y_pred)
predictions = clf.predict(pred_test)
Evaluating the Model
Let's estimate how accurately the classifier can predict violence (VIOL1). Accuracy is assessed by comparing the actual test-set values with the predicted values.
Python Code:
# Model accuracy: how often is the classifier correct?
print(metrics.confusion_matrix(tar_test, predictions))
print("Model_Accuracy:", metrics.accuracy_score(tar_test, predictions))
# Our accuracy came to .9284, which can be considered a very strong prediction score.
[Output: confusion matrix and accuracy score]
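Beyond overall accuracy, the classification_report imported earlier gives per-class precision, recall, and F1; a minimal sketch:
Python Code:
# Per-class precision/recall/F1 for the test predictions
print(classification_report(tar_test, predictions))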
Visualizing Decision Trees
We are going to use scikit-learn's export_graphviz function to display the tree in a Jupyter notebook. For plotting the tree we also need to install graphviz and pydotplus: export_graphviz converts the decision tree classifier into a DOT file, and pydotplus converts this file to a PNG or another displayable form in Jupyter.
!pip install graphviz
!pip install pydotplus
# A limitation of working with trees in Python is that the sklearn library does not currently support post-pruning of a fitted tree. We are left with an overfit decision tree in which many branches and leaves likely add little to our predictions and accuracy.
# For exploratory purposes it can be helpful to test a small number of variables in order to first get a feel for the decision tree output.
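(As an aside: scikit-learn 0.22 and later, released after the version used in this post, do add minimal cost-complexity post-pruning through the ccp_alpha parameter. A minimal sketch, assuming a recent scikit-learn and an arbitrary mid-path choice of alpha:)
Python Code:
# Sketch only: cost-complexity post-pruning (requires scikit-learn >= 0.22)
path = clf.cost_complexity_pruning_path(pred_train, tar_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary alpha, for illustration only
pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(pred_train, tar_train)
print("Pruned accuracy:", metrics.accuracy_score(tar_test, pruned.predict(pred_test)))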
# First import all needed dependencies
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from sklearn import tree
import pydotplus
Python Code:
out = StringIO()
# Export the tree to Graphviz DOT format
tree.export_graphviz(clf, out_file=out)
# Build the picture of our decision tree from the DOT data
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
[Image: the full, unpruned decision tree]
We can optimize the decision tree classifier with scikit-learn by performing pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning. In the following example we fit a decision tree on the same data with max_depth=4. Besides pre-pruning parameters, we can also try a different attribute-selection measure, entropy.
# Create a pre-pruned decision tree classifier using entropy
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)
clf = clf.fit(pred_train, tar_train)
# Predict the response for the test dataset (this is y_pred)
predictions = clf.predict(pred_test)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(tar_test, predictions))
[Output: accuracy of the pre-pruned tree]
Even though we already had a very strong prediction score of .9284, there was clearly room to improve model accuracy: the pre-pruned tree reaches .9967, which is better than the previous model.
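To see how the choice of max_depth affects test accuracy, one can sweep over several candidate depths; a minimal sketch (the range of depths is an arbitrary choice for illustration):
Python Code:
# Sketch: compare test accuracy across candidate tree depths
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    candidate = candidate.fit(pred_train, tar_train)
    print(depth, round(metrics.accuracy_score(tar_test, candidate.predict(pred_test)), 4))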
# Let's visualize this pruned tree, since a small tree can help us understand more about our classification model.
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from sklearn import tree
import pydotplus
out = StringIO()
tree.export_graphviz(clf, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
[Image: the pre-pruned decision tree]
Strengths of the decision tree approach:
1. Can select, from among a large number of variables, those variables and interactions that are most important in determining the target response variable to be explained.
2. Decision trees are able to generate understandable rules.
3. Decision trees perform classification without requiring much computation.
4. Can handle large datasets and can predict both binary and categorical target variables (shown in this example) as well as quantitative target variables (known as regression trees).
Weaknesses of decision tree methods:
1. Small changes in the data can lead to different splits, and this can undermine the interpretability of the model. Decision trees are also not very reproducible on future data.
2. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
3. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
4. Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
References: Tom Mitchell, Machine Learning, McGraw Hill, 1997.
https://www.geeksforgeeks.org/decision-tree/
WEEK THREE 
Running Lasso Regression Analysis 
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.linear_model import LassoLarsCV
# Load the dataset
Data = pd.read_csv('Tree.csv')
Data
[Output: the raw dataset]
# Upper-case all DataFrame column names
Data.columns = map(str.upper, Data.columns)
# Data management: drop missing values and recode BIO_SEX (1/2) to MALE (1/0)
Data_Clean = Data.dropna()
recode1 = {1: 1, 2: 0}
Data_Clean['MALE'] = Data_Clean['BIO_SEX'].map(recode1)
# Select predictor variables and target variable as separate datasets
predvar = Data_Clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                      'AGE','ALCEVR1','ALCPROBS1','MAREVER1','COCEVER1','INHEVER1',
                      'CIGAVAIL','DEP1','ESTEEM1','VIOL1','PASSIST','DEVIANT1',
                      'GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]
target = Data_Clean.SCHCONN1
# DATA MANAGEMENT
# Standardize predictors to have mean=0 and sd=1
from sklearn import preprocessing
predictors = predvar.copy()
# The original code scaled each column with its own line and skipped BLACK,
# likely an oversight; this loop scales every predictor column
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))
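An equivalent one-step alternative is scikit-learn's StandardScaler, which applies the same mean-0/sd-1 scaling to every column at once; a minimal sketch:
Python Code:
# Equivalent alternative: scale all predictor columns in one call
from sklearn.preprocessing import StandardScaler
predictors = DataFrame(StandardScaler().fit_transform(predvar.astype('float64')), columns=predvar.columns, index=predvar.index)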
# TESTING A LASSO REGRESSION MODEL IN PYTHON
# Split data into train and test sets (70% train, 30% test)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
# Specify the lasso regression model with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
# Print Variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
[Output: variable names and regression coefficients]
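Because the lasso shrinks some coefficients exactly to zero, it can also help to list only the variables that were actually retained; a minimal sketch:
Python Code:
# Variables the lasso kept (nonzero coefficients)
selected = {name: coef for name, coef in zip(predictors.columns, model.coef_) if coef != 0}
print(selected)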
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
[Output: training and test MSE]
# R-squared from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training data R-squared')
print(rsquared_train)
print('Test data R-squared')
print(rsquared_test)
# Plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression coefficient progression for lasso paths')
[Image: regression coefficient progression for the lasso paths]
The plot above shows which predictors are selected at each stage. As each new predictor enters the model, the regression coefficients of the earlier predictors change slightly, and the plot records the stage at which each variable entered the model. The variable with the largest regression coefficient, self-esteem, entered the model first, and the others followed in order from the largest to the smallest coefficient.
# Plot mean squared error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
[Image: mean squared error on each cross-validation fold]