Running a Classification Tree
Building a Decision Tree Model
In this project assignment I will explore non-linear relationships between a series of explanatory variables and a binary categorical response variable.
Decision Tree Classification
First, set up the environment by importing all the libraries needed for the analysis.
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
!pip install graphviz
!pip install pydotplus
==============
Load the Data and Perform Data Engineering
Sample_Data = pd.read_csv("Tree.csv")
# Convert all column names in the dataset to upper case
Sample_Data.columns = map(str.upper, Sample_Data.columns)
Data = Sample_Data.rename(columns={"BIO_SEX": "MALE"})
Data.head()
[Output: first rows of the dataset]
# Examine data types and summary statistics
Data_Clean = Data.dropna()
print("Display Data Type:\n", Data_Clean.dtypes)
print("===============================================")
[Output: column data types]
Data_Clean.describe()
[Output: summary statistics of the cleaned data]
# Set predictor and target variables
# Note: VIOL1 appears both here and as the target below; keeping it among the
# predictors leaks the target and will inflate the accuracy scores reported later
predictors = Data_Clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                         'AGE','ALCEVR1','ALCPROBS1','TREG1','DEP1','ESTEEM1','VIOL1',
                         'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV']]
# We will use violence (VIOL1) as the target variable
targets = Data_Clean.VIOL1
# Splitting Data
To understand model performance, it is good practice to divide the dataset into a training set and a test set. We split the data with the train_test_split() function, passing three parameters: the predictors, the target, and the test-set size. Here the size ratio is 60% for the training sample and 40% for the test sample.
Python Code:
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=1)
# Check the shapes of the training and test samples
print("pred_train:",pred_train.shape)
print("pred_test:",pred_test.shape)
print ("tar_train:", tar_train.shape)
print ("tar_test:",tar_test.shape)
[Output: shapes of the training and test samples]
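As an aside, when the target classes are imbalanced, train_test_split can also preserve the class proportions in both splits through its stratify argument. A minimal sketch of that variant (it would replace the split above if run, and it assumes every class of VIOL1 occurs at least twice):
Python Code:
# Optional variant: keep the class balance of the target in both splits
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=1, stratify=targets)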
Building the Decision Tree Model
Let's create a decision tree model using scikit-learn.
Python Code :
# Create a decision tree classifier
clf = DecisionTreeClassifier()
# Train the decision tree on the training data
clf = clf.fit(pred_train, tar_train)
# Predict the response for the test dataset (this is y_pred)
predictions = clf.predict(pred_test)
Evaluating the Model
Let's estimate how accurately the classifier can predict violence (VIOL1). Accuracy is assessed by comparing the actual test-set values with the predicted values.
Python Code:
# Model accuracy: how often is the classifier correct?
print(metrics.confusion_matrix(tar_test, predictions))
print("Model_Accuracy:", metrics.accuracy_score(tar_test, predictions))
# Our accuracy came to .9284, which can be considered a very strong prediction score.
[Output: confusion matrix and accuracy score]
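Beyond overall accuracy, the classification_report imported earlier gives per-class precision, recall, and F1; a minimal sketch:
Python Code:
# Per-class precision/recall/F1 for the test predictions
print(classification_report(tar_test, predictions))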
Visualizing Decision Trees
We are going to use scikit-learn's export_graphviz function to display the tree in a Jupyter notebook. For plotting the tree we also need to install graphviz and pydotplus: export_graphviz converts the decision tree classifier into a DOT file, and pydotplus converts this file to a PNG or another displayable form in Jupyter.
!pip install graphviz
!pip install pydotplus
# A limitation of working with trees in Python is that the sklearn library does not currently support post-pruning of a fitted tree. We are left with an overfit decision tree in which many branches and leaves likely add little to our predictions and accuracy.
# For exploratory purposes it can be helpful to test a small number of variables in order to first get a feel for the decision tree output.
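(As an aside: scikit-learn 0.22 and later, released after the version used in this post, do add minimal cost-complexity post-pruning through the ccp_alpha parameter. A minimal sketch, assuming a recent scikit-learn and an arbitrary mid-path choice of alpha:)
Python Code:
# Sketch only: cost-complexity post-pruning (requires scikit-learn >= 0.22)
path = clf.cost_complexity_pruning_path(pred_train, tar_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary alpha, for illustration only
pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(pred_train, tar_train)
print("Pruned accuracy:", metrics.accuracy_score(tar_test, pruned.predict(pred_test)))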
# First import all needed dependencies
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from sklearn import tree
import pydotplus
Python Code:
out = StringIO()
# Export the tree to Graphviz DOT format
tree.export_graphviz(clf, out_file=out)
# Build the picture of our decision tree from the DOT data
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
[Image: the full, unpruned decision tree]
We can optimize the decision tree classifier with scikit-learn by performing pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning. In the following example we fit a decision tree on the same data with max_depth=4. Besides pre-pruning parameters, we can also try a different attribute-selection measure, entropy.
# Create a pre-pruned decision tree classifier using entropy
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)
clf = clf.fit(pred_train, tar_train)
# Predict the response for the test dataset (this is y_pred)
predictions = clf.predict(pred_test)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(tar_test, predictions))
[Output: accuracy of the pre-pruned tree]
Even though we already had a very strong prediction score of .9284, there was clearly room to improve model accuracy: the pre-pruned tree reaches .9967, which is better than the previous model.
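To see how the choice of max_depth affects test accuracy, one can sweep over several candidate depths; a minimal sketch (the range of depths is an arbitrary choice for illustration):
Python Code:
# Sketch: compare test accuracy across candidate tree depths
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    candidate = candidate.fit(pred_train, tar_train)
    print(depth, round(metrics.accuracy_score(tar_test, candidate.predict(pred_test)), 4))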
# Let's visualize this pruned tree, since a small tree can help us understand more about our classification model.
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from sklearn import tree
import pydotplus
out = StringIO()
tree.export_graphviz(clf, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
[Image: the pre-pruned decision tree]
Strengths of the decision tree approach:
1. Can select, from among a large number of variables, those variables and interactions that are most important in determining the target response variable to be explained.
2. Decision trees are able to generate understandable rules.
3. Decision trees perform classification without requiring much computation.
4. Can handle large datasets and can predict both binary and categorical target variables (shown in this example) as well as quantitative target variables (known as regression trees).
Weaknesses of decision tree methods:
1. Small changes in the data can lead to different splits, and this can undermine the interpretability of the model. Decision trees are also not very reproducible on future data.
2. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
3. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
4. Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
References: Tom Mitchell, Machine Learning, McGraw Hill, 1997.
https://www.geeksforgeeks.org/decision-tree/
WEEK THREE 
Running Lasso Regression Analysis 
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.linear_model import LassoLarsCV
# Load the dataset
Data = pd.read_csv('Tree.csv')
Data
[Output: the raw dataset]
# Upper-case all DataFrame column names
Data.columns = map(str.upper, Data.columns)
# Data management: drop missing values and recode BIO_SEX (1/2) to MALE (1/0)
Data_Clean = Data.dropna()
recode1 = {1: 1, 2: 0}
Data_Clean['MALE'] = Data_Clean['BIO_SEX'].map(recode1)
# Select predictor variables and target variable as separate datasets
predvar = Data_Clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                      'AGE','ALCEVR1','ALCPROBS1','MAREVER1','COCEVER1','INHEVER1',
                      'CIGAVAIL','DEP1','ESTEEM1','VIOL1','PASSIST','DEVIANT1',
                      'GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]
target = Data_Clean.SCHCONN1
# DATA MANAGEMENT
# Standardize predictors to have mean=0 and sd=1
from sklearn import preprocessing
predictors = predvar.copy()
# The original code scaled each column with its own line and skipped BLACK,
# likely an oversight; this loop scales every predictor column
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))
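An equivalent one-step alternative is scikit-learn's StandardScaler, which applies the same mean-0/sd-1 scaling to every column at once; a minimal sketch:
Python Code:
# Equivalent alternative: scale all predictor columns in one call
from sklearn.preprocessing import StandardScaler
predictors = DataFrame(StandardScaler().fit_transform(predvar.astype('float64')), columns=predvar.columns, index=predvar.index)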
# TESTING A LASSO REGRESSION MODEL IN PYTHON
# Split data into train and test sets (70% train, 30% test)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
# Specify the lasso regression model with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
# Print Variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
[Output: variable names and regression coefficients]
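Because the lasso shrinks some coefficients exactly to zero, it can also help to list only the variables that were actually retained; a minimal sketch:
Python Code:
# Variables the lasso kept (nonzero coefficients)
selected = {name: coef for name, coef in zip(predictors.columns, model.coef_) if coef != 0}
print(selected)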
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
[Output: training and test MSE]
# R-squared from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training data R-squared')
print(rsquared_train)
print('Test data R-squared')
print(rsquared_test)
# Plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression coefficient progression for lasso paths')
[Image: regression coefficient progression for the lasso paths]
The plot above shows which predictors are selected at each stage. As each new predictor enters the model, the regression coefficients of the earlier predictors change slightly, and the plot records the stage at which each variable entered the model. The variable with the largest regression coefficient, self-esteem, entered the model first, and the others followed in order from the largest to the smallest coefficient.
# Plot mean squared error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
[Image: mean squared error on each cross-validation fold]