Week 1 Assignment - Running a Classification Tree
Introduction
In this assignment I am going to run a classification tree on my explanatory and response variables. My response variable is a person's confidence level about becoming wealthy. My explanatory variables are a person's gender, education level, age category, ethnicity (Non-Hispanic Black and Non-Hispanic White only), MSA status, and a binary annual income level (1 for an income higher than the population median income, 0 for an income at or below it). MSA stands for Metropolitan Statistical Area: a geographical region with a relatively high population density at its core and in its surrounding communities, comprising a population of at least 50,000 people. This variable is called PPMSACAT in my data set, where 0 represents non-metro living status and 1 represents metro living status.
The following tables show the statistics of my explanatory variables.
From these explanatory variables I have recoded the following two into binary categorical variables:
W1_P20 (Annual Income) to BI_INCOME_LEVEL: 0 means an income level less than or equal to the population median income, which falls in category 9 ($30,000 to $34,999), and 1 means an income higher than the population median.
-------------------------------------------------------------------------------------
Frequency Table of BI_INCOME_LEVEL
-------------------------------------------------------------------------------------
Frequency Perc.(%) Cumulative Frequency Cumulative Perc.(%)
<= Median 1011 49.56 1011 49.56
> Median 1029 50.44 2040 100.00
PPETHM (Ethnicity) to ETHN_BW: 0 means Non-Hispanic White ethnicity, and 1 means Non-Hispanic Black
-------------------------------------------------------------------------------------
Frequency Table of ETHN_BW
-------------------------------------------------------------------------------------
Frequency Perc.(%) Cumulative Frequency Cumulative Perc.(%)
White 800 39.22 800 39.22
Black 1240 60.78 2040 100.00
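The income recode above is a simple median split on the mapped income values. A minimal sketch of the idea, using a few hypothetical income values rather than the survey data (the real program also handles missing values separately):

```python
import pandas as pd

# Hypothetical mapped income values, not the survey data
income = pd.Series([2500, 32500, 45000, 80000, 32500])
median_inc = income.median()                # 32500 for this toy series

# 1 if strictly above the median, 0 otherwise (mirrors BI_INCOME_LEVEL)
bi_income = (income > median_inc).astype(int)
print(bi_income.tolist())                   # [0, 0, 1, 1, 0]
```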
My response variable originally looks as follows in the table W1_F4_D. However, because the version of the scikit-learn library I am using in Python offers no pruning capability, and to avoid overfitting the resulting tree with too many branches, I have decided to recode my response variable to a binary categorical value: 1 for a person who feels it is hard to become wealthy, 0 for a person who feels it is easy.
After performing a data management operation to condense it to two levels (Easy and Hard), the statistics of my response variable look as follows.
-------------------------------------------------------------------------------------
Frequency Table of 'To Become Wealthy'
-------------------------------------------------------------------------------------
Frequency Percentage(%) Cumulative Frequency Cumulative Percentage(%)
Easy 216 9.69 216 9.69
Hard 2013 90.31 2229 100.00
(Note: the overall frequency was reduced from 2,294 to 2,229 (= 216 + 2,013). That is because this statistic was derived after dropping all NaN values from my data frame, so quite a few rows of the original data frame were excluded from the count.)
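The condensing of the four response levels into two can be sketched with the same mapping the program uses (the example answers below are hypothetical, and -1 marks a missing answer that is filtered out later):

```python
import pandas as pd

# Mapping taken from the program: 1,2 -> Hard (1); 3,4 -> Easy (0); -1 = missing
wealthy_difficulty = {1: 1, 2: 1, 3: 0, 4: 0, -1: -1}

raw = pd.Series([1, 2, 3, 4, -1, 2])        # hypothetical survey answers
recoded = raw.map(wealthy_difficulty)
print(recoded.tolist())                      # [1, 1, 0, 0, -1, 1]
```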
Program:
The Python source is appended at the bottom of this report.
Analysis
I have run my Python program, and here is the result:
-------------------------------------------------------------------------------------
Training and Test Data Sets
-------------------------------------------------------------------------------------
pred_train: (1224, 6)
pred_test: (816, 6)
tar_train: (1224,)
tar_test: (816,)
The training sample has 1,224 observations (rows), 60% of the original sample, and 6 explanatory variables (columns). The test sample has 816 observations, the remaining 40% of the original sample, with the same 6 explanatory variables.
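The 60/40 proportions can be checked with a quick sketch on placeholder data of the same shape (2,040 rows by 6 columns; the values themselves do not matter here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the real sample: 2040 rows, 6 predictors
X = np.zeros((2040, 6))
y = np.zeros(2040)

# test_size=.4 reproduces the 60/40 split used in the program
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)
print(X_train.shape, X_test.shape)   # (1224, 6) (816, 6)
```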
-------------------------------------------------------------------------------------
Confusion Matrix and Accuracy Score
-------------------------------------------------------------------------------------
Confusion Matrix:
[[ 5 69]
[ 31 711]]
Accuracy Score: 0.8774509803921569
The confusion matrix shows the correct and incorrect classifications of our decision tree. The diagonal entries, 5 and 711, are the number of true negatives (correctly classified as feeling it is easy to become wealthy) and true positives (correctly classified as feeling it is hard), respectively. The 31 on the bottom left is the number of false negatives: respondents who feel it is hard to become wealthy classified as feeling it is easy. The 69 on the top right is the number of false positives: respondents who feel it is easy classified as feeling it is hard.
We can also look at the accuracy score of approximately 0.88, which suggests that the decision tree model classified about 88% of the test sample correctly as either feeling easy or feeling hard about becoming wealthy.
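The accuracy score can be verified directly from the confusion matrix: the diagonal entries are the correct classifications, so accuracy is their sum divided by the test-set size. A quick check with NumPy (not part of the assignment code):

```python
import numpy as np

# Confusion matrix reported above (rows: actual, columns: predicted)
conf_mat = np.array([[5, 69],
                     [31, 711]])

# Accuracy = correct classifications (the diagonal) / all test observations
correct = np.trace(conf_mat)   # 5 + 711 = 716
total = conf_mat.sum()         # 816 test rows
accuracy = correct / total
print(accuracy)                # 0.8774509803921569, matching the reported score
```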
The following is the resulting tree graph.
The tree starts with a split on X[1], my second explanatory variable, PPEDUCAT (a person's educational level), as depicted below:
Left branch and its child branches from the top node:
From the X[3] node (age category), another split is made on a person's ethnicity, variable X[2]: among those individuals with less than a Bachelor's degree (X[1] <= 3.5) and younger than approximately 40 (X[3] <= 3.5), 35 belong to the White ethnicity group (X[2] <= 0.5), while 486 belong to the Black ethnicity group.
From that X[2] node with 521 samples, we see that 10 individuals have less than a High School education (X[1] <= 1.5), while 220 have more.
The interpretation can be continued in this way for all hierarchical levels. By doing so, we can understand the overall characteristics and intrinsic dependency patterns in the samples, which is essential when building a data model from collected data.
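One way to make the graph easier to read is to pass feature_names (and class_names) to export_graphviz, so the nodes show PPEDUCAT instead of the opaque X[1]. A self-contained sketch on toy data; the real program would pass its fitted classifier instead of this stand-in:

```python
import numpy as np
from io import StringIO
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: column 1 (PPEDUCAT) perfectly separates the two classes
X = np.array([[0, 0, 0, 0, 0, 0]] * 10 + [[0, 4, 0, 0, 0, 0]] * 10)
y = np.array([0] * 10 + [1] * 10)
clf = DecisionTreeClassifier().fit(X, y)

# Same predictor order as in the program, so names line up with X[0]..X[5]
feature_names = ['PPGENDER', 'PPEDUCAT', 'ETHN_BW', 'PPAGECAT',
                 'PPMSACAT', 'BI_INCOME_LEVEL']
out = StringIO()
tree.export_graphviz(clf, out_file=out,
                     feature_names=feature_names,
                     class_names=['Easy', 'Hard'],
                     filled=True)              # color nodes by majority class
dot = out.getvalue()
print('PPEDUCAT' in dot)   # True: split nodes are now labeled by name
```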
------ Source Begin
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 06:14:38 2021
@author: ggonecrane
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics
from IPython.display import display, Image
from sklearn import tree
from io import StringIO
import pydotplus
#=============================================
# Methods Defined for Utility Purposes
#=============================================
def printTableLabel(label):
    print('\n')
    print('-------------------------------------------------------------------------------------')
    print(f'\t\t\t\t{label}')
    print('-------------------------------------------------------------------------------------')

def binary_income(value, median_inc):
    if value > median_inc:
        return 1
    if value > 0:
        return 0
    return value

def print_expVar_stat(data, exp_var, index_str, head_label):
    w_countsT = data[exp_var].value_counts().sort_index()
    w_percT = data[exp_var].value_counts(normalize=True).sort_index()
    df2 = w_countsT.to_frame()
    df2['perc'] = w_percT.apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2['cum_sum'] = w_countsT.cumsum()
    df2['cum_perc'] = w_percT.cumsum().apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2.index = index_str
    df2.columns = ['Frequency', 'Perc.(%)', 'Cumulative Frequency', 'Cumulative Perc.(%)']
    printTableLabel(f'Frequency Table of {head_label}')
    print(df2)

def print_responseVar_stat(data, response_var):
    w_countsT = data[response_var].value_counts().sort_index()
    w_percT = data[response_var].value_counts(normalize=True).sort_index()
    df2 = w_countsT.to_frame()
    df2['perc'] = w_percT.apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2['cum_sum'] = w_countsT.cumsum()
    df2['cum_perc'] = w_percT.cumsum().apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2.index = ['Easy', 'Hard']
    df2.columns = ['Frequency', 'Percentage(%)', 'Cumulative Frequency', 'Cumulative Percentage(%)']
    printTableLabel("Frequency Table of 'To Become Wealthy'")
    print(df2)
#=============================================
# Variable Declarations
#=============================================
mapped_income = 'MAPPED_INCOME'
income_col = 'W1_P20'
bi_income_col = 'BI_INCOME_LEVEL'
wealthy_col = 'Hard2B_Rich'
col_edu_col = 'College_Grad'
bw_ethn_col = 'ETHN_BW'
#=============================================
# Data Management Conversion Matrix
#=============================================
wealthy_difficulty = {1: 1, 2: 1, 3: 0, 4: 0, -1:-1}
col_educ = {1:0, 2: 0, 3: 1, 4: 1}
bw_ethn = {1:0, 2: 1, 3: -1, 4: -1, 5: -1}
approx_income = {1: 2500, 2: 6250, 3: 8750, 4: 11250, 5: 13750, 6: 17500, 7: 22500,
8: 27500, 9:32500, 10: 37500, 11: 45000, 12: 55000, 13: 67500, 14: 80000,
15: 92500, 16: 112500, 17: 137500, 18: 162500, 19: 200000, -1: 0}
#=============================================
# Data Loading and Management
#=============================================
data = pd.read_csv("OutlookLife.csv")
data[mapped_income]= data[income_col].map(approx_income)
median_inc = data[mapped_income].median()
# re-code annual income to bi-level (higher or lower than median value)
data[bi_income_col] = data[mapped_income].apply(lambda x: binary_income(x, median_inc) )
data[wealthy_col] = data['W1_F4_D'].map(wealthy_difficulty)
# data[col_edu_col] = data['PPEDUCAT'].map(col_educ)
data[bw_ethn_col] = data['PPETHM'].map(bw_ethn)
sub1 = data[(data[wealthy_col]>=0) & (data[bi_income_col]>=0) & (data[bw_ethn_col]>=0)].copy()
#=============================================
#Print Managed Variable Stats
#=============================================
print_expVar_stat(sub1, bi_income_col, ["<= Median", "> Median"], 'BI_INCOME_LEVEL')
print_expVar_stat(sub1, bw_ethn_col, ["White", "Black"], 'ETHN_BW')
print_responseVar_stat(sub1, wealthy_col)
#=============================================
# Modeling and Prediction
#=============================================
predictors = sub1[['PPGENDER', 'PPEDUCAT', bw_ethn_col, 'PPAGECAT',
                   'PPMSACAT', bi_income_col]]
targets = sub1[wealthy_col]
#Split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
a = pred_train.shape
b = pred_test.shape
c = tar_train.shape
d = tar_test.shape
printTableLabel('Training and Test Data Sets')
print(f'pred_train: {a}')
print(f'pred_test: {b}')
print(f'tar_train: {c}')
print(f'tar_test: {d}')
#=============================================
# Build model on training data
#=============================================
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
conf_mat = sklearn.metrics.confusion_matrix(tar_test,predictions)
accu_score = sklearn.metrics.accuracy_score(tar_test, predictions)
printTableLabel('Confusion Matrix and Accuracy Score')
print(f'Confusion Matrix: \n{conf_mat}')
print(f'Accuracy Score: {accu_score}')
#=============================================
# Build and Display Tree Graph
#=============================================
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph=pydotplus.graph_from_dot_data(out.getvalue())
image = Image(graph.create_png())
display(image)
------ Source End
<The End>