No, the best answer to anger is understanding!
The problem with anger is that, instinctively, your 20,000-node decision tree has realized that an inequality is happening and you can't do anything about it.
In order to overcome the upset feeling, you have to sit with the anger until the tree reveals what made you upset, and document it.
Once you understand why you became angry, you have defused the anger, and can then start the process of addressing the various points that caused it.
The more you understand the points in the 20,000-node decision tree, the more will slowly be exposed. So, prepare to have a conversation with people about the points you have uncovered, and see what additional points come to light.
So, while people see that you're being silent, they are misunderstanding how the process of resolving anger works.
You aren't being silent, you just aren't showing them what you are building.
Because what you are most likely getting angry at is bad behavior from someone in a position of power. And without allies, documentation, witnesses, support, and ways to confront the inequality, you are just a single person saying: What can I do?!
What can I do? You can do a lot when you raise awareness.
Week 1 Assignment - Running a Classification Tree
Introduction
In this assignment I am going to run a classification tree on my explanatory and response variables. My response variable is a person's confidence level about becoming wealthy. My explanatory variables are a person's gender, education level, age category, ethnicity (Non-Hispanic Black and Non-Hispanic White only), MSA status, and a binary annual income level (1 for an income higher than the population median income, 0 for lower than or equal to it). MSA stands for Metropolitan Statistical Area: a geographical region with a relatively high population density at its core and surrounding communities, comprising a total population of at least 50,000 people. This variable is called PPMSACAT in my data set, where 0 represents non-metro living status and 1 represents metro living status.
The following tables show the statistics of my explanatory variables.  
[Frequency tables for the six explanatory variables]
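As a minimal sketch, each of these tables can be reproduced by tabulating a variable's frequencies and percentages with pandas, using the same OutlookLife.csv file as the full program appended below; PPMSACAT is used here just as an example:

import pandas as pd

data = pd.read_csv("OutlookLife.csv")

# frequency and percentage table for one explanatory variable
counts = data['PPMSACAT'].value_counts().sort_index()
percs = data['PPMSACAT'].value_counts(normalize=True).sort_index() * 100
print(pd.DataFrame({'Frequency': counts, 'Perc.(%)': percs.round(2)}))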
From these explanatory variables I have recoded the following two into binary categorical variables (a code sketch follows the two tables below):
W1_P20 (Annual Income) to BI_INCOME_LEVEL: 0 means an income level less than or equal to the population median income, which falls in category 9 ($30,000 to $34,999), and 1 means higher than the population median income
-------------------------------------------------------------------------------------
Frequency Table of BI_INCOME_LEVEL
-------------------------------------------------------------------------------------
           Frequency Perc.(%)  Cumulative Frequency Cumulative Perc.(%)
<= Median       1011    49.56                  1011               49.56
> Median        1029    50.44                  2040              100.00
PPETHM (Ethnicity) to ETHN_BW: 0 means Non-Hispanic White ethnicity, and 1 means Non-Hispanic Black
-------------------------------------------------------------------------------------
Frequency Table of ETHN_BW
-------------------------------------------------------------------------------------
       Frequency Perc.(%)  Cumulative Frequency Cumulative Perc.(%)
White        800    39.22                   800               39.22
Black       1240    60.78                  2040              100.00
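The two recodes above can be sketched as follows; this mirrors the binary_income() helper and the bw_ethn map in the source appended below (codes of -1 mark invalid answers, which are filtered out later):

# ethnicity: 1 (White) -> 0, 2 (Black) -> 1, everything else -> -1 (invalid)
bw_ethn = {1: 0, 2: 1, 3: -1, 4: -1, 5: -1}
data['ETHN_BW'] = data['PPETHM'].map(bw_ethn)

# income: 1 if above the population median, 0 if at or below it (and positive)
median_inc = data['MAPPED_INCOME'].median()
data['BI_INCOME_LEVEL'] = data['MAPPED_INCOME'].apply(
    lambda x: 1 if x > median_inc else (0 if x > 0 else x))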
My response variable originally looks as shown in the table of W1_F4_D below. However, because the scikit-learn setup used here offers no pruning capability, and to avoid overfitting the resulting tree with too many branches, I have decided to recode my response variable into a binary categorical value: 1 for a person who feels it is hard to become wealthy, 0 for a person who feels it is easy.
[Frequency table for the original response variable W1_F4_D]
After condensing it to two levels (Easy and Hard), the statistics of my response variable look as follows.
-------------------------------------------------------------------------------------
Frequency Table of 'To Become Wealthy'
-------------------------------------------------------------------------------------
      Frequency Percentage(%)  Cumulative Frequency Cumulative Percentage(%)
Easy        216          9.69                   216                     9.69
Hard       2013         90.31                  2229                   100.00
(Note: the overall frequency was reduced from 2,294 to 2,229 (= 216 + 2,013). That is because this statistic was derived after dropping all of the NaN values from my data frame, so quite a few rows in the original data frame were eliminated from the count.)
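A minimal sketch of this step, matching the wealthy_difficulty map and the row filter in the source below: the response is collapsed to two levels, and every row carrying a negative (invalid) code in any managed variable is dropped, which is what shrinks the sample:

# collapse the 4-level response to binary: 1/2 -> Hard (1), 3/4 -> Easy (0)
wealthy_difficulty = {1: 1, 2: 1, 3: 0, 4: 0, -1: -1}
data['Hard2B_Rich'] = data['W1_F4_D'].map(wealthy_difficulty)

# keep only rows with valid (non-negative) codes in all managed variables
sub1 = data[(data['Hard2B_Rich'] >= 0) &
            (data['BI_INCOME_LEVEL'] >= 0) &
            (data['ETHN_BW'] >= 0)].copy()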
Program: 
The Python source is appended at the bottom of this report.
Analysis
I have run my Python program, and here is the result:
-------------------------------------------------------------------------------------
Training and Test Data Sets
-------------------------------------------------------------------------------------
pred_train: (1224, 6)
pred_test: (816, 6)
tar_train: (1224,)
tar_test: (816,)
The training sample has 1,224 observations (rows), 60% of the sample, and 6 explanatory variables (columns). The test sample has 816 observations, the remaining 40%, again with 6 explanatory variables.
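These shapes come from scikit-learn's train_test_split with test_size=.4, as in the source below; a minimal sketch (the split is random, so the exact rows in each set vary from run to run):

from sklearn.model_selection import train_test_split

predictors = sub1[['PPGENDER', 'PPEDUCAT', 'ETHN_BW',
                   'PPAGECAT', 'PPMSACAT', 'BI_INCOME_LEVEL']]
targets = sub1['Hard2B_Rich']

pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4)
print(pred_train.shape, pred_test.shape)   # (1224, 6) (816, 6)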
-------------------------------------------------------------------------------------
Confusion Matrix and Accuracy Score
-------------------------------------------------------------------------------------
Confusion Matrix: 
[[  5  69]
 [ 31 711]]
Accuracy Score: 0.8774509803921569
The confusion matrix shows the correct and incorrect classifications made by our decision tree. The diagonal entries, 5 and 711, are the number of true negatives (respondents who feel it is easy to become wealthy, correctly classified) and the number of true positives (respondents who feel it is hard, correctly classified), respectively. The 31 on the bottom left is the number of false negatives: respondents who feel it is hard to become wealthy classified as feeling it is easy. The 69 on the top right is the number of false positives: respondents who feel it is easy classified as feeling it is hard.
We can also look at the accuracy score, approximately 0.88, which suggests that the decision tree model classified about 88% of the test sample correctly as either feeling easy or feeling hard about becoming wealthy.
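Both numbers are produced by fitting the tree on the training set and scoring its predictions on the test set, as in the source below; a minimal sketch:

from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

# rows = actual class, columns = predicted class (0 = Easy, 1 = Hard)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))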
The following is the resulting tree graph:
[Decision tree graph]
The resulting tree starts with a split on X[1], my second explanatory variable, PPEDUCAT (people's education level), as depicted below:
[Top node of the tree]
The left branch and its child branches from the top node:
[Left branch of the tree]
From the node on X[3] (age category), another split is made on a person's ethnicity, variable X[2]: among the individuals with less than a Bachelor's degree (X[1] <= 3.5) and younger than approximately 40 (X[3] <= 3.5), 35 belong to the White ethnicity group (X[2] <= 0.5), while 486 belong to the Black ethnicity group.
From that node on X[2] with 521 samples, we see that 10 individuals have less than a High School education (X[1] <= 1.5), while 220 have more.
The interpretation can be continued in this way through all hierarchical levels. By doing so, we can understand the overall characteristics and intrinsic dependency patterns in the samples, which is essential when building a data model from collected data.
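One small variation on the plotting code in the source below makes this kind of reading easier: export_graphviz accepts a feature_names argument, so the splits are labeled PPEDUCAT, PPAGECAT, and so on instead of X[1] and X[3] (the output file name wealth_tree.png is just an illustrative choice):

from io import StringIO
from sklearn import tree
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     feature_names=['PPGENDER', 'PPEDUCAT', 'ETHN_BW',
                                    'PPAGECAT', 'PPMSACAT', 'BI_INCOME_LEVEL'])
graph = pydotplus.graph_from_dot_data(out.getvalue())
graph.write_png('wealth_tree.png')   # hypothetical output file name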
------ Source Begin 
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 06:14:38 2021

@author: ggonecrane
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics
from IPython.display import display, Image
from sklearn import tree
from io import StringIO
import pydotplus

#=============================================
# Methods Defined for Utility Purposes
#=============================================

def printTableLabel(label):
    print('\n')
    print('-------------------------------------------------------------------------------------')
    print(f'\t\t\t\t{label}')
    print('-------------------------------------------------------------------------------------')

def binary_income(value, median_inc):
    if value > median_inc:
        return 1
    if value > 0:
        return 0
    return value

def print_expVar_stat(data, exp_var, index_str, head_label):
    w_countsT = data[exp_var].value_counts().sort_index()
    w_percT = data[exp_var].value_counts(normalize=True).sort_index()
    df2 = w_countsT.to_frame()
    df2['perc'] = w_percT.apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2['cum_sum'] = w_countsT.cumsum()
    df2['cum_perc'] = w_percT.cumsum().apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2.index = index_str
    df2.columns = ['Frequency', 'Perc.(%)', 'Cumulative Frequency', 'Cumulative Perc.(%)']

    printTableLabel(f'Frequency Table of {head_label}')
    print(df2)

def print_responseVar_stat(data, response_var):
    w_countsT = data[response_var].value_counts().sort_index()
    w_percT = data[response_var].value_counts(normalize=True).sort_index()
    df2 = w_countsT.to_frame()
    df2['perc'] = w_percT.apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2['cum_sum'] = w_countsT.cumsum()
    df2['cum_perc'] = w_percT.cumsum().apply(lambda x: '{:.2f}'.format(round(x * 100, 2)))
    df2.index = ['Easy', 'Hard']
    df2.columns = ['Frequency', 'Percentage(%)', 'Cumulative Frequency', 'Cumulative Percentage(%)']

    printTableLabel("Frequency Table of 'To Become Wealthy'")
    print(df2)

#=============================================
# Variable Declarations
#=============================================

mapped_income = 'MAPPED_INCOME'
income_col = 'W1_P20'
bi_income_col = 'BI_INCOME_LEVEL'
wealthy_col = 'Hard2B_Rich'
col_edu_col = 'College_Grad'
bw_ethn_col = 'ETHN_BW'

#=============================================
# Data Management Conversion Matrix
#=============================================

wealthy_difficulty = {1: 1, 2: 1, 3: 0, 4: 0, -1: -1}
col_educ = {1: 0, 2: 0, 3: 1, 4: 1}
bw_ethn = {1: 0, 2: 1, 3: -1, 4: -1, 5: -1}
approx_income = {1: 2500, 2: 6250, 3: 8750, 4: 11250, 5: 13750, 6: 17500, 7: 22500,
                 8: 27500, 9: 32500, 10: 37500, 11: 45000, 12: 55000, 13: 67500, 14: 80000,
                 15: 92500, 16: 112500, 17: 137500, 18: 162500, 19: 200000, -1: 0}

#=============================================
# Data Loading and Management
#=============================================

data = pd.read_csv("OutlookLife.csv")

# map income categories to approximate dollar amounts
data[mapped_income] = data[income_col].map(approx_income)

# re-code annual income to bi-level (higher or lower than median value)
median_inc = data[mapped_income].median()
data[bi_income_col] = data[mapped_income].apply(lambda x: binary_income(x, median_inc))

data[wealthy_col] = data['W1_F4_D'].map(wealthy_difficulty)

# data[col_edu_col] = data['PPEDUCAT'].map(col_educ)
data[bw_ethn_col] = data['PPETHM'].map(bw_ethn)

# keep only rows with valid (non-negative) codes in all managed variables
sub1 = data[(data[wealthy_col] >= 0) & (data[bi_income_col] >= 0) & (data[bw_ethn_col] >= 0)].copy()

#=============================================
# Print Managed Variable Stats
#=============================================

print_expVar_stat(sub1, bi_income_col, ["<= Median", "> Median"], 'BI_INCOME_LEVEL')
print_expVar_stat(sub1, bw_ethn_col, ["White", "Black"], 'ETHN_BW')
print_responseVar_stat(sub1, wealthy_col)

#=============================================
# Modeling and Prediction
#=============================================

predictors = sub1[['PPGENDER', 'PPEDUCAT', bw_ethn_col, 'PPAGECAT',
                   'PPMSACAT', bi_income_col]]
targets = sub1[wealthy_col]

# Split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

printTableLabel('Training and Test Data Sets')
print(f'pred_train: {pred_train.shape}')
print(f'pred_test: {pred_test.shape}')
print(f'tar_train: {tar_train.shape}')
print(f'tar_test: {tar_test.shape}')

#=============================================
# Build Model on Training Data
#=============================================

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

conf_mat = sklearn.metrics.confusion_matrix(tar_test, predictions)
accu_score = sklearn.metrics.accuracy_score(tar_test, predictions)

printTableLabel('Confusion Matrix and Accuracy Score')
print(f'Confusion Matrix: \n{conf_mat}')
print(f'Accuracy Score: {accu_score}')

#=============================================
# Build and Display Tree Graph
#=============================================

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
image = Image(graph.create_png())
display(image)
---- Source End
<The End>