drsuthar
19 posts
drsuthar · 7 years ago
Text
Childhood Obesity and Social Deprivation in England
Test a Logistic Regression Model
Dataset: 2015 childhood obesity, various measures of social deprivation, and income in England.
Research topic: I have chosen to investigate the association between childhood obesity and absolute social deprivation, as measured by the IMD and the other indices of social deprivation, and by GDHI.
Code 📓: https://www.dropbox.com/s/r46h2imc95m9ycm/Childhood%20Obesity%20and%20Social%20Deprivation%20Study_V01a.docx?dl=0
This week's Python code on GitHub: https://github.com/rusmat0173/WesleyanDataMOOC/blob/master/WeslyanDataMOOC_Jupyter_notebook.ipynb
Week Four activities:
> Task: Write a blog entry that summarizes in a few sentences 1) what you found in your multiple regression analysis. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary. 2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable. 3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable.

All the explanatory variables in the working dataframe are quantitative and the data is cleaned up. The initial task is to create a bespoke dataframe of just the explanatory variables, so that I can later check the correlation between these variables and try to minimise multicollinearity.
Data Notes
I need to create a binary response variable. I will use 'Crime - Average score' as the basis for the binary response variable, but it needs to be converted from a quantitative variable. To make the binary split more interesting, I will assign the highest quartile of 'Crime - Average score' to High Crime = 1; the lower three quartiles of 'Crime - Average score' will be High Crime = 0.
First I want to create another (slimmer) dataframe to work on, and remove all the spaces, commas and dashes from the column names. Finally, I check the result by printing the headers.
Tumblr media
Now use 'CrimeAverageScore' to create a binary categorical variable as described above.
Tumblr media
Now test whether binary categorisation works properly.
Tumblr media
OK, that all works and we have a binary response variable 'CrimeCat’. (cutoff is given as 0.483, above.)
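For anyone following along without the screenshots, a minimal sketch of this quartile cut (the file name, dataframe name df and exact column names are assumptions for illustration, not the post's actual code) might look like:

import pandas as pd

# a minimal sketch; file and column names are assumed
df = pd.read_csv('deprivation_data.csv')                 # hypothetical input file
cutoff = df['CrimeAverageScore'].quantile(0.75)          # top-quartile boundary (reported as ~0.483 above)
df['CrimeCat'] = (df['CrimeAverageScore'] > cutoff).astype(int)   # 1 = high crime, 0 = otherwise
print(df['CrimeCat'].value_counts())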
Logistic Regression
Now to choose initial explanatory variables: I'll choose ['GDHI', 'IncomeAveragescore', 'EmploymentAveragescore', 'LivingEnvironmentAveragescore'] as the initial list, in that order.
Tumblr media Tumblr media
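The actual code is in the screenshots above; a minimal sketch of this first logistic regression with statsmodels, assuming the same dataframe df and the binary response CrimeCat created earlier, might look like:

import numpy as np
import statsmodels.formula.api as smf

# df, CrimeCat and GDHI are assumed from the steps above
model = smf.logit(formula='CrimeCat ~ GDHI', data=df).fit()
print(model.summary())

# odds ratio with its 95% confidence interval
conf = model.conf_int()
conf['OR'] = model.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))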
The result is very interesting: the coefficient is very low, the p-value very high, and the odds ratio is nearly 1 (with its 95% confidence interval including 1). This shows that there is almost no association between GDHI (Gross Disposable Household Income in an English local authority area) and the high-crime binary categorical variable in a local authority area.
So a hypothesis of a link between high crime and low income (GDHI) is not supported at all.
So repeat this with 'IncomeAveragescore'.
Tumblr media
This result is also very interesting! The coefficient is very high at nearly 36, with a 95% C.I. between 25.5 and 46. The p-value is very low and the odds ratio is huge, at ~3.5e+15. This shows that there is a very strong association between IncomeAveragescore and the high-crime binary categorical variable in a local authority area. A 1-unit increase in the Income deprivation score multiplies the odds of being in the high-crime category by ~3.5e+15; equivalently, a 0.1-unit increase multiplies the odds by roughly exp(3.6) ≈ 36.
So a hypothesis of a link between high crime and high income deprivation is well supported.
Now add 'EmploymentAveragescore' to test whether it is a confounding variable.
Tumblr media Tumblr media
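Again, the screenshots above carry the real code; a sketch of this confounding check, assuming the same dataframe df as before, might be:

import numpy as np
import statsmodels.formula.api as smf

# add EmploymentAveragescore alongside IncomeAveragescore (df assumed as above)
model2 = smf.logit(formula='CrimeCat ~ IncomeAveragescore + EmploymentAveragescore',
                   data=df).fit()
print(model2.summary())
print(np.exp(model2.params))   # compare the IncomeAveragescore odds ratio with the single-predictor model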
Adding 'EmploymentAveragescore' has improved the Pseudo R-squared to 55%, from 34% for the previous model. 'EmploymentAveragescore' is statistically significant on its own. Let's first look at the impact on the already-included 'IncomeAveragescore' variable. This is still statistically significant (p-value <0.001) and the coefficient is higher. Its odds ratio (controlling for 'EmploymentAveragescore') is even higher, at 5.6e+49. Summary: 'EmploymentAveragescore' is a confounding variable, as it has significantly changed the parameters for 'IncomeAveragescore' whilst being highly significant itself. Notably its coefficient is strongly negative, which says, roughly, "as Employment deprivation gets worse, the chances of being in a high-crime area are lower." This is somewhat counter-intuitive, but it does tie in with the increase in the positive odds ratio (and coefficient) for 'IncomeAveragescore'.
Further logistic regressions
Adding 'LivingEnvironmentAveragescore' as a quantitative variable.
Tumblr media Tumblr media
Adding 'LivingEnvironmentAveragescore' has had no impact on the Pseudo R-squared; it is not statistically significant (p-value of 0.527) and has an odds ratio of approximately 1 (0.963). So this is not a confounding variable and does not improve the model.
I also repeated this analysis for 'BarrierstoHousingandServicesAveragescore'. The result was very similar, also statistically insignificant. (The same was true when including either 'EducationSkillsandTrainingAveragescore' or 'HealthDeprivationandDisabilityAveragescore'.)
Summary: So essentially the model explaining the binary categorical variable CrimeCat only works with 'IncomeAveragescore' and 'EmploymentAveragescore', though the latter (counter-intuitively) in a negative sense.
drsuthar · 7 years ago
Text
Week 3
Tumblr media Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 3
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

# Load the dataset
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)

# Data management and cleaning
nsc.columns = map(str.upper, nsc.columns)
names = ['AGE', 'S4AQ7', 'SEX', 'S4AQ6A', 'S2BQ2D', 'S2BQ2E', 'S2BQ2FR',
         'S2BQ3A', 'S2BQ3B', 'S2BQ3CR', 'ETHRACE2A', 'S1Q5A']
nsc = nsc[names]
nsc = nsc.copy()
for col in names:
    nsc[col] = pd.to_numeric(nsc[col], errors='coerce')

nsc['S4AQ7'] = nsc['S4AQ7'].replace(99, np.nan)
nsc['S4AQ6A'] = nsc['S4AQ6A'].replace(99, np.nan)
nsc['S2BQ2D'] = nsc['S2BQ2D'].replace(99, np.nan)
nsc['S2BQ2E'] = nsc['S2BQ2E'].replace(99, np.nan)
nsc['S2BQ2FR'] = nsc['S2BQ2FR'].replace(99, np.nan)
nsc['S2BQ3A'] = nsc['S2BQ3A'].replace(99, np.nan)
nsc['S2BQ3B'] = nsc['S2BQ3B'].replace(99, np.nan)
nsc['S2BQ3CR'] = nsc['S2BQ3CR'].replace(999, np.nan)
nsc['S1Q5A'] = nsc['S1Q5A'].replace(99, np.nan)
nsc['SEX'] = nsc['SEX'].astype('category')
nsc['ETHRACE2A'] = nsc['ETHRACE2A'].astype('category')
clean = nsc.dropna()
n = clean[(clean['AGE'] > 25)]  # Subsetting for people older than 25 years old

# Variables of interest
"""
S4AQ7     NUMBER OF EPISODES (DEPRESSION)
SEX
S4AQ6A    AGE AT ONSET OF FIRST EPISODE
S2BQ2D    AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E    NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR   DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE (BASED ON S2BQ2H IF ONLY 1 EPISODE)
S2BQ3A    AGE AT ONSET OF ALCOHOL ABUSE
S2BQ3B    NUMBER OF EPISODES OF ALCOHOL ABUSE
S2BQ3CR   DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL ABUSE
ETHRACE2A IMPUTED RACE/ETHNICITY
S1Q5A     NUMBER OF CHILDREN EVER HAD
"""

# select predictor variables and target variable as separate data sets
exp = n[['SEX', 'S4AQ6A', 'S2BQ2D', 'S2BQ2E', 'S2BQ2FR', 'S2BQ3A', 'S2BQ3B',
         'S2BQ3CR', 'ETHRACE2A', 'S1Q5A']]
dep = n.S4AQ7

# standardize predictors to have mean=0 and sd=1
predictors = exp.copy()
from sklearn import preprocessing
predictors['S4AQ6A'] = preprocessing.scale(predictors['S4AQ6A'].astype('float64'))
predictors['S2BQ2D'] = preprocessing.scale(predictors['S2BQ2D'].astype('float64'))
predictors['S2BQ2FR'] = preprocessing.scale(predictors['S2BQ2FR'].astype('float64'))
predictors['S2BQ3A'] = preprocessing.scale(predictors['S2BQ3A'].astype('float64'))
predictors['S2BQ2E'] = preprocessing.scale(predictors['S2BQ2E'].astype('float64'))
predictors['S2BQ3CR'] = preprocessing.scale(predictors['S2BQ3CR'].astype('float64'))
predictors['S1Q5A'] = preprocessing.scale(predictors['S1Q5A'].astype('float64'))
predictors['S2BQ3B'] = preprocessing.scale(predictors['S2BQ3B'].astype('float64'))

# split data into train and test sets
exp_train, exp_test, dep_train, dep_test = train_test_split(predictors, dep,
                                                            test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(exp_train, dep_train)

# print variable names and regression coefficients
print("Regression coefficients:\n", dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(dep_train, model.predict(exp_train))
test_error = mean_squared_error(dep_test, model.predict(exp_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(exp_train, dep_train)
rsquared_test = model.score(exp_test, dep_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Output:
Regression coefficients: {'SEX': 0.0, 'S4AQ6A': -3.3549670327445114, 'S2BQ2D': 0.3905379532363508, 'S2BQ2E': 0.974223803746794, 'S2BQ2FR': 0.0, 'S2BQ3A': 0.4900861074604128, 'S2BQ3B': 2.430554111129614, 'S2BQ3CR': 0.8887820963998001, 'ETHRACE2A': -0.043225269741326254, 'S1Q5A': 0.2994884387283313}
training data MSE 263.10125855955783
test data MSE 211.5459199323467
training data R-square 0.08476097792937896
test data R-square 0.10089940145604315
Tumblr media Tumblr media
Data is subsetted for people above age 25.
SEX and DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE have an insignificant influence on NUMBER OF EPISODES (DEPRESSION), and are therefore given a weight of 0 by the lasso regression model. S4AQ6A (AGE AT ONSET OF FIRST EPISODE) has a negative impact on NUMBER OF EPISODES (DEPRESSION): the lower the age at onset, the higher the number of episodes of depression. S2BQ3B (NUMBER OF EPISODES OF ALCOHOL ABUSE) is positively related to the dependent variable.
R squared on the training data is 8.5%, i.e. the fitted model explains 8.5% of the variation in the dependent variable. R squared on the test data is 10%. The test score being no worse than the training score suggests that the lasso model generalizes reasonably well to unseen data.
The regression coefficients increase in size and reach saturation along the path. Insignificant variables are given a coefficient of 0. The mean squared error for each fold converges toward the average mean squared error along the path; the average MSE drops and then stays roughly constant.
drsuthar · 7 years ago
Text
Week 4
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Data Management
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)
nsc['S2AQ20'] = nsc['S2AQ20'].replace(99, np.nan)
nsc['S2BQ2D'] = nsc['S2BQ2D'].replace(99, np.nan)
nsc['S2BQ2E'] = nsc['S2BQ2E'].replace(99, np.nan)
nsc['S2BQ2FR'] = nsc['S2BQ2FR'].replace(999, np.nan)
nsc['S4AQ7'] = nsc['S4AQ7'].replace(99, np.nan)
nsc['S4AQ9DR'] = nsc['S4AQ9DR'].replace([9997, 9998, 9999], np.nan)
nsc['ETOTLCA2'] = pd.to_numeric(nsc['ETOTLCA2'], errors='coerce')
var = ["S2AQ20", "S2BQ2D", "S2BQ2E", "S2BQ2FR", "S4AQ7", "S4AQ9DR", "ETOTLCA2"]
var1 = ["S2AQ20", "S2BQ2D", "S2BQ2E", "S2BQ2FR", "S4AQ7", "S4AQ9DR", "ETOTLCA2", "AGE"]

for i in var1:
    nsc[i] = pd.to_numeric(nsc[i], errors='coerce')

c1 = nsc[var1]
c1 = c1.dropna()
c = nsc[var]
c = c.dropna()
cpy = c.copy()

# Scaling
for j in var:
    cpy[j] = preprocessing.scale(cpy[j].astype('float64'))
"""
Clustering Variables
S2AQ20   DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
S2BQ2D   AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E   NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR  DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE
S4AQ7    NUMBER OF EPISODES (DEPRESSION)
S4AQ9DR  DURATION (WEEKS) OF ONLY/LONGEST EPISODE (DEPRESSION)
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL TYPES OF ALCOHOLIC BEVERAGES COMBINED
"""

# train and test split
train, test = train_test_split(cpy, test_size=.3, random_state=123)

# k-means clustering
from scipy.spatial.distance import cdist
clusters = range(1, 20)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(train)
    cls = model.predict(train)
    meandist.append(sum(np.min(cdist(train, model.cluster_centers_, 'euclidean'), axis=1))
                    / train.shape[0])

"""
Plot average distance of observations from the cluster centroid
to use the Elbow Method to identify the number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
# Interpret 6 cluster solution
model6 = KMeans(n_clusters=6)
model6.fit(train)
clusassign = model6.predict(train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model6.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 6 Clusters')
plt.show()

"""
BEGIN multiple steps to merge cluster assignment with clustering variables
to examine cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(train['index'])
# create a list of cluster assignments
labels = list(model6.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe, to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pd.merge(train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables
to examine cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in AGE using ANOVA
# first have to merge AGE with clustering variables and cluster assignment data
data = c1['AGE']
# split AGE data into train and test sets
n_train, n_test = train_test_split(data, test_size=.3, random_state=123)
n_train1 = pd.DataFrame(n_train)
n_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(n_train1, merged_train, on='index')
sub1 = merged_train_all[['AGE', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='AGE ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for AGE by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for AGE by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['AGE'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Tumblr media Tumblr media
Clustering variable means by cluster:

cluster  index         S2AQ20     S2BQ2D     S2BQ2E     S2BQ2FR    S4AQ7      S4AQ9DR    ETOTLCA2
0        20519.575910  -0.123316  -0.396928  -0.214280  -0.164058  -0.276760  -0.154849  -0.075416
1        21446.183575   0.005736   1.371894  -0.224220  -0.249446  -0.265347  -0.147045   0.229559
2        18984.911765   1.091913   0.572542   1.876836   5.140055   0.717593  -0.126325   0.013964
3        19319.125000   1.332112   0.452860   0.550371   0.748134   1.875452   6.630989  -0.088925
4        18855.412698   0.035524   0.091741  -0.222021  -0.175676   3.286241  -0.085770  -0.027003
5        18604.514286   0.020961   0.060241   4.531810  -0.114797   0.664498  -0.134659   0.014416
OLS Regression Results
Dep. Variable: AGE          R-squared: 0.156          Adj. R-squared: 0.152
Model: OLS                  Method: Least Squares     F-statistic: 42.61
Date: Tue, 22 May 2018      Time: 17:55:46            Prob (F-statistic): 2.29e-40
No. Observations: 1160      Df Residuals: 1154        Df Model: 5
Log-Likelihood: -4492.7     AIC: 8997.                BIC: 9028.
Covariance Type: nonrobust

                     coef    std err        t      P>|t|     [0.025     0.975]
Intercept          34.6186     0.413    83.774      0.000     33.808     35.429
C(cluster)[T.1]    12.8066     0.910    14.072      0.000     11.021     14.592
C(cluster)[T.2]     7.2932     2.043     3.570      0.000      3.285     11.302
C(cluster)[T.3]     8.0481     2.417     3.330      0.001      3.306     12.790
C(cluster)[T.4]     6.0322     1.527     3.951      0.000      3.037      9.028
C(cluster)[T.5]     4.8671     2.015     2.416      0.016      0.914      8.820

Omnibus: 87.652             Durbin-Watson: 1.892      Jarque-Bera (JB): 106.462
Prob(Omnibus): 0.000        Skew: 0.706               Prob(JB): 7.62e-24
Kurtosis: 3.460             Cond. No.: 7.24

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Means for AGE by cluster:
cluster    AGE
0          34.618570
1          47.425121
2          41.911765
3          42.666667
4          40.650794
5          39.485714

Standard deviations for AGE by cluster:
cluster    AGE
0          11.702770
1          10.248511
2          14.577624
3          13.143577
4          13.203322
5          11.647700

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1  group2  meandiff    lower       upper      reject
0       1        12.8066     10.2087     15.4044   True
0       2         7.2932      1.4616     13.1248   True
0       3         8.0481      1.149      14.9472   True
0       4         6.0322      1.674      10.3904   True
0       5         4.8671     -0.884      10.6183   False
1       2        -5.5134    -11.6756      0.6489   False
1       3        -4.7585    -11.9392      2.4223   False
1       4        -6.7743    -11.566      -1.9827   True
1       5        -7.9394    -14.0256     -1.8532   True
2       3         0.7549     -8.1233      9.6331   False
2       4        -1.261      -8.3475      5.8255   False
2       5        -2.4261    -10.4448      5.5927   False
3       4        -2.0159    -10.0039      5.9722   False
3       5        -3.181     -12.0065      5.6446   False
4       5        -1.1651     -8.1855      5.8554   False
Summary:
Clustering was done using the following variables:
S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
S2BQ2D AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE
S4AQ7 NUMBER OF EPISODES (DEPRESSION)
S4AQ9DR DURATION (WEEKS) OF ONLY/LONGEST EPISODE (DEPRESSION)
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL TYPES OF ALCOHOLIC BEVERAGES COMBINED
The elbow curve suggests segmenting the data into 6 clusters.
The two-dimensional representation of the 6 clusters gives two closely packed clusters and four loose clusters.
Performing ANOVA to examine cluster differences in age reveals a significant relationship between cluster number and age, as the p-value is 2.29e-40 (<.05).
The Tukey HSD post-hoc test gives significant mean differences in age between clusters 0–1, 0–2, 0–3, 0–4, 1–4 and 1–5.
drsuthar · 7 years ago
Text
Week 2: Random Forest
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from pandas import DataFrame, Series
import matplotlib.pylab as plt
from sklearn.metrics import classification_report
import sklearn.metrics

# Feature importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

# Loading data
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)
nsc.dtypes

# converting all working variables to numeric
for i in ['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX']:
    nsc[i] = pd.to_numeric(nsc[i], errors='coerce')

nsc['S2BQ1A17'] = nsc['S2BQ1A17'].replace(9, np.nan)
clean = nsc.dropna()
exp = clean[['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX']]
dep = clean.MAJORDEP12
exp_train, exp_test, dep_train, dep_test = train_test_split(exp, dep, test_size=.3)
print(exp_train.shape)
print(exp_test.shape)
print(dep_train.shape)
print(dep_test.shape)

# Building model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20)
classifier = classifier.fit(exp_train, dep_train)
predictions = classifier.predict(exp_test)

print("\nConfusion matrix: \n", sklearn.metrics.confusion_matrix(dep_test, predictions))
print("\nAccuracy = ", sklearn.metrics.accuracy_score(dep_test, predictions) * 100, "%")

# Extra trees model
extrees = ExtraTreesClassifier()
extrees.fit(exp_train, dep_train)
# relative importance of each variable
print("Relative importance of variable: \n", extrees.feature_importances_, " in order \n", exp.columns)

# Testing accuracy on varying number of trees
trees = range(20)
acc = np.zeros(20)
for idx in range(20):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(exp_train, dep_train)
    preds = classifier.predict(exp_test)
    acc[idx] = sklearn.metrics.accuracy_score(dep_test, preds)

plt.cla()
plt.plot(trees, acc)
Output:
(342, 6)
(147, 6)
(342,)
(147,)
Confusion matrix:
[[94 13]
[31  9]]
Accuracy =  70.06802721088435 %
Relative importance of variable:
[0.42760142 0.05358502 0.23470483 0.0770946  0.15433372 0.05268041]  in order
Index(['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX'], dtype='object')
Out[241]: [<matplotlib.lines.Line2D at 0x7f1d257ae240>]
Tumblr media
Summary:
Accuracy of our fitted random forest model with 20 trees is around 70%, which is over 5 percentage points better than what we got with the decision tree. Thus we can conclude that a random forest model is capable of surpassing a decision tree model in correctly predicting classes.
Descending order of importance of the explanatory variables on the dependent variable (major depression in the last 12 months):
AGE > MARITAL STATUS > ALCOHOL DEPENDENCE LAST 12 MONTHS > EVER CONTINUE TO DRINK EVEN THOUGH CAUSING HEALTH PROBLEM > SEX > NICOTINE DEPENDENCE LAST 12 MONTHS
It was observed that increasing the number of trees in the forest initially increases accuracy, but further increments cause volatility in the accuracy scores and there is no guarantee of an increase in accuracy.
drsuthar · 7 years ago
Text
Week 1:Machine Learning - Decision Trees
This is the first task of the Machine Learning Course.
Here are my variables:
Income, which is an explanatory variable
Alcohol, also an explanatory variable
Life, which is the response variable
Decision Tree
This is what the decision tree looks like:
Tumblr media
Interpretation:
The resulting tree starts with a split on the income variable, my second explanatory variable.
This binary variable has the value zero (0) for income levels less than or equal to the mean, and the value one (1) for income greater than the mean.
In the first split we can see that 26 countries have life expectancy and income levels greater than the mean, while the other 76 countries have life expectancy less than the mean.
The second split divides the remaining nodes according to alcohol consumption levels, and so on.
We can see that the majority of countries with life expectancy greater than the mean have an alcohol consumption between 2.5 and 3.5 litres per year.
Code:
import pandas as pd
import numpy as np
from collections import OrderedDict
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
import itertools

# Variables Descriptions
INCOME = "2010 Gross Domestic Product per capita in constant 2000 US$"
ALCOHOL = "2008 alcohol consumption (litres, age 15+)"
LIFE = "2011 life expectancy at birth (years)"

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Load from CSV
data = pd.read_csv('gapminder.csv', skip_blank_lines=True,
                   usecols=['country', 'incomeperperson',
                            'alcconsumption', 'lifeexpectancy'])

data.columns = ['country', 'income', 'alcohol', 'life']

# convert variables to numeric format (invalid values become NaN)
data['alcohol'] = pd.to_numeric(data['alcohol'], errors='coerce')
data['income'] = pd.to_numeric(data['income'], errors='coerce')
data['life'] = pd.to_numeric(data['life'], errors='coerce')

# Remove rows with NaN values
data = data.dropna(axis=0, how='any')

# Copy dataframe to preserve the original
data1 = data.copy()

# Mean, Min and Max of life expectancy
meal = data1.life.mean()
minl = data1.life.min()
maxl = data1.life.max()

# Create categorical response variable life (two levels based on the mean)
data1['life'] = pd.cut(data.life, [np.floor(minl), meal, np.ceil(maxl)], labels=['<=69', '>69'])
data1['life'] = data1['life'].astype('category')

# Mean, Min and Max of alcohol
meaa = data1.alcohol.mean()
mina = data1.alcohol.min()
maxa = data1.alcohol.max()

# Categorical explanatory variable (two levels based on the mean)
data1['alcohol'] = pd.cut(data.alcohol, [np.floor(mina), meaa, np.ceil(maxa)],
                          labels=[0, 1])

cat1 = pd.cut(data.alcohol, 5).cat.categories
data1["alcohol"] = pd.cut(data.alcohol, 5, labels=['0', '1', '2', '3', '4'])
data1["alcohol"] = data1["alcohol"].astype('category')

# Mean, Min and Max of income
meai = data1.income.mean()
mini = data1.income.min()
maxi = data1.income.max()

# Categorical explanatory variable (two levels based on the mean)
data1['income'] = pd.cut(data.income, [np.floor(mini), meai, np.ceil(maxi)],
                         labels=[0, 1])
data1["income"] = data1["income"].astype('category')

# convert category labels back to numeric format
data1['alcohol'] = pd.to_numeric(data1['alcohol'], errors='coerce')
data1['income'] = pd.to_numeric(data1['income'], errors='coerce')

data1 = data1.dropna(axis=0, how='any')

predictors = data1[['alcohol', 'income']]
targets = data1.life
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

# Build model on training data
clf = DecisionTreeClassifier()
clf = clf.fit(pred_train, tar_train)

predictions = clf.predict(pred_test)

accuracy = sklearn.metrics.accuracy_score(tar_test, predictions)
print('Accuracy Score: ', accuracy, '\n')

# Displaying the decision tree
out = StringIO()
tree.export_graphviz(clf, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
drsuthar · 7 years ago
Text
Week 2 Submission
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week - 1 Submission
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 4: Correlation coefficient with Moderation
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 2: Chi square test
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 1: ANOVA Analysis between suicide rate and alcohol consumption
Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 3 : Generating a Correlation Coefficient
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week - 1: ANOVA Analysis
Therefore, we have enough evidence to reject the null hypothesis (H0) and accept the alternate hypothesis (Ha).
Tumblr media Tumblr media Tumblr media
This ANOVA analysis shows the dependency of suicide rates on income per person and alcohol consumption.
The p-values of both analyses are less than 0.05. Therefore, I have enough evidence to reject the null hypothesis (H0) and accept the alternate hypothesis (Ha).
As a result, the ANOVA revealed that suicide rate is significantly associated with both income per person and alcohol consumption.
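The full code is in the screenshots above; a minimal sketch of this kind of ANOVA with statsmodels, assuming the Gapminder column names suicideper100th, incomeperperson and alcconsumption and a simple below/above-the-mean grouping of each explanatory variable (both assumptions, not necessarily my exact code), might look like:

import pandas as pd
import statsmodels.formula.api as smf

# assumed file and column names, for illustration only
data = pd.read_csv('gapminder.csv')
cols = ['suicideper100th', 'incomeperperson', 'alcconsumption']
for col in cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data = data.dropna(subset=cols)

# split each explanatory variable at its mean into low (0) / high (1) groups
data['incomegrp'] = (data['incomeperperson'] > data['incomeperperson'].mean()).astype(int)
data['alcgrp'] = (data['alcconsumption'] > data['alcconsumption'].mean()).astype(int)

# one-way ANOVA of suicide rate by each grouping
for grp in ['incomegrp', 'alcgrp']:
    model = smf.ols('suicideper100th ~ C(' + grp + ')', data=data).fit()
    print(model.summary())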
drsuthar · 7 years ago
Text
Week - 1 : Decision Tree using C4.5 Pruning
Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 4 Assignment
Tumblr media Tumblr media Tumblr media Tumblr media
The Week 4 program's results can be accessed here (kept as a PDF file):
https://www.dropbox.com/s/4xcwjqv21ccu0qk/week4-results.pdf?dl=0
drsuthar · 7 years ago
Text
Week 3 Assignment
Tumblr media Tumblr media
The programs are:
Tumblr media Tumblr media
Thanks for reviewing.
drsuthar · 7 years ago
Text
Correlation between Suicide rate, Income per person, and Alcohol Consumption.
After looking through all the codebooks, I have selected the Gapminder data to study any positive or negative correlation between suicide rate, income per person, and alcohol consumption. I have selected the following columns of importance to me from the codebook [3].
Unique Id: Country – Name of country
Tumblr media
Research shows that an additional litre of ethanol in per-capita alcohol sales was estimated to increase suicide rates by 2.3% [1]. Similarly, I have noticed around me that people who consume alcohol are often depressed and sometimes have suicidal thoughts, which might explain a positive correlation between alcohol consumption and suicide rate. Income, on the other hand, has been reported to have a negative correlation with the suicide rate [2].
Based on this literature review, I expect that alcohol consumption and suicide rate will be positively correlated, and that income and suicide rate will be negatively correlated.
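As a rough preview of how I might compute these correlations later, a minimal sketch (assuming the Gapminder column names suicideper100th, incomeperperson and alcconsumption, which are assumptions for illustration) is:

import pandas as pd
import scipy.stats

# assumed file and column names, for illustration only
cols = ['suicideper100th', 'incomeperperson', 'alcconsumption']
data = pd.read_csv('gapminder.csv')
for col in cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data = data[cols].dropna()

# Pearson correlation of suicide rate with each explanatory variable
for col in ['incomeperperson', 'alcconsumption']:
    r, p = scipy.stats.pearsonr(data['suicideper100th'], data[col])
    print(col, ': r =', round(r, 3), ', p =', round(p, 4))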
Reference:
[1] Kerr, W. C., Subbaraman, M., & Ye, Y. (2011). Per capita alcohol consumption and suicide mortality in a panel of US states from 1950 to 2002. Drug and Alcohol Review,30(5), 473-480. doi:10.1111/j.1465-3362.2011.00306.x
[2] Simon, J. L. (1968). The Effect of Income on the Suicide Rate: A Paradox Resolved. American Journal of Sociology,74(3), 302-303. doi:10.1086/224644
[3] Retrieved from https://www.gapminder.org/
An image of the same, taken from the Word document, is below:
Tumblr media