drsuthar
19 posts
drsuthar · 7 years ago
Text
Childhood Obesity and Social Deprivation in England
Test a Logistic Regression Model
Dataset: 2015 childhood obesity, various measures of social deprivation, and income in England.
Research topic: I have chosen to investigate the association between childhood obesity and absolute social deprivation, as measured by the IMD and the other indices of social deprivation, and by GDHI.
Code 📓: https://www.dropbox.com/s/r46h2imc95m9ycm/Childhood%20Obesity%20and%20Social%20Deprivation%20Study_V01a.docx?dl=0
This week's Python code on GitHub: https://github.com/rusmat0173/WesleyanDataMOOC/blob/master/WeslyanDataMOOC_Jupyter_notebook.ipynb
Week Four activities:
> Task: Write a blog entry that summarizes in a few sentences 1) what you found in your multiple regression analysis. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary. 2) Report whether your results supported your hypothesis for the association between your primary explanatory variable and the response variable. 3) Discuss whether there was evidence of confounding for the association between your primary explanatory and response variable.

All the explanatory variables in the working dataframe are quantitative and the data is cleaned up. The initial task is to create a bespoke dataframe of just the explanatory variables, so that I can later check the correlation between these variables and try to minimise multicollinearity.
Data Notes
I need to create a binary response variable. I will use 'Crime - Average score' as the basis for the binary response variable, but it needs to be converted from a quantitative variable. To make the binary split more interesting, I will assign the highest quartile of 'Crime - Average score' to High Crime = 1; the lower three quartiles of 'Crime - Average score' will be High Crime = 0.
First I want to create another (slimmer) dataframe to work on, and remove all the spaces, commas and dashes from the column names. Finally, I check the result by printing the headers.
Tumblr media
Now use 'CrimeAverageScore' to create a binary categorical variable as described above.
Tumblr media
Now test whether binary categorisation works properly.
Tumblr media
OK, that all works and we have a binary response variable 'CrimeCat’. (cutoff is given as 0.483, above.)
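For anyone following along without the screenshots, a minimal sketch of this quartile cut (the file name, dataframe name df and exact column names are assumptions for illustration, not the post's actual code) might look like:

import pandas as pd

# a minimal sketch; file and column names are assumed
df = pd.read_csv('deprivation_data.csv')                 # hypothetical input file
cutoff = df['CrimeAverageScore'].quantile(0.75)          # top-quartile boundary (reported as ~0.483 above)
df['CrimeCat'] = (df['CrimeAverageScore'] > cutoff).astype(int)   # 1 = high crime, 0 = otherwise
print(df['CrimeCat'].value_counts())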
Logistic Regression
Now to choose initial explanatory variables: I'll choose ['GDHI', 'IncomeAveragescore', 'EmploymentAveragescore', 'LivingEnvironmentAveragescore'] as the initial list, in that order.
Tumblr media Tumblr media
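The actual code is in the screenshots above; a minimal sketch of this first logistic regression with statsmodels, assuming the same dataframe df and the binary response CrimeCat created earlier, might look like:

import numpy as np
import statsmodels.formula.api as smf

# df, CrimeCat and GDHI are assumed from the steps above
model = smf.logit(formula='CrimeCat ~ GDHI', data=df).fit()
print(model.summary())

# odds ratio with its 95% confidence interval
conf = model.conf_int()
conf['OR'] = model.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))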
The result is very interesting: the coefficient is very low, the p-value very high, and the odds ratio is nearly 1 (with its 95% confidence interval including 1). This shows that there is almost no association between GDHI (Gross Disposable Household Income in an English local authority area) and the high-crime binary categorical variable in a local authority area.
So a hypothesis of a link between high crime and low income (GDHI) is not supported at all.
So repeat this with 'IncomeAveragescore'.
Tumblr media
This result is also very interesting! The coefficient is very high at nearly 36, with a 95% C.I. between 25.5 and 46. The p-value is very low and the odds ratio is huge, at ~3.5e+15. This shows that there is a very strong association between IncomeAveragescore and the high-crime binary categorical variable in a local authority area. A 1-unit increase in the Income deprivation score multiplies the odds of being in the high-crime category by ~3.5e+15; equivalently, a 0.1-unit increase multiplies the odds by roughly exp(3.6) ≈ 36.
So a hypothesis of a link between high crime and high income deprivation is well supported.
Now add 'EmploymentAveragescore' to test whether it is a confounding variable.
Tumblr media Tumblr media
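Again, the screenshots above carry the real code; a sketch of this confounding check, assuming the same dataframe df as before, might be:

import numpy as np
import statsmodels.formula.api as smf

# add EmploymentAveragescore alongside IncomeAveragescore (df assumed as above)
model2 = smf.logit(formula='CrimeCat ~ IncomeAveragescore + EmploymentAveragescore',
                   data=df).fit()
print(model2.summary())
print(np.exp(model2.params))   # compare the IncomeAveragescore odds ratio with the single-predictor model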
Adding 'EmploymentAveragescore' has improved the Pseudo R-squared to 55%, from 34% for the previous model. 'EmploymentAveragescore' is statistically significant on its own. Let's first look at the impact on the already-included 'IncomeAveragescore' variable. This is still statistically significant (p-value <0.001) and the coefficient is higher. Its odds ratio (controlling for 'EmploymentAveragescore') is even higher, at 5.6e+49. Summary: 'EmploymentAveragescore' is a confounding variable, as it has significantly changed the parameters for 'IncomeAveragescore' whilst being highly significant itself. Notably its coefficient is strongly negative, which says, roughly, "as Employment deprivation gets worse, the chances of being in a high-crime area are lower." This is somewhat counter-intuitive, but it does tie in with the increase in the positive odds ratio (and coefficient) for 'IncomeAveragescore'.
Further logistic regressions
Adding 'LivingEnvironmentAveragescore' as a quantitative variable.
Tumblr media Tumblr media
Adding 'LivingEnvironmentAveragescore' has had no impact on the Pseudo R-squared; it is not statistically significant (p-value of 0.527) and has an odds ratio of approximately 1 (0.963). So this is not a confounding variable and does not improve the model.
I also repeated this analysis for 'BarrierstoHousingandServicesAveragescore'. The result was very similar, also statistically insignificant. (The same was true when including either 'EducationSkillsandTrainingAveragescore' or 'HealthDeprivationandDisabilityAveragescore'.)
Summary: So essentially the model explaining the binary categorical variable CrimeCat only works with 'IncomeAveragescore' and 'EmploymentAveragescore', though the latter (counter-intuitively) in a negative sense.
drsuthar · 7 years ago
Text
Week 3
Tumblr media Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 3
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

# Load the dataset
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)

# Data management and cleaning
nsc.columns = map(str.upper, nsc.columns)
names = ['AGE', 'S4AQ7', 'SEX', 'S4AQ6A', 'S2BQ2D', 'S2BQ2E', 'S2BQ2FR',
         'S2BQ3A', 'S2BQ3B', 'S2BQ3CR', 'ETHRACE2A', 'S1Q5A']
nsc = nsc[names]
nsc = nsc.copy()
for col in names:
    nsc[col] = pd.to_numeric(nsc[col], errors='coerce')

nsc['S4AQ7'] = nsc['S4AQ7'].replace(99, np.nan)
nsc['S4AQ6A'] = nsc['S4AQ6A'].replace(99, np.nan)
nsc['S2BQ2D'] = nsc['S2BQ2D'].replace(99, np.nan)
nsc['S2BQ2E'] = nsc['S2BQ2E'].replace(99, np.nan)
nsc['S2BQ2FR'] = nsc['S2BQ2FR'].replace(99, np.nan)
nsc['S2BQ3A'] = nsc['S2BQ3A'].replace(99, np.nan)
nsc['S2BQ3B'] = nsc['S2BQ3B'].replace(99, np.nan)
nsc['S2BQ3CR'] = nsc['S2BQ3CR'].replace(999, np.nan)
nsc['S1Q5A'] = nsc['S1Q5A'].replace(99, np.nan)
nsc['SEX'] = nsc['SEX'].astype('category')
nsc['ETHRACE2A'] = nsc['ETHRACE2A'].astype('category')
clean = nsc.dropna()
n = clean[(clean['AGE'] > 25)]  # Subsetting for people older than 25 years old

# Variables of interest
"""
S4AQ7     NUMBER OF EPISODES (DEPRESSION)
SEX
S4AQ6A    AGE AT ONSET OF FIRST EPISODE
S2BQ2D    AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E    NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR   DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE (BASED ON S2BQ2H IF ONLY 1 EPISODE)
S2BQ3A    AGE AT ONSET OF ALCOHOL ABUSE
S2BQ3B    NUMBER OF EPISODES OF ALCOHOL ABUSE
S2BQ3CR   DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL ABUSE
ETHRACE2A IMPUTED RACE/ETHNICITY
S1Q5A     NUMBER OF CHILDREN EVER HAD
"""

# select predictor variables and target variable as separate data sets
exp = n[['SEX', 'S4AQ6A', 'S2BQ2D', 'S2BQ2E', 'S2BQ2FR', 'S2BQ3A', 'S2BQ3B',
         'S2BQ3CR', 'ETHRACE2A', 'S1Q5A']]
dep = n.S4AQ7

# standardize predictors to have mean=0 and sd=1
predictors = exp.copy()
from sklearn import preprocessing
predictors['S4AQ6A'] = preprocessing.scale(predictors['S4AQ6A'].astype('float64'))
predictors['S2BQ2D'] = preprocessing.scale(predictors['S2BQ2D'].astype('float64'))
predictors['S2BQ2FR'] = preprocessing.scale(predictors['S2BQ2FR'].astype('float64'))
predictors['S2BQ3A'] = preprocessing.scale(predictors['S2BQ3A'].astype('float64'))
predictors['S2BQ2E'] = preprocessing.scale(predictors['S2BQ2E'].astype('float64'))
predictors['S2BQ3CR'] = preprocessing.scale(predictors['S2BQ3CR'].astype('float64'))
predictors['S1Q5A'] = preprocessing.scale(predictors['S1Q5A'].astype('float64'))
predictors['S2BQ3B'] = preprocessing.scale(predictors['S2BQ3B'].astype('float64'))

# split data into train and test sets
exp_train, exp_test, dep_train, dep_test = train_test_split(predictors, dep,
                                                            test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(exp_train, dep_train)

# print variable names and regression coefficients
print("Regression coefficients:\n", dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(dep_train, model.predict(exp_train))
test_error = mean_squared_error(dep_test, model.predict(exp_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(exp_train, dep_train)
rsquared_test = model.score(exp_test, dep_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Output:
Regression coefficients: {'SEX': 0.0, 'S4AQ6A': -3.3549670327445114, 'S2BQ2D': 0.3905379532363508, 'S2BQ2E': 0.974223803746794, 'S2BQ2FR': 0.0, 'S2BQ3A': 0.4900861074604128, 'S2BQ3B': 2.430554111129614, 'S2BQ3CR': 0.8887820963998001, 'ETHRACE2A': -0.043225269741326254, 'S1Q5A': 0.2994884387283313}
training data MSE 263.10125855955783
test data MSE 211.5459199323467
training data R-square 0.08476097792937896
test data R-square 0.10089940145604315
Tumblr media Tumblr media
Data is subsetted for people above age 25.
SEX and DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE have an insignificant influence on NUMBER OF EPISODES (DEPRESSION), and are therefore given a weight of 0 by the lasso regression model. S4AQ6A (AGE AT ONSET OF FIRST EPISODE) has a negative impact on NUMBER OF EPISODES (DEPRESSION): the lower the age at onset, the higher the number of episodes of depression. S2BQ3B (NUMBER OF EPISODES OF ALCOHOL ABUSE) is positively related to the dependent variable.
R squared on the training data is 8.5%, i.e. the fitted model explains 8.5% of the variation in the dependent variable. R squared on the test data is 10%. The test score being no worse than the training score suggests that the lasso model generalizes reasonably well to unseen data.
The regression coefficients increase in size and reach saturation along the path. Insignificant variables are given a coefficient of 0. The mean squared error for each fold converges toward the average mean squared error along the path; the average MSE drops and then stays roughly constant.
drsuthar · 7 years ago
Text
Week 4
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Data Management
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)
nsc['S2AQ20'] = nsc['S2AQ20'].replace(99, np.nan)
nsc['S2BQ2D'] = nsc['S2BQ2D'].replace(99, np.nan)
nsc['S2BQ2E'] = nsc['S2BQ2E'].replace(99, np.nan)
nsc['S2BQ2FR'] = nsc['S2BQ2FR'].replace(999, np.nan)
nsc['S4AQ7'] = nsc['S4AQ7'].replace(99, np.nan)
nsc['S4AQ9DR'] = nsc['S4AQ9DR'].replace([9997, 9998, 9999], np.nan)
nsc['ETOTLCA2'] = pd.to_numeric(nsc['ETOTLCA2'], errors='coerce')
var = ["S2AQ20", "S2BQ2D", "S2BQ2E", "S2BQ2FR", "S4AQ7", "S4AQ9DR", "ETOTLCA2"]
var1 = ["S2AQ20", "S2BQ2D", "S2BQ2E", "S2BQ2FR", "S4AQ7", "S4AQ9DR", "ETOTLCA2", "AGE"]

for i in var1:
    nsc[i] = pd.to_numeric(nsc[i], errors='coerce')

c1 = nsc[var1]
c1 = c1.dropna()
c = nsc[var]
c = c.dropna()
cpy = c.copy()

# Scaling
for j in var:
    cpy[j] = preprocessing.scale(cpy[j].astype('float64'))
"""
Clustering Variables
S2AQ20   DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
S2BQ2D   AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E   NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR  DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE
S4AQ7    NUMBER OF EPISODES (DEPRESSION)
S4AQ9DR  DURATION (WEEKS) OF ONLY/LONGEST EPISODE (DEPRESSION)
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL TYPES OF ALCOHOLIC BEVERAGES COMBINED
"""

# train and test split
train, test = train_test_split(cpy, test_size=.3, random_state=123)

# k-means clustering
from scipy.spatial.distance import cdist
clusters = range(1, 20)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(train)
    cls = model.predict(train)
    meandist.append(sum(np.min(cdist(train, model.cluster_centers_, 'euclidean'), axis=1))
                    / train.shape[0])

"""
Plot average distance of observations from the cluster centroid
to use the Elbow Method to identify the number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
# Interpret 6 cluster solution
model6 = KMeans(n_clusters=6)
model6.fit(train)
clusassign = model6.predict(train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model6.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 6 Clusters')
plt.show()

"""
BEGIN multiple steps to merge cluster assignment with clustering variables
to examine cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist = list(train['index'])
# create a list of cluster assignments
labels = list(model6.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe, to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train = pd.merge(train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables
to examine cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in AGE using ANOVA
# first have to merge AGE with clustering variables and cluster assignment data
data = c1['AGE']
# split AGE data into train and test sets
n_train, n_test = train_test_split(data, test_size=.3, random_state=123)
n_train1 = pd.DataFrame(n_train)
n_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(n_train1, merged_train, on='index')
sub1 = merged_train_all[['AGE', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='AGE ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for AGE by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for AGE by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['AGE'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Tumblr media Tumblr media
Clustering variable means by cluster:

cluster  index         S2AQ20     S2BQ2D     S2BQ2E     S2BQ2FR    S4AQ7      S4AQ9DR    ETOTLCA2
0        20519.575910  -0.123316  -0.396928  -0.214280  -0.164058  -0.276760  -0.154849  -0.075416
1        21446.183575   0.005736   1.371894  -0.224220  -0.249446  -0.265347  -0.147045   0.229559
2        18984.911765   1.091913   0.572542   1.876836   5.140055   0.717593  -0.126325   0.013964
3        19319.125000   1.332112   0.452860   0.550371   0.748134   1.875452   6.630989  -0.088925
4        18855.412698   0.035524   0.091741  -0.222021  -0.175676   3.286241  -0.085770  -0.027003
5        18604.514286   0.020961   0.060241   4.531810  -0.114797   0.664498  -0.134659   0.014416
OLS Regression Results
Dep. Variable: AGE          R-squared: 0.156          Adj. R-squared: 0.152
Model: OLS                  Method: Least Squares     F-statistic: 42.61
Date: Tue, 22 May 2018      Time: 17:55:46            Prob (F-statistic): 2.29e-40
No. Observations: 1160      Df Residuals: 1154        Df Model: 5
Log-Likelihood: -4492.7     AIC: 8997.                BIC: 9028.
Covariance Type: nonrobust

                     coef    std err        t      P>|t|     [0.025     0.975]
Intercept          34.6186     0.413    83.774      0.000     33.808     35.429
C(cluster)[T.1]    12.8066     0.910    14.072      0.000     11.021     14.592
C(cluster)[T.2]     7.2932     2.043     3.570      0.000      3.285     11.302
C(cluster)[T.3]     8.0481     2.417     3.330      0.001      3.306     12.790
C(cluster)[T.4]     6.0322     1.527     3.951      0.000      3.037      9.028
C(cluster)[T.5]     4.8671     2.015     2.416      0.016      0.914      8.820

Omnibus: 87.652             Durbin-Watson: 1.892      Jarque-Bera (JB): 106.462
Prob(Omnibus): 0.000        Skew: 0.706               Prob(JB): 7.62e-24
Kurtosis: 3.460             Cond. No.: 7.24

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Means for AGE by cluster:
cluster    AGE
0          34.618570
1          47.425121
2          41.911765
3          42.666667
4          40.650794
5          39.485714

Standard deviations for AGE by cluster:
cluster    AGE
0          11.702770
1          10.248511
2          14.577624
3          13.143577
4          13.203322
5          11.647700

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1  group2  meandiff    lower       upper      reject
0       1        12.8066     10.2087     15.4044   True
0       2         7.2932      1.4616     13.1248   True
0       3         8.0481      1.149      14.9472   True
0       4         6.0322      1.674      10.3904   True
0       5         4.8671     -0.884      10.6183   False
1       2        -5.5134    -11.6756      0.6489   False
1       3        -4.7585    -11.9392      2.4223   False
1       4        -6.7743    -11.566      -1.9827   True
1       5        -7.9394    -14.0256     -1.8532   True
2       3         0.7549     -8.1233      9.6331   False
2       4        -1.261      -8.3475      5.8255   False
2       5        -2.4261    -10.4448      5.5927   False
3       4        -2.0159    -10.0039      5.9722   False
3       5        -3.181     -12.0065      5.6446   False
4       5        -1.1651     -8.1855      5.8554   False
Summary:
Clustering was done using the following variables:
S2AQ20 DURATION (YEARS) OF PERIOD OF HEAVIEST DRINKING
S2BQ2D AGE AT ONSET OF ALCOHOL DEPENDENCE
S2BQ2E NUMBER OF EPISODES OF ALCOHOL DEPENDENCE
S2BQ2FR DURATION (MONTHS) OF LONGEST/ONLY EPISODE OF ALCOHOL DEPENDENCE
S4AQ7 NUMBER OF EPISODES (DEPRESSION)
S4AQ9DR DURATION (WEEKS) OF ONLY/LONGEST EPISODE (DEPRESSION)
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL TYPES OF ALCOHOLIC BEVERAGES COMBINED
The elbow curve suggests segmenting the data into 6 clusters.
The two-dimensional representation of the 6 clusters gives two closely packed clusters and four loose clusters.
Performing ANOVA to examine cluster differences in age reveals a significant relationship between cluster number and age, as the p-value is 2.29e-40 (<.05).
The Tukey HSD post-hoc test gives significant mean differences in age between clusters 0–1, 0–2, 0–3, 0–4, 1–4 and 1–5.
drsuthar · 7 years ago
Text
Week 2: Random Forest
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from pandas import DataFrame, Series
import matplotlib.pylab as plt
from sklearn.metrics import classification_report
import sklearn.metrics

# Feature importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

# Loading data
nsc = pd.read_csv("nesarc_pds.csv", low_memory=False)
nsc.dtypes

# converting all working variables to numeric
for i in ['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX']:
    nsc[i] = pd.to_numeric(nsc[i], errors='coerce')

nsc['S2BQ1A17'] = nsc['S2BQ1A17'].replace(9, np.nan)
clean = nsc.dropna()
exp = clean[['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX']]
dep = clean.MAJORDEP12
exp_train, exp_test, dep_train, dep_test = train_test_split(exp, dep, test_size=.3)
print(exp_train.shape)
print(exp_test.shape)
print(dep_train.shape)
print(dep_test.shape)

# Building model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20)
classifier = classifier.fit(exp_train, dep_train)
predictions = classifier.predict(exp_test)

print("\nConfusion matrix: \n", sklearn.metrics.confusion_matrix(dep_test, predictions))
print("\nAccuracy = ", sklearn.metrics.accuracy_score(dep_test, predictions) * 100, "%")

# Extra trees model
extrees = ExtraTreesClassifier()
extrees.fit(exp_train, dep_train)
# relative importance of each variable
print("Relative importance of variable: \n", extrees.feature_importances_, " in order \n", exp.columns)

# Testing accuracy on varying number of trees
trees = range(20)
acc = np.zeros(20)
for idx in range(20):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(exp_train, dep_train)
    preds = classifier.predict(exp_test)
    acc[idx] = sklearn.metrics.accuracy_score(dep_test, preds)

plt.cla()
plt.plot(trees, acc)
Output:
(342, 6)
(147, 6)
(342,)
(147,)
Confusion matrix:
[[94 13]
[31  9]]
Accuracy =  70.06802721088435 %
Relative importance of variable:
[0.42760142 0.05358502 0.23470483 0.0770946  0.15433372 0.05268041]  in order
Index(['AGE', 'SEX', 'MARITAL', 'S2BQ1A17', 'ALCABDEP12DX', 'TAB12MDX'], dtype='object')
Out[241]: [<matplotlib.lines.Line2D at 0x7f1d257ae240>]
Tumblr media
Summary:
Accuracy of our fitted random forest model with 20 trees is around 70%, which is over 5 percentage points better than what we got with the decision tree. Thus we can conclude that a random forest model is capable of surpassing a decision tree model in correctly predicting classes.
Descending order of importance of the explanatory variables on the dependent variable (major depression in the last 12 months):
AGE > MARITAL STATUS > ALCOHOL DEPENDENCE LAST 12 MONTHS > EVER CONTINUE TO DRINK EVEN THOUGH CAUSING HEALTH PROBLEM > SEX > NICOTINE DEPENDENCE LAST 12 MONTHS
It was observed that increasing the number of trees in the forest initially increases accuracy, but further increments cause volatility in the accuracy scores and there is no guarantee of an increase in accuracy.
drsuthar · 7 years ago
Text
Week 1:Machine Learning - Decision Trees
This is the first task of the Machine Learning Course.
Here are my variables:
Income, which is an explanatory variable
Alcohol, also an explanatory variable
Life, which is the response variable
Decision Tree
This is what the decision tree looks like:
Tumblr media
Interpretation:
The resulting tree starts with a split on the income variable, my second explanatory variable.
This binary variable has the value zero (0) for income levels less than or equal to the mean, and the value one (1) for income greater than the mean.
In the first split we can see that 26 countries have life expectancy and income levels greater than the mean, while the other 76 countries have life expectancy less than the mean.
The second split divides the remaining nodes according to alcohol consumption levels, and so on.
We can see that the majority of countries with life expectancy greater than the mean have an alcohol consumption between 2.5 and 3.5 litres per year.
Code:
import pandas as pd
import numpy as np
from collections import OrderedDict
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
import itertools

# Variables Descriptions
INCOME = "2010 Gross Domestic Product per capita in constant 2000 US$"
ALCOHOL = "2008 alcohol consumption (litres, age 15+)"
LIFE = "2011 life expectancy at birth (years)"

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Load from CSV
data = pd.read_csv('gapminder.csv', skip_blank_lines=True,
                   usecols=['country', 'incomeperperson',
                            'alcconsumption', 'lifeexpectancy'])

data.columns = ['country', 'income', 'alcohol', 'life']

# convert variables to numeric format (invalid values become NaN)
data['alcohol'] = pd.to_numeric(data['alcohol'], errors='coerce')
data['income'] = pd.to_numeric(data['income'], errors='coerce')
data['life'] = pd.to_numeric(data['life'], errors='coerce')

# Remove rows with NaN values
data = data.dropna(axis=0, how='any')

# Copy dataframe to preserve the original
data1 = data.copy()

# Mean, Min and Max of life expectancy
meal = data1.life.mean()
minl = data1.life.min()
maxl = data1.life.max()

# Create categorical response variable life (two levels based on the mean)
data1['life'] = pd.cut(data.life, [np.floor(minl), meal, np.ceil(maxl)], labels=['<=69', '>69'])
data1['life'] = data1['life'].astype('category')

# Mean, Min and Max of alcohol
meaa = data1.alcohol.mean()
mina = data1.alcohol.min()
maxa = data1.alcohol.max()

# Categorical explanatory variable (two levels based on the mean)
data1['alcohol'] = pd.cut(data.alcohol, [np.floor(mina), meaa, np.ceil(maxa)],
                          labels=[0, 1])

cat1 = pd.cut(data.alcohol, 5).cat.categories
data1["alcohol"] = pd.cut(data.alcohol, 5, labels=['0', '1', '2', '3', '4'])
data1["alcohol"] = data1["alcohol"].astype('category')

# Mean, Min and Max of income
meai = data1.income.mean()
mini = data1.income.min()
maxi = data1.income.max()

# Categorical explanatory variable (two levels based on the mean)
data1['income'] = pd.cut(data.income, [np.floor(mini), meai, np.ceil(maxi)],
                         labels=[0, 1])
data1["income"] = data1["income"].astype('category')

# convert category labels back to numeric format
data1['alcohol'] = pd.to_numeric(data1['alcohol'], errors='coerce')
data1['income'] = pd.to_numeric(data1['income'], errors='coerce')

data1 = data1.dropna(axis=0, how='any')

predictors = data1[['alcohol', 'income']]
targets = data1.life
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

# Build model on training data
clf = DecisionTreeClassifier()
clf = clf.fit(pred_train, tar_train)

predictions = clf.predict(pred_test)

accuracy = sklearn.metrics.accuracy_score(tar_test, predictions)
print('Accuracy Score: ', accuracy, '\n')

# Displaying the decision tree
out = StringIO()
tree.export_graphviz(clf, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
drsuthar · 7 years ago
Text
Week 2 Submission
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week - 1 Submission
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 4: Correlation coefficient with Moderation
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 2: Chi square test
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 1: ANOVA Analysis between suicide rate and alcohol consumption
Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 3 : Generating a Correlation Coefficient
Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week - 1: ANOVA Analysis
Therefore, we have enough evidence to reject the null hypothesis (H0) and accept the alternate hypothesis (Ha).
Tumblr media Tumblr media Tumblr media
This ANOVA analysis shows the dependency of suicide rates on income per person and alcohol consumption.
The p-values of both analyses are less than 0.05. Therefore, I have enough evidence to reject the null hypothesis (H0) and accept the alternate hypothesis (Ha).
As a result, the ANOVA revealed that suicide rate is significantly associated with both income per person and alcohol consumption.
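The full code is in the screenshots above; a minimal sketch of this kind of ANOVA with statsmodels, assuming the Gapminder column names suicideper100th, incomeperperson and alcconsumption and a simple below/above-the-mean grouping of each explanatory variable (both assumptions, not necessarily my exact code), might look like:

import pandas as pd
import statsmodels.formula.api as smf

# assumed file and column names, for illustration only
data = pd.read_csv('gapminder.csv')
cols = ['suicideper100th', 'incomeperperson', 'alcconsumption']
for col in cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data = data.dropna(subset=cols)

# split each explanatory variable at its mean into low (0) / high (1) groups
data['incomegrp'] = (data['incomeperperson'] > data['incomeperperson'].mean()).astype(int)
data['alcgrp'] = (data['alcconsumption'] > data['alcconsumption'].mean()).astype(int)

# one-way ANOVA of suicide rate by each grouping
for grp in ['incomegrp', 'alcgrp']:
    model = smf.ols('suicideper100th ~ C(' + grp + ')', data=data).fit()
    print(model.summary())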
drsuthar · 7 years ago
Text
Week - 1 : Decision Tree using C4.5 Pruning
Tumblr media Tumblr media Tumblr media
drsuthar · 7 years ago
Text
Week 4 Assignment
Tumblr media Tumblr media Tumblr media Tumblr media
The Week 4 program's results can be accessed here (kept as a PDF file):
https://www.dropbox.com/s/4xcwjqv21ccu0qk/week4-results.pdf?dl=0
drsuthar · 7 years ago
Text
Week 3 Assignment
Tumblr media Tumblr media
The programs are:
Tumblr media Tumblr media
Thanks for reviewing.
drsuthar · 7 years ago
Text
Correlation between Suicide rate, Income per person, and Alcohol Consumption.
After looking through all the codebooks, I have selected the Gapminder data to study any positive or negative correlation between suicide rate, income per person, and alcohol consumption. I have selected the following columns of importance to me from the codebook [3].
Unique Id: Country – Name of country
Tumblr media
Research shows that an additional litre of ethanol in per-capita alcohol sales was estimated to increase suicide rates by 2.3% [1]. Similarly, I have noticed around me that people who consume alcohol are often depressed and sometimes have suicidal thoughts, which might explain a positive correlation between alcohol consumption and suicide rate. Income, on the other hand, has been reported to have a negative correlation with the suicide rate [2].
Based on this literature review, I expect that alcohol consumption and suicide rate will be positively correlated, and that income and suicide rate will be negatively correlated.
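As a rough preview of how I might compute these correlations later, a minimal sketch (assuming the Gapminder column names suicideper100th, incomeperperson and alcconsumption, which are assumptions for illustration) is:

import pandas as pd
import scipy.stats

# assumed file and column names, for illustration only
cols = ['suicideper100th', 'incomeperperson', 'alcconsumption']
data = pd.read_csv('gapminder.csv')
for col in cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data = data[cols].dropna()

# Pearson correlation of suicide rate with each explanatory variable
for col in ['incomeperperson', 'alcconsumption']:
    r, p = scipy.stats.pearsonr(data['suicideper100th'], data[col])
    print(col, ': r =', round(r, 3), ', p =', round(p, 4))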
Reference:
[1] Kerr, W. C., Subbaraman, M., & Ye, Y. (2011). Per capita alcohol consumption and suicide mortality in a panel of US states from 1950 to 2002. Drug and Alcohol Review,30(5), 473-480. doi:10.1111/j.1465-3362.2011.00306.x
[2] Simon, J. L. (1968). The Effect of Income on the Suicide Rate: A Paradox Resolved. American Journal of Sociology,74(3), 302-303. doi:10.1086/224644
[3] Retrieved from https://www.gapminder.org/
An image of the same, taken from the Word document, is below:
Tumblr media