k-means cluster analysis
Below is the k-means cluster analysis I completed on family history of alcoholism using data from the NESARC dataset. I chose the following clustering variables: alcoholic father, mother, brother, sister, son, daughter, married to an alcoholic, and currently living with an alcoholic. The first part is the k-means cluster analysis run over 1-9 clusters (range(1, 10)) using the elbow method, and the plot below displays this output. Looking at the plot, the curve appears to level off between 2 and 4 clusters, and then there is another bend between 5 and 7 clusters where it could be leveling off as well.
The next part of the analysis was a canonical discriminant analysis; in this plot only 3 clusters are displayed (KMeans(n_clusters=3)), using the first 2 canonical variables, which are ordered by the proportion of variance they account for. The dark blue cluster is for the most part packed low on the canonical variables but has a few spread-out points, the yellow cluster is higher and not as tightly packed, and the green cluster is the highest of all, though it contains only 2 points. A quick check of that variance ordering is sketched below.
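The proportion of variance each canonical variable accounts for can be printed from the PCA object. A minimal sketch, assuming the standardized training data (clus_train) built in the code below:

from sklearn.decomposition import PCA

# Fit the same 2-component PCA used for the scatterplot and print the
# share of variance each canonical (principal) component accounts for;
# the first value should be the largest.
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
print(pca_2.explained_variance_ratio_)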
I then used analysis of variance to check whether there are significant differences between clusters, with drinking in the last 12 months as the response variable and cluster assignment as the explanatory variable. Looking at the analysis of variance table below, the clusters did not differ significantly on drinking in the last 12 months.
Clustering variable means by cluster
                index     S2DQ1     S2DQ2  ...  S2DQ6C2    S2DQ19    S2DQ21
cluster                                    ...
0        22322.458333  3.798367  0.754710  ...      0.0 -0.249312  0.062684
1        20744.190332 -0.229293 -0.109190  ...      0.0  0.104510  0.014298
2         9929.000000  3.798367  5.894922  ...      0.0 -1.391897  0.062684
[3 rows x 9 columns]
                            OLS Regression Results
==============================================================================
Dep. Variable:                  S2AQ3   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.020
Date:                Thu, 11 Mar 2021   Prob (F-statistic):              0.362
Time:                        09:09:47   Log-Likelihood:                -58.230
No. Observations:                 357   AIC:                             122.5
Df Residuals:                     354   BIC:                             134.1
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.1667      0.058     19.981      0.000       1.052       1.281
C(cluster)[T.1]    -0.0821      0.060     -1.357      0.176      -0.201       0.037
C(cluster)[T.2]    -0.1667      0.211     -0.792      0.429      -0.581       0.247
==============================================================================
Omnibus:                      219.273   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1055.340
Skew:                           2.850   Prob(JB):                    6.85e-230
Kurtosis:                       9.201   Cond. No.                         19.3
==============================================================================
means for Drink in last 12 months by cluster
            S2AQ3
cluster
0        1.166667
1        1.084592
2        1.000000
standard deviations for Drink in last 12 months by cluster
            S2AQ3
cluster
0        0.380693
1        0.278695
2        0.000000
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower   upper  reject
---------------------------------------------------
     0      1  -0.0821 0.3657 -0.2244  0.0603  False
     0      2  -0.1667 0.6912 -0.6622  0.3288  False
     1      2  -0.0846 0.9    -0.5621  0.3929  False
---------------------------------------------------
Code:
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 9 14:03:32 2021

@author: Justin
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

"""
Data Management
"""
data = pd.read_csv("nesarc_pds.csv")
#upper-case all DataFrame column names data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
cluster=data_clean[['S2DQ1', 'S2DQ2','S2DQ3C2', 'S2DQ4C2', 'S2DQ5C2','S2DQ6C2','S2DQ19', 'S2DQ21' ]]
cluster.describe()

clustervar = cluster.copy()
clustervar['S2DQ1'] = preprocessing.scale(clustervar['S2DQ1'].astype('float64'))
clustervar['S2DQ2'] = preprocessing.scale(clustervar['S2DQ2'].astype('float64'))
clustervar['S2DQ3C2'] = preprocessing.scale(clustervar['S2DQ3C2'].astype('float64'))
clustervar['S2DQ4C2'] = preprocessing.scale(clustervar['S2DQ4C2'].astype('float64'))
clustervar['S2DQ5C2'] = preprocessing.scale(clustervar['S2DQ5C2'].astype('float64'))
clustervar['S2DQ6C2'] = preprocessing.scale(clustervar['S2DQ6C2'].astype('float64'))
clustervar['S2DQ19'] = preprocessing.scale(clustervar['S2DQ19'].astype('float64'))
clustervar['S2DQ21'] = preprocessing.scale(clustervar['S2DQ21'].astype('float64'))
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model3.labels_)
newlist = dict(zip(cluslist, labels))
newlist
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']
newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
drink_data=data_clean['S2AQ3']
drink_train, drink_test = train_test_split(drink_data, test_size=.3, random_state=123)
drink_train1 = pd.DataFrame(drink_train)
drink_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(drink_train1, merged_train, on='index')
sub1 = merged_train_all[['S2AQ3', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
drinkmod = smf.ols(formula='S2AQ3 ~ C(cluster)', data=sub1).fit()
print (drinkmod.summary())
print ('means for Drink in last 12 months by cluster')
means1 = sub1.groupby('cluster').mean()
print(means1)
print ('standard deviations for Drink in last 12 months by cluster')
means2 = sub1.groupby('cluster').std()
print(means2)
mc1 = multi.MultiComparison(sub1['S2AQ3'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Lasso Regression
Below is the lasso regression I ran on family history and alcoholism. I chose adoptive father as the target, and for the predictor variables I chose natural family members (mother, father, brother, sister, grandmother, and grandfather). The code uses least angle regression with k-fold cross validation (k=10), so 10 random folds are used to choose the statistical model. Predictors whose regression coefficients shrank to 0 under the lasso penalty were removed. Out of the 8 variables, only 2 were selected in the final model and 6 were removed. The variables selected were natural mother and full sister.
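As a quick illustration of that shrinkage step, here is a minimal sketch for splitting the predictors by whether their coefficient survived the penalty, assuming the model and predictors objects from the script at the end of this post:

# Pair each predictor with its lasso coefficient, then separate the
# ones shrunk all the way to zero from the ones that were kept.
coefs = dict(zip(predictors.columns, model.coef_))
selected = [name for name, c in coefs.items() if c != 0]
removed = [name for name, c in coefs.items() if c == 0]
print('selected:', selected)   # per the output below: S2DQ2, S2DQ4C2
print('removed:', removed)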
S2DQ1      0.0
S2DQ3C2    0.0
S2DQ5C2    0.0
S2DQ6C2    0.0
S2DQ19     0.0
S2DQ21     0.0
S2DQ2     -0.004855304992687545
S2DQ4C2   -0.052779550407669074
None of the variables have a strong positive association with adoptive father, but natural mother and full sister are negatively associated with it. Natural mother has a small negative coefficient (-0.005) and full sister (-0.053). All of this is displayed on the Regression Coefficients Progression for Lasso Paths plot, where you can also see when each variable entered the model, with red being natural mother and orange being full sister.
The next plot shows the mean squared error on each fold. There is an initial change in the mean squared error, and the folds do not all follow the same pattern: some quickly increase while others quickly decrease. The plot levels off quickly, and adding more predictors has little further effect on the mean squared error.
The training mean squared error is 0.008 and the test mean squared error is 0.37. The test error is considerably larger than the training error, which suggests the model may not be very stable when applied to new data.
training data MSE
0.007921015697569204
test data MSE
0.3671288141656346
The training data R-square value is 0.29 and the test R-square value is 0.14, meaning the model explains 29% of the variance in the target for the training data and 14% for the test data.
training data R-square
0.2850385767425656
test data R-square
0.13606823548436742
Code:
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
#os.chdir("C:\Trees")
#Data and Cleaning
data = pd.read_csv("nesarc_pds.csv")
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()
target = data_clean['S2DQ14A']

#select predictor variables and target variable as separate data sets
predvar = data_clean[['S2DQ1', 'S2DQ2', 'S2DQ3C2', 'S2DQ4C2', 'S2DQ5C2', 'S2DQ6C2', 'S2DQ19', 'S2DQ21']]

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
predictors['S2DQ1'] = preprocessing.scale(predictors['S2DQ1'].astype('float64'))
predictors['S2DQ2'] = preprocessing.scale(predictors['S2DQ2'].astype('float64'))
predictors['S2DQ3C2'] = preprocessing.scale(predictors['S2DQ3C2'].astype('float64'))
predictors['S2DQ4C2'] = preprocessing.scale(predictors['S2DQ4C2'].astype('float64'))
predictors['S2DQ5C2'] = preprocessing.scale(predictors['S2DQ5C2'].astype('float64'))
predictors['S2DQ6C2'] = preprocessing.scale(predictors['S2DQ6C2'].astype('float64'))
predictors['S2DQ19'] = preprocessing.scale(predictors['S2DQ19'].astype('float64'))
predictors['S2DQ21'] = preprocessing.scale(predictors['S2DQ21'].astype('float64'))
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

res_dict = dict(zip(predictors.columns, model.coef_))
pred_dict = dict(sorted(res_dict.items(), key=lambda x: abs(x[1])))
print(pred_dict)

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Classification Tree
Summary:
Below is the classification tree I created using graphviz and pydotplus. As discussed in the class video, because large trees are hard to understand and cannot be pruned, I used only 2 explanatory variables in this assignment; my original tree was too large. The training sample has 25,855 rows and 2 explanatory variables, and the test sample has 17,238 rows and 2 explanatory variables. The target is S2AQ3 ("had a drink in the last 12 months"). The tree starts with a split on the 1st explanatory variable (x0, alcoholic father). If x0 <= 1.5 is true, you move left to the 2nd explanatory variable (x1, alcoholic mother); this branch covers 4,952 of the 25,855 samples, and you can continue down the tree from there. Going back to the first split, if x0 <= 1.5 is false (x0 > 1.5), you move right to the 2nd explanatory variable (x1, alcoholic mother), which covers 20,903 of the 25,855 samples. Following the same procedure, we can continue down this path of the decision tree.
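As an aside, sklearn can also print a fitted tree as plain text, which can be easier to follow than the x0/x1 labels in the graphviz image. A minimal sketch, assuming the classifier fit in the code below (the readable feature labels are my own):

from sklearn.tree import export_text

# Render the fitted tree with names for the two predictors:
# S2DQ1 (alcoholic father) and S2DQ2 (alcoholic mother).
tree_rules = export_text(classifier, feature_names=['S2DQ1_father', 'S2DQ2_mother'])
print(tree_rules)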
Code:
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\Trees")
#Data and Cleaning
data = pd.read_csv("nesarc_pds.csv")
data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2']]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)
#Build model
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image, display
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
print(out.getvalue())
graph=pydotplus.graph_from_dot_data(out.getvalue())
# Image
Img = Image(graph.create_png())
display(Img)
Output:
(25855, 2)
(17238, 2)
(25855,)
(17238,)
Random Forest ** Classification tree assignment is in previous post below**
I performed a random forest analysis below. Like the decision tree, I used the explanatory variables alcoholic father, alcoholic mother, alcoholic brother, and alcoholic sister, with whether the individual had drinks in the last 12 months as the target. Alcoholic father had the highest importance score at 0.478, and alcoholic brother had the lowest at 0.14. The accuracy of the random forest with 15 trees is 0.907, or roughly 91%. Adding trees beyond the first couple increased accuracy only a small amount (approximately 2%). Below is a short sketch for reading the importance scores, followed by the code used to generate the random forest.
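A minimal sketch for matching each predictor to its importance score, assuming the fitted ExtraTreesClassifier (model) and the predictors frame from the script below:

# Print each predictor next to its importance score so the raw
# feature_importances_ array is easier to read.
for name, score in zip(predictors.columns, model.feature_importances_):
    print(name, round(score, 3))
# S2DQ1 (alcoholic father) should come out highest, around 0.478.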
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
#os.chdir("C:\Trees")
#Data and Cleaning
AH_data = pd.read_csv("nesarc_pds.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2','S2DQ3C2','S2DQ4C2' ]]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=15)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
trees = range(1, 16)
accuracy = np.zeros(15)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=trees[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
Classification Tree
Below is the classification tree I created using graphviz and pydotplus. The predictors used are mother, father, brother, and sister with alcohol issues, and the target is having had a drink in the last 12 months. Due to the complexity of the tree, I removed brother and sister to better explain it, and I posted that smaller tree below as well. Looking at the smaller tree, there were 25,855 samples. If the split value (the predictor for mother or father having alcohol issues) is less than 1.5, you move to the left (true); since 1 is yes and 2 is no, this means they have an issue. This branch has 4,958 samples. If your value is greater than 1.5, you move to the right (false); that branch has 20,897 samples. You can continue to follow the tree the same way.
Below is the code used for the larger tree (for the smaller one I just removed brother and sister).
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\Trees")
#Data and Cleaning
AH_data = pd.read_csv("nesarc_pds.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2','S2DQ3C2','S2DQ4C2' ]]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image, display
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
print(out.getvalue())
graph=pydotplus.graph_from_dot_data(out.getvalue())
# Image
Img = Image(graph.create_png())
display(Img)
Output:
digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 1.5\ngini = 0.468\nsamples = 25855\nvalue = [16219, 9617, 19]"] ;
1 [label="X[2] <= 1.5\ngini = 0.434\nsamples = 4904\nvalue = [3343, 1560, 1]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[3] <= 1.5\ngini = 0.457\nsamples = 1488\nvalue = [962, 526, 0]"] ;
1 -> 2 ;
3 [label="X[1] <= 5.5\ngini = 0.475\nsamples = 334\nvalue = [204, 130, 0]"] ;
2 -> 3 ;
4 [label="X[1] <= 1.5\ngini = 0.474\nsamples = 331\nvalue = [203, 128, 0]"] ;
3 -> 4 ;
5 [label="gini = 0.476\nsamples = 105\nvalue = [64, 41, 0]"] ;
4 -> 5 ;
6 [label="gini = 0.474\nsamples = 226\nvalue = [139, 87, 0]"] ;
4 -> 6 ;
7 [label="gini = 0.444\nsamples = 3\nvalue = [1, 2, 0]"] ;
3 -> 7 ;
8 [label="X[3] <= 5.5\ngini = 0.451\nsamples = 1154\nvalue = [758, 396, 0]"] ;
2 -> 8 ;
9 [label="X[1] <= 1.5\ngini = 0.45\nsamples = 1145\nvalue = [754, 391, 0]"] ;
8 -> 9 ;
10 [label="gini = 0.434\nsamples = 195\nvalue = [133, 62, 0]"] ;
9 -> 10 ;
11 [label="X[1] <= 5.5\ngini = 0.453\nsamples = 950\nvalue = [621, 329, 0]"] ;
9 -> 11 ;
12 [label="gini = 0.453\nsamples = 941\nvalue = [615, 326, 0]"] ;
11 -> 12 ;
13 [label="gini = 0.444\nsamples = 9\nvalue = [6, 3, 0]"] ;
11 -> 13 ;
14 [label="X[1] <= 5.5\ngini = 0.494\nsamples = 9\nvalue = [4, 5, 0]"] ;
8 -> 14 ;
15 [label="gini = 0.469\nsamples = 8\nvalue = [3, 5, 0]"] ;
14 -> 15 ;
16 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
14 -> 16 ;
17 [label="X[2] <= 5.5\ngini = 0.423\nsamples = 3416\nvalue = [2381, 1034, 1]"] ;
1 -> 17 ;
18 [label="X[1] <= 5.5\ngini = 0.421\nsamples = 3366\nvalue = [2352, 1013, 1]"] ;
17 -> 18 ;
19 [label="X[3] <= 1.5\ngini = 0.422\nsamples = 3331\nvalue = [2326, 1004, 1]"] ;
18 -> 19 ;
20 [label="X[1] <= 1.5\ngini = 0.433\nsamples = 316\nvalue = [216, 100, 0]"] ;
19 -> 20 ;
21 [label="gini = 0.426\nsamples = 78\nvalue = [54, 24, 0]"] ;
20 -> 21 ;
22 [label="gini = 0.435\nsamples = 238\nvalue = [162, 76, 0]"] ;
20 -> 22 ;
23 [label="X[1] <= 1.5\ngini = 0.42\nsamples = 3015\nvalue = [2110, 904, 1]"] ;
19 -> 23 ;
24 [label="X[3] <= 5.5\ngini = 0.416\nsamples = 385\nvalue = [272, 112, 1]"] ;
23 -> 24 ;
25 [label="gini = 0.418\nsamples = 383\nvalue = [270, 112, 1]"] ;
24 -> 25 ;
26 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0, 0]"] ;
24 -> 26 ;
27 [label="X[3] <= 5.5\ngini = 0.421\nsamples = 2630\nvalue = [1838, 792, 0]"] ;
23 -> 27 ;
28 [label="gini = 0.421\nsamples = 2622\nvalue = [1833, 789, 0]"] ;
27 -> 28 ;
29 [label="gini = 0.469\nsamples = 8\nvalue = [5, 3, 0]"] ;
27 -> 29 ;
30 [label="X[3] <= 1.5\ngini = 0.382\nsamples = 35\nvalue = [26, 9, 0]"] ;
18 -> 30 ;
31 [label="gini = 0.0\nsamples = 3\nvalue = [3, 0, 0]"] ;
30 -> 31 ;
32 [label="gini = 0.404\nsamples = 32\nvalue = [23, 9, 0]"] ;
30 -> 32 ;
33 [label="X[3] <= 1.5\ngini = 0.487\nsamples = 50\nvalue = [29, 21, 0]"] ;
17 -> 33 ;
34 [label="X[1] <= 1.5\ngini = 0.278\nsamples = 6\nvalue = [5, 1, 0]"] ;
33 -> 34 ;
35 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
34 -> 35 ;
36 [label="gini = 0.32\nsamples = 5\nvalue = [4, 1, 0]"] ;
34 -> 36 ;
37 [label="X[1] <= 1.5\ngini = 0.496\nsamples = 44\nvalue = [24, 20, 0]"] ;
33 -> 37 ;
38 [label="X[3] <= 5.5\ngini = 0.42\nsamples = 10\nvalue = [7, 3, 0]"] ;
37 -> 38 ;
39 [label="gini = 0.278\nsamples = 6\nvalue = [5, 1, 0]"] ;
38 -> 39 ;
40 [label="gini = 0.5\nsamples = 4\nvalue = [2, 2, 0]"] ;
38 -> 40 ;
41 [label="X[1] <= 5.5\ngini = 0.5\nsamples = 34\nvalue = [17, 17, 0]"] ;
37 -> 41 ;
42 [label="X[3] <= 5.5\ngini = 0.499\nsamples = 31\nvalue = [16, 15, 0]"] ;
41 -> 42 ;
43 [label="gini = 0.497\nsamples = 24\nvalue = [13, 11, 0]"] ;
42 -> 43 ;
44 [label="gini = 0.49\nsamples = 7\nvalue = [3, 4, 0]"] ;
42 -> 44 ;
45 [label="gini = 0.444\nsamples = 3\nvalue = [1, 2, 0]"] ;
41 -> 45 ;
46 [label="X[1] <= 1.5\ngini = 0.474\nsamples = 20951\nvalue = [12876, 8057, 18]"] ;
0 -> 46 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
47 [label="X[2] <= 1.5\ngini = 0.381\nsamples = 621\nvalue = [462, 159, 0]"] ;
46 -> 47 ;
48 [label="X[0] <= 5.5\ngini = 0.426\nsamples = 133\nvalue = [92, 41, 0]"] ;
47 -> 48 ;
49 [label="X[3] <= 5.5\ngini = 0.394\nsamples = 115\nvalue = [84, 31, 0]"] ;
48 -> 49 ;
50 [label="X[3] <= 1.5\ngini = 0.388\nsamples = 114\nvalue = [84, 30, 0]"] ;
49 -> 50 ;
51 [label="gini = 0.397\nsamples = 33\nvalue = [24, 9, 0]"] ;
50 -> 51 ;
52 [label="gini = 0.384\nsamples = 81\nvalue = [60, 21, 0]"] ;
50 -> 52 ;
53 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
49 -> 53 ;
54 [label="X[3] <= 1.5\ngini = 0.494\nsamples = 18\nvalue = [8, 10, 0]"] ;
48 -> 54 ;
55 [label="gini = 0.0\nsamples = 4\nvalue = [0, 4, 0]"] ;
54 -> 55 ;
56 [label="gini = 0.49\nsamples = 14\nvalue = [8, 6, 0]"] ;
54 -> 56 ;
57 [label="X[2] <= 5.5\ngini = 0.367\nsamples = 488\nvalue = [370, 118, 0]"] ;
47 -> 57 ;
58 [label="X[0] <= 5.5\ngini = 0.368\nsamples = 485\nvalue = [367, 118, 0]"] ;
57 -> 58 ;
59 [label="X[3] <= 1.5\ngini = 0.363\nsamples = 391\nvalue = [298, 93, 0]"] ;
58 -> 59 ;
60 [label="gini = 0.334\nsamples = 52\nvalue = [41, 11, 0]"] ;
59 -> 60 ;
61 [label="gini = 0.367\nsamples = 339\nvalue = [257, 82, 0]"] ;
59 -> 61 ;
62 [label="X[3] <= 1.5\ngini = 0.39\nsamples = 94\nvalue = [69, 25, 0]"] ;
58 -> 62 ;
63 [label="gini = 0.48\nsamples = 10\nvalue = [6, 4, 0]"] ;
62 -> 63 ;
64 [label="X[3] <= 5.5\ngini = 0.375\nsamples = 84\nvalue = [63, 21, 0]"] ;
62 -> 64 ;
65 [label="gini = 0.378\nsamples = 83\nvalue = [62, 21, 0]"] ;
64 -> 65 ;
66 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
64 -> 66 ;
67 [label="gini = 0.0\nsamples = 3\nvalue = [3, 0, 0]"] ;
57 -> 67 ;
68 [label="X[1] <= 5.5\ngini = 0.476\nsamples = 20330\nvalue = [12414, 7898, 18]"] ;
46 -> 68 ;
69 [label="X[2] <= 5.5\ngini = 0.474\nsamples = 19625\nvalue = [12065, 7550, 10]"] ;
68 -> 69 ;
70 [label="X[2] <= 1.5\ngini = 0.474\nsamples = 19443\nvalue = [11972, 7463, 8]"] ;
69 -> 70 ;
71 [label="X[3] <= 1.5\ngini = 0.485\nsamples = 2044\nvalue = [1200, 843, 1]"] ;
70 -> 71 ;
72 [label="X[0] <= 5.5\ngini = 0.479\nsamples = 285\nvalue = [172, 113, 0]"] ;
71 -> 72 ;
73 [label="gini = 0.482\nsamples = 261\nvalue = [155, 106, 0]"] ;
72 -> 73 ;
74 [label="gini = 0.413\nsamples = 24\nvalue = [17, 7, 0]"] ;
72 -> 74 ;
75 [label="X[0] <= 5.5\ngini = 0.486\nsamples = 1759\nvalue = [1028, 730, 1]"] ;
71 -> 75 ;
76 [label="X[3] <= 5.5\ngini = 0.486\nsamples = 1686\nvalue = [988, 697, 1]"] ;
75 -> 76 ;
77 [label="gini = 0.486\nsamples = 1679\nvalue = [984, 694, 1]"] ;
76 -> 77 ;
78 [label="gini = 0.49\nsamples = 7\nvalue = [4, 3, 0]"] ;
76 -> 78 ;
79 [label="gini = 0.495\nsamples = 73\nvalue = [40, 33, 0]"] ;
75 -> 79 ;
80 [label="X[3] <= 5.5\ngini = 0.472\nsamples = 17399\nvalue = [10772, 6620, 7]"] ;
70 -> 80 ;
81 [label="X[0] <= 5.5\ngini = 0.472\nsamples = 17361\nvalue = [10745, 6609, 7]"] ;
80 -> 81 ;
82 [label="X[3] <= 1.5\ngini = 0.472\nsamples = 16719\nvalue = [10337, 6377, 5]"] ;
81 -> 82 ;
83 [label="gini = 0.478\nsamples = 505\nvalue = [306, 199, 0]"] ;
82 -> 83 ;
84 [label="gini = 0.472\nsamples = 16214\nvalue = [10031, 6178, 5]"] ;
82 -> 84 ;
85 [label="X[3] <= 1.5\ngini = 0.466\nsamples = 642\nvalue = [408, 232, 2]"] ;
81 -> 85 ;
86 [label="gini = 0.476\nsamples = 23\nvalue = [14, 9, 0]"] ;
85 -> 86 ;
87 [label="gini = 0.465\nsamples = 619\nvalue = [394, 223, 2]"] ;
85 -> 87 ;
88 [label="X[0] <= 5.5\ngini = 0.411\nsamples = 38\nvalue = [27, 11, 0]"] ;
80 -> 88 ;
89 [label="gini = 0.4\nsamples = 29\nvalue = [21, 8, 0]"] ;
88 -> 89 ;
90 [label="gini = 0.444\nsamples = 9\nvalue = [6, 3, 0]"] ;
88 -> 90 ;
91 [label="X[0] <= 5.5\ngini = 0.51\nsamples = 182\nvalue = [93, 87, 2]"] ;
69 -> 91 ;
92 [label="X[3] <= 1.5\ngini = 0.511\nsamples = 155\nvalue = [81, 72, 2]"] ;
91 -> 92 ;
93 [label="gini = 0.408\nsamples = 7\nvalue = [5, 2, 0]"] ;
92 -> 93 ;
94 [label="X[3] <= 5.5\ngini = 0.512\nsamples = 148\nvalue = [76, 70, 2]"] ;
92 -> 94 ;
95 [label="gini = 0.5\nsamples = 68\nvalue = [35, 33, 0]"] ;
94 -> 95 ;
96 [label="gini = 0.523\nsamples = 80\nvalue = [41, 37, 2]"] ;
94 -> 96 ;
97 [label="X[3] <= 1.5\ngini = 0.494\nsamples = 27\nvalue = [12, 15, 0]"] ;
91 -> 97 ;
98 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
97 -> 98 ;
99 [label="X[3] <= 5.5\ngini = 0.497\nsamples = 26\nvalue = [12, 14, 0]"] ;
97 -> 99 ;
100 [label="gini = 0.48\nsamples = 10\nvalue = [6, 4, 0]"] ;
99 -> 100 ;
101 [label="gini = 0.469\nsamples = 16\nvalue = [6, 10, 0]"] ;
99 -> 101 ;
102 [label="X[0] <= 5.5\ngini = 0.511\nsamples = 705\nvalue = [349, 348, 8]"] ;
68 -> 102 ;
103 [label="X[3] <= 5.5\ngini = 0.483\nsamples = 76\nvalue = [45, 31, 0]"] ;
102 -> 103 ;
104 [label="X[2] <= 5.5\ngini = 0.479\nsamples = 68\nvalue = [41, 27, 0]"] ;
103 -> 104 ;
105 [label="X[2] <= 1.5\ngini = 0.481\nsamples = 67\nvalue = [40, 27, 0]"] ;
104 -> 105 ;
106 [label="X[3] <= 1.5\ngini = 0.496\nsamples = 11\nvalue = [6, 5, 0]"] ;
105 -> 106 ;
107 [label="gini = 0.444\nsamples = 3\nvalue = [2, 1, 0]"] ;
106 -> 107 ;
108 [label="gini = 0.5\nsamples = 8\nvalue = [4, 4, 0]"] ;
106 -> 108 ;
109 [label="X[3] <= 1.5\ngini = 0.477\nsamples = 56\nvalue = [34, 22, 0]"] ;
105 -> 109 ;
110 [label="gini = 0.444\nsamples = 3\nvalue = [2, 1, 0]"] ;
109 -> 110 ;
111 [label="gini = 0.478\nsamples = 53\nvalue = [32, 21, 0]"] ;
109 -> 111 ;
112 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
104 -> 112 ;
113 [label="X[2] <= 5.5\ngini = 0.5\nsamples = 8\nvalue = [4, 4, 0]"] ;
103 -> 113 ;
114 [label="gini = 0.5\nsamples = 2\nvalue = [1, 1, 0]"] ;
113 -> 114 ;
115 [label="gini = 0.5\nsamples = 6\nvalue = [3, 3, 0]"] ;
113 -> 115 ;
116 [label="X[3] <= 1.5\ngini = 0.512\nsamples = 629\nvalue = [304, 317, 8]"] ;
102 -> 116 ;
117 [label="X[2] <= 5.5\ngini = 0.469\nsamples = 8\nvalue = [3, 5, 0]"] ;
116 -> 117 ;
118 [label="X[2] <= 1.5\ngini = 0.408\nsamples = 7\nvalue = [2, 5, 0]"] ;
117 -> 118 ;
119 [label="gini = 0.5\nsamples = 4\nvalue = [2, 2, 0]"] ;
118 -> 119 ;
120 [label="gini = 0.0\nsamples = 3\nvalue = [0, 3, 0]"] ;
118 -> 120 ;
121 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
117 -> 121 ;
122 [label="X[2] <= 1.5\ngini = 0.512\nsamples = 621\nvalue = [301, 312, 8]"] ;
116 -> 122 ;
123 [label="X[3] <= 5.5\ngini = 0.49\nsamples = 14\nvalue = [8, 6, 0]"] ;
122 -> 123 ;
124 [label="gini = 0.486\nsamples = 12\nvalue = [7, 5, 0]"] ;
123 -> 124 ;
125 [label="gini = 0.5\nsamples = 2\nvalue = [1, 1, 0]"] ;
123 -> 125 ;
126 [label="X[3] <= 5.5\ngini = 0.513\nsamples = 607\nvalue = [293, 306, 8]"] ;
122 -> 126 ;
127 [label="X[2] <= 5.5\ngini = 0.499\nsamples = 174\nvalue = [83, 91, 0]"] ;
126 -> 127 ;
128 [label="gini = 0.5\nsamples = 156\nvalue = [77, 79, 0]"] ;
127 -> 128 ;
129 [label="gini = 0.444\nsamples = 18\nvalue = [6, 12, 0]"] ;
127 -> 129 ;
130 [label="X[2] <= 5.5\ngini = 0.518\nsamples = 433\nvalue = [210, 215, 8]"] ;
126 -> 130 ;
131 [label="gini = 0.494\nsamples = 9\nvalue = [5, 4, 0]"] ;
130 -> 131 ;
132 [label="gini = 0.518\nsamples = 424\nvalue = [205, 211, 8]"] ;
130 -> 132 ;
}
Full tree output above (mother, father, brother, sister):
Reduced tree used for the explanation (mother and father only; brother and sister removed):
Logistic Regression
Below is the code and output for the logistic regression testing I am doing for this project. Since I am looking at family history and alcoholism, I am looking at individuals whose fathers and grandfathers were alcoholics, whether the individual had a drink in the last 12 months, and whether the individual had school or work issues.
The first test is fathers and grandfathers who had drinking problems versus whether the individual had a drink in the last 12 months. This test had 32,588 observations. The regression is significant, with a p value of 9.120e-43. The coefficient of the intercept is -1.0617 and the coefficient of S2AQ3 (had a drink in the last 12 months) is 0.3495, so crit1 = -1.06 + 0.35 * S2AQ3. Looking at the odds ratio for this test, individuals who drank at least 1 drink in the last 12 months are 1.4 times more likely to have a father and grandfather with alcohol issues. There is 95% certainty that the true population odds ratio falls between 1.35 and 1.49 (OR=1.42, 95% CI = 1.35-1.49, p=.0001).
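The odds ratio and its confidence interval come directly from exponentiating the coefficient and its confidence bounds; a quick check of that arithmetic:

import numpy as np

# exp(coefficient) gives the odds ratio; exponentiating the 95%
# confidence bounds of the coefficient gives the CI for the OR.
print(np.exp(0.3495))                # ~1.42
print(np.exp(0.299), np.exp(0.400))  # ~1.35 to ~1.49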
The second test is fathers and grandfathers who had drinking problems versus whether the individual had issues with school or work. This test had 26,434 observations. The regression is significant, with a p value of 9.114e-91. The coefficient of the intercept is -0.7530 and the coefficient of S2BQ1A20 (had problems with school or work) is 1.4027, so crit1 = -0.7530 + 1.4027 * S2BQ1A20. Looking at the odds ratio for this test, individuals who had school or work issues are 4.07 times more likely to have a father and grandfather with alcohol issues. There is 95% certainty that the true population odds ratio falls between 3.53 and 4.68 (OR=4.07, 95% CI = 3.53-4.68, p=.0001).
In my third test I combined having at least 1 drink in the last 12 months and having school or work problems as predictors of having a father and grandfather with alcohol issues. This test had 26,432 observations and a significant p value of 2.124e-89. The coefficients are: intercept -0.7404, S2AQ3 -0.0157, and S2BQ1A20 1.4006. When both predictors are combined, you will notice the p value of S2AQ3 is no longer significant. Because of this, I adjust by interpreting only the significant odds ratio for S2BQ1A20: individuals with work or school issues are 4.06 times more likely to have a father and grandfather with alcohol issues, even after including at least 1 drink in the model (OR=4.06, 95% CI = 3.52-4.67, p=.0001).
I am able to reject the null hypothesis for the tests above due to the very low p values on each test.
There is confounding in the third test: with S2AQ3 and S2BQ1A20 combined, S2AQ3 is no longer significant (p=0.625), unlike when S2BQ1A20 is not included. With this being the case, as described above, I adjusted the odds interpretation and did not include S2AQ3. A sketch of how these models could be fit follows, then the output as run.
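A minimal sketch of fitting the three models with statsmodels, assuming data is the NESARC DataFrame with crit1, S2AQ3, and S2BQ1A20 recoded to 0/1 as in my earlier posts:

import numpy as np
import statsmodels.formula.api as smf

# Test 1: had a drink in the last 12 months
lreg1 = smf.logit('crit1 ~ S2AQ3', data=data).fit()
print(lreg1.summary())

# Test 2: school or work problems
lreg2 = smf.logit('crit1 ~ S2BQ1A20', data=data).fit()
print(lreg2.summary())

# Test 3: both predictors together
lreg3 = smf.logit('crit1 ~ S2AQ3 + S2BQ1A20', data=data).fit()
print(lreg3.summary())

# odds ratios with 95% confidence intervals for the combined model
conf = lreg3.conf_int()
conf['OR'] = lreg3.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))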
Optimization terminated successfully.
Current function value: 0.610574
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 32588
Model: Logit Df Residuals: 32586
Method: MLE Df Model: 1
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.004700
Time: 15:49:18 Log-Likelihood: -19897.
converged: True LL-Null: -19991.
Covariance Type: nonrobust LLR p-value: 9.120e-43
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.0617 0.021 -50.315 0.000 -1.103 -1.020
S2AQ3 0.3495 0.026 13.577 0.000 0.299 0.400
==============================================================================
Odds Ratios
Intercept 0.345881
S2AQ3 1.418371
dtype: float64
Lower CI Upper CI OR
Intercept 0.331869 0.360485 0.345881
S2AQ3 1.348583 1.491771 1.418371
Optimization terminated successfully.
Current function value: 0.627545
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 26434
Model: Logit Df Residuals: 26432
Method: MLE Df Model: 1
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.01215
Time: 15:49:18 Log-Likelihood: -16589.
converged: True LL-Null: -16793.
Covariance Type: nonrobust LLR p-value: 9.114e-91
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.7530 0.013 -56.142 0.000 -0.779 -0.727
S2BQ1A20 1.4027 0.072 19.538 0.000 1.262 1.543
==============================================================================
Odds Ratios
Intercept 0.470974
S2BQ1A20 4.066108
dtype: float64
Lower CI Upper CI OR
Intercept 0.458756 0.483519 0.470974
S2BQ1A20 3.532385 4.680473 4.066108
Optimization terminated successfully.
Current function value: 0.627559
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 26432
Model: Logit Df Residuals: 26429
Method: MLE Df Model: 2
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.01216
Time: 15:49:18 Log-Likelihood: -16588.
converged: True LL-Null: -16792.
Covariance Type: nonrobust LLR p-value: 2.124e-89
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.7404 0.029 -25.800 0.000 -0.797 -0.684
S2AQ3 -0.0157 0.032 -0.488 0.625 -0.079 0.047
S2BQ1A20 1.4006 0.072 19.477 0.000 1.260 1.542
==============================================================================
Odds Ratios
Intercept 0.476902
S2AQ3 0.984436
S2BQ1A20 4.057551
dtype: float64
Lower CI Upper CI OR
Intercept 0.450818 0.504496 0.476902
S2AQ3 0.924348 1.048431 0.984436
S2BQ1A20 3.524161 4.671671 4.057551
Multiple Regression
In the tests below I am looking at natural family members and their association with drinking at least 1 drink in the last 12 months. I also included situations where the father's father and the mother's father also had drinking problems. I recoded the yes and no responses from 1 and 2 to 1 and 0 and removed values of 9. This test had 43,062 observations with Problem as the dependent variable. The p value is very low, at 8.33e-15. The coefficient of the intercept is 0.4124 and the coefficient of S2AQ3 (had at least 1 drink in the last year) is 0.0591, so the predicted Problem score for someone who drank is 0.4124 + 0.0591 = 0.4715. I can reject the null hypothesis and state there is a significant, positive association between family members with alcohol problems and drinking in the last 12 months.
The next test compared the same family members but looked at the number of drinks, after removing values of 99. This association had 7,891 observations, a low p value of 0.00567, an intercept of 0.5095, and a coefficient of 0.0145 for S2AQ4D (number of drinks), giving Problem = 0.5095 + 0.0145 * S2AQ4D. I can reject the null hypothesis and state there is a significant, positive association between family members with alcohol problems and number of drinks.
I then compared the same family members (Problem) against having work and school issues due to drinking problems. Here I again recoded the yes and no responses to 1 and 0 and removed values of 9. This test had 34,572 observations with Problem as the dependent variable. The p value is very low, at 2.85e-209. The intercept is 0.4663 and the coefficient of S2BQ1A20 (had work and school issues) is 0.7436, giving Problem = 0.4663 + 0.7436 * S2BQ1A20. I can reject the null hypothesis and state there is a significant association between family members with alcohol problems and work and school problems.
Finally, I combined all of the above into one test to check the family association with drinking at least 1 drink in the last 12 months, the number of drinks, and having school issues. This had 7,866 observations with a p value of 2.56e-26. The coefficients are: intercept 0.2517, S2AQ3 0.2517, S2AQ4D 0.0087, and S2BQ1A20 0.5379, giving Problem = 0.2517 + 0.2517 * S2AQ3 + 0.0087 * S2AQ4D + 0.5379 * S2BQ1A20. I can reject the null hypothesis in this case as well.
While I did not find evidence of a confounding variable, I suspect that for work or school issues with drinking (S2BQ1A20), a confounding variable would be the number of drinks (S2AQ4D): as the number of drinks increases, it would have a positive effect on issues at school or work. I feel this is the case because work and school issues have an association with family members with drinking problems, but simply saying that people with alcohol issues will have work and school problems is not correct, because someone who drinks a small number of drinks is not going to have the same issues as someone who drinks large amounts. This is just an assumption I am making.
Looking at the Q-Q plot, I would say the residuals are not normally distributed. While there are points near or on the line, there is a large number of points at approximately -0.75, and above 1 the points sit higher than the line, showing outcomes to the left of the line. This shows the underlying distribution is not normal, but it is not far off. Looking at the influence plot, the majority of points are within +/-1 standardized residuals, but a number of them are above this, so we have some outliers (2+) in the data and a few extreme outliers (3+). A short sketch of reading the regression coefficients follows, then the full code.
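To make the coefficient arithmetic above concrete, here is a minimal sketch of turning the first model's estimates into predicted group means, assuming the reg1 fit from the code below:

# The predicted Problem score for a group is the intercept plus the
# coefficient times the predictor value (0 or 1 after the recode).
b0 = reg1.params['Intercept']   # ~0.4124
b1 = reg1.params['S2AQ3']       # ~0.0591
print('no drink in last 12 months:', b0)      # ~0.412
print('drank in last 12 months:', b0 + b1)    # ~0.471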
Code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2AQ4D'] = pandas.to_numeric(data['S2AQ4D'], errors='coerce')
data['S2BQ1A20'] = pandas.to_numeric(data['S2BQ1A20'], errors='coerce')
#Family members with alcohol issues (recode 1,2 to 0,1 )
# after recoding 9s to missing
recode1 = {1: 1, 2: 0}
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ1']= data['S2DQ1'].map(recode1)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ2']= data['S2DQ2'].map(recode1)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ3C2']= data['S2DQ3C2'].map(recode1)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ4C2']= data['S2DQ4C2'].map(recode1)
data['S2AQ3']=data['S2AQ3'].replace(9, numpy.nan)
data['S2AQ3']= data['S2AQ3'].map(recode1)
data['S2BQ1A20']=data['S2BQ1A20'].replace(9, numpy.nan)
data['S2BQ1A20']= data['S2BQ1A20'].map(recode1)
data['S2AQ4D']=data['S2AQ4D'].replace(99, numpy.nan)
# check recode above
chk1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (chk1)
# sum Problem
data ['Problem'] = numpy.nansum([data['S2DQ1'], data['S2DQ2'], data['S2DQ3C2'], data['S2DQ4C2']], axis=0)
# check sum
chksum=data[['S2DQ1', 'S2DQ2', 'S2DQ3C2','S2DQ4C2', 'Problem']]
chksum.head(n=50)
chk2 = data['Problem'].value_counts(sort=False, dropna=False)
print (chk2)
#adjust counts if fathers father was Alcoholic
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ11'] = pandas.to_numeric(data['S2DQ11'], errors='coerce')
def crit1(row):
    if row['S2DQ1'] == 1 or row['S2DQ11'] == 1:
        return 1
    elif row['S2DQ1'] == 2 and row['S2DQ11'] == 2:
        return 0

data['crit1'] = data.apply(lambda row: crit1(row), axis=1)
chk3 = data['crit1'].value_counts(sort=False, dropna=False)
print (chk3)
#adjust counts if mothers father was alc
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ21'] = pandas.to_numeric(data['S2DQ21'], errors='coerce')
def crit2(row):
    if row['S2DQ2'] == 1 or row['S2DQ21'] == 1:
        return 1
    elif row['S2DQ2'] == 2 and row['S2DQ21'] == 2:
        return 0

data['crit2'] = data.apply(lambda row: crit2(row), axis=1)
chk4 = data['crit2'].value_counts(sort=False, dropna=False)
print (chk4)
#Linear Regression Model
reg1 = smf.ols('Problem ~ S2AQ3', data=data).fit()
print (reg1.summary())
# calculating means for regression observations
sub1 = data[['Problem', 'S2AQ3']].dropna()
# group means & std deviation
print ("Mean")
ds1 = sub1.groupby('S2AQ3').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('S2AQ3').std()
print (ds2)
#bivariate bar graph
seaborn.catplot(x="S2AQ3", y="Problem", data=sub1, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number drank atleast 1 in last 12 months')
#Part 2
#Linear Regression Model
reg2 = smf.ols('Problem ~ S2AQ4D', data=data).fit()
print (reg2.summary())
# calculating means for regression observations
sub2 = data[['Problem', 'S2AQ4D']].dropna()
# group means & std deviation
print ("Mean")
ds3 = sub2.groupby('S2AQ4D').mean()
print (ds3)
print ("Standard deviation")
ds4 = sub2.groupby('S2AQ4D').std()
print (ds4)
#bivariate bar graph
seaborn.catplot(x="S2AQ4D", y="Problem", data=sub2, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number dranks on dat drinking in last 12 months')
sub2['S2AQ4D_c'] = (sub2['S2AQ4D'] - sub2['S2AQ4D'].mean())
print (sub2['S2AQ4D_c'].mean())
#Part 3 S2BQ1A20
#Linear Regression Model
reg3 = smf.ols('Problem ~ S2BQ1A20', data=data).fit()
print (reg3.summary())
# calculating means for regression observations
sub3 = data[['Problem', 'S2BQ1A20']].dropna()
# group means & std deviation
print ("Mean")
ds5 = sub3.groupby('S2BQ1A20').mean()
print (ds5)
print ("Standard deviation")
ds6 = sub3.groupby('S2BQ1A20').std()
print (ds6)
#bivariate bar graph
seaborn.catplot(x="S2BQ1A20", y="Problem", data=sub3, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number have job or school problems')
#Linear Regression Model
reg4 = smf.ols('Problem ~ S2AQ3 + S2AQ4D + S2BQ1A20', data=data).fit()
print (reg4.summary())
# calculating means for regression observations
sub4 = data[['S2AQ3', 'S2AQ4D', 'S2BQ1A20']].dropna()
#Linear Regression Model
reg5 = smf.ols('Problem ~ S2AQ4D_c' , data=sub2).fit()
print (reg5.summary())
#Q-Q
fig4=sm.qqplot(reg5.resid, line='r')
# simple plot of residuals
stdres=pandas.DataFrame(reg5.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
# leverage plot
fig5=sm.graphics.influence_plot(reg5, size=8)
print(fig5)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
OLS Regression Results
Dep. Variable: Problem R-squared: 0.015
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 59.37
Date: Fri, 22 Jan 2021 Prob (F-statistic): 2.56e-26
Time: 10:28:06 Log-Likelihood: -9480.7
No. Observations: 7866 AIC: 1.897e+04
Df Residuals: 7863 BIC: 1.899e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2517 0.006 39.035 0.000 0.239 0.264
S2AQ3 0.2517 0.006 39.035 0.000 0.239 0.264
S2AQ4D 0.0087 0.005 1.663 0.096 -0.002 0.019
S2BQ1A20 0.5379 0.051 10.540 0.000 0.438 0.638
Omnibus: 2032.059 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4313.010
Skew: 1.520 Prob(JB): 0.00
Kurtosis: 4.979 Cond. No. 1.69e+16
OLS Regression Results
Dep. Variable: Problem R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 7.658
Date: Fri, 22 Jan 2021 Prob (F-statistic): 0.00567
Time: 10:28:06 Log-Likelihood: -9559.0
No. Observations: 7891 AIC: 1.912e+04
Df Residuals: 7889 BIC: 1.914e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.5348 0.009 58.456 0.000 0.517 0.553
S2AQ4D_c 0.0145 0.005 2.767 0.006 0.004 0.025
Omnibus: 2090.081 Durbin-Watson: 2.022
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4509.428
Skew: 1.550 Prob(JB): 0.00
Kurtosis: 5.026 Cond. No. 1.75
Regression modeling - Linear Regression week 2:
Below is the program I wrote to test a basic linear regression. I am looking at natural family members and their association with drinking at least 1 drink in the last 12 months. I also included situations where the father's father and the mother's father had alcohol issues. In the code below you will see where I recoded the yes and no answers from 1 and 2 to 1 and 0, and removed the values of 9. The dependent variable in this situation is Problem, and there were 43,093 observations. The coefficient of the intercept is 0.5279 and the coefficient of S2AQ3 (whether the person had a drink in the last year) is -0.0571, so the predicted value for someone who drank is 0.5279 - 0.0571 = 0.4708. The p value is very small, so I can reject the null hypothesis and state there is a significant association between family member alcoholism and drinking in the last 12 months.
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
#Family members with alcohol issues (recode 1,2 to 0,1 )
# after recoding 9s to missing
recode1 = {1: 1, 2: 0}
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ1']= data['S2DQ1'].map(recode1)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ2']= data['S2DQ2'].map(recode1)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ3C2']= data['S2DQ3C2'].map(recode1)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ4C2']= data['S2DQ4C2'].map(recode1)
data['S2AQ3']=data['S2AQ3'].replace(9, numpy.nan)
data['S2AQ3']= data['S2AQ3'].map(recode1)
# check recode above
chk1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (chk1)
# sum Problem
data ['Problem'] = numpy.nansum([data['S2DQ1'], data['S2DQ2'], data['S2DQ3C2'], data['S2DQ4C2']], axis=0)
# check sum
chksum=data[['S2DQ1', 'S2DQ2', 'S2DQ3C2','S2DQ4C2', 'Problem']]
chksum.head(n=50)
chk2 = data['Problem'].value_counts(sort=False, dropna=False)
print (chk2)
#adjust counts if fathers father was Alcoholic
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ11'] = pandas.to_numeric(data['S2DQ11'], errors='coerce')
def crit1(row):
    if row['S2DQ1'] == 1 or row['S2DQ11'] == 1:
        return 1
    elif row['S2DQ1'] == 2 and row['S2DQ11'] == 2:
        return 0

data['crit1'] = data.apply(lambda row: crit1(row), axis=1)
chk3 = data['crit1'].value_counts(sort=False, dropna=False)
print (chk3)
#adjust counts if mothers father was alc
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ21'] = pandas.to_numeric(data['S2DQ21'], errors='coerce')
def crit2(row):
    if row['S2DQ2'] == 1 or row['S2DQ21'] == 1:
        return 1
    elif row['S2DQ2'] == 2 and row['S2DQ21'] == 2:
        return 0

data['crit2'] = data.apply(lambda row: crit2(row), axis=1)
chk4 = data['crit2'].value_counts(sort=False, dropna=False)
print (chk4)
#Linear Regression Model
reg1 = smf.ols('Problem ~ S2AQ3', data=data).fit()
print (reg1.summary())
# calculating means for regression observations
sub1 = data[['Problem', 'S2AQ3']].dropna()
# group means & std deviation
print ("Mean")
ds1 = sub1.groupby('S2AQ3').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('S2AQ3').std()
print (ds2)
#bivariate bar graph
seaborn.catplot(x="S2AQ3", y="Problem", data=sub1, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number drank atleast 1 in last 12 months')
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
1.000000 8124
0.000000 32445
nan 2524
Name: S2DQ1, dtype: int64
2.000000 3413
1.000000 9059
0.000000 29525
4.000000 182
3.000000 914
Name: Problem, dtype: int64
1.000000 9878
nan 33215
Name: crit1, dtype: int64
nan 40762
1.000000 2331
Name: crit2, dtype: int64
OLS Regression Results
==============================================================================
Dep. Variable: Problem R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 60.30
Date: Fri, 15 Jan 2021 Prob (F-statistic): 8.33e-15
Time: 14:39:11 Log-Likelihood: -49491.
No. Observations: 43062 AIC: 9.899e+04
Df Residuals: 43060 BIC: 9.900e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4124 0.006 68.563 0.000 0.401 0.424
S2AQ3 0.0591 0.008 7.765 0.000 0.044 0.074
==============================================================================
Omnibus: 14467.736 Durbin-Watson: 2.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40856.637
Skew: 1.815 Prob(JB): 0.00
Kurtosis: 6.098 Cond. No. 3.03
==============================================================================
Mean
Problem
S2AQ3
0.000000 0.412447
1.000000 0.471499
Standard deviation
Problem
S2AQ3
0.000000 0.747297
1.000000 0.773293
Part 1:
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a study that included N=43,093 individuals. This study included research on family history of alcoholism. I am using this study data to determine if family history influences alcoholism. The study shows that the majority of natural fathers (approximately 76%) were not alcoholics or problem drinkers. An even larger share of natural mothers (approximately 94%) were not alcoholics or problem drinkers. Full brothers and sisters were also included in the study: approximately 88% of full brothers did not have alcohol issues and 94% of full sisters did not have alcohol issues. I combined the family members into a new grouping I called Family so I can better research how one member compares to the entire family. I am also looking into the quantity (1-98) of alcoholic drinks consumed compared to the size of the drinks based on the timeframe. The study looked at how often coolers were consumed over the last 12 months. The majority of individuals did not answer these questions, so I will need to take this into account. Most individuals who did consume coolers consumed only 1-2 in the last 12 months, with the next largest group consuming monthly. Since the majority of individuals did not consume coolers in the last 12 months, the size of cooler (1-47) consumed was left blank in the study; only 18% of individuals answered this question. I am going to research these metrics to determine if family history as well as consumption has an effect on alcoholism.
Part 2:
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a survey conducted face to face with N=43,093 individuals. The purpose of the survey was to gather detailed measurements of drug and alcohol issues in the US population. The study was conducted in 2 waves across the 50 US states. The first wave took place in 2001-2002. The second wave, a re-interview of N=34,653 of the original respondents, took place in 2004-2005.
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a study that included N=43,093 individuals. This study included research on family history of alcoholism. The study was conducted in 2 waves across the 50 states. The specific target of the survey was U.S. civilian non-institutionalized individuals in households and selected group quarters. The first wave was conducted face to face with individuals aged 18 and over. Blacks, Hispanics, and adults aged 18-24 were oversampled, and this was taken into account. The second wave was a re-interview of N=34,653 of the individuals interviewed in wave 1. I am using this study data to determine if family history influences alcoholism. I am going to be looking into family members, specifically natural mothers, fathers, brothers, and sisters, to determine if it can be hereditary.
Text
Below is the program where I calculate the correlation between the number of coolers one drinks and the size of the coolers. As shown in the scatterplot there is a slight positive association between cooler size and the number consumed, but it appears to level off at larger sizes. The correlation coefficient is low (r ≈ 0.096) and the p value is high (p ≈ 0.13), so the positive association is very weak and not statistically significant.
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import scipy.stats
import seaborn
import scipy
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
#pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
#pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2AQ4CR'] = pandas.to_numeric(data['S2AQ4CR'], errors='coerce')
data['S2AQ4D'] = pandas.to_numeric(data['S2AQ4D'], errors='coerce')
#remove unknown data
data['S2AQ4CR']=data['S2AQ4CR'].replace(99, numpy.nan)
data['S2AQ4D']=data['S2AQ4D'].replace(99, numpy.nan)
scat1 = seaborn.regplot(x="S2AQ4CR", y="S2AQ4D", fit_reg=True, data=data)
plt.xlabel('Cooler Size')
plt.ylabel('Number of coolers consumed')
plt.title('Scatterplot for the association between cooler size and the number of coolers consumed')
#complete cases: dropna() over all columns; dropna(subset=['S2AQ4CR','S2AQ4D']) would retain more rows
data_clean=data.dropna()
print ('Association between cooler size and the number of coolers consumed')
print (scipy.stats.pearsonr(data_clean['S2AQ4CR'], data_clean['S2AQ4D']))
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Results:
Association between cooler size and the number of coolers consumed
(0.09628360010686936, 0.12971994878189647)
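To put that coefficient in context, squaring r gives the proportion of shared variance (value copied from the pearsonr output above):
r = 0.09628360010686936
print(r**2) #about 0.0093: cooler size explains less than 1% of the variance in count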
Text
Below is the program I wrote to compare which family member has a history of alcohol problems against whether the respondent suffered alcohol abuse or dependence within the last 12 months. The variable ALCABDEP12DX takes the values 0 = No Dependence, 1 = Alcohol Abuse Only, 2 = Alcohol Dependence Only, 3 = Alcohol Abuse and Dependence. The Family category is 1 = Father, 2 = Mother, 3 = Brother, 4 = Sister. Given the p value results of the tests below, I reject the null hypothesis that family member and abuse/dependence status are independent. Because the post hoc analysis makes three pairwise comparisons against the no-dependence group, a Bonferroni-adjusted alpha of 0.05/3 ≈ 0.017 applies, and all three p values fall below it.
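The program implements the post hoc comparisons by hand with three recode/crosstab blocks. The same procedure can be sketched as a loop with an explicit Bonferroni adjustment (a sketch only, assuming sub1 has been built as in the program and pandas and scipy.stats are imported):
#post hoc: compare each abuse/dependence level against 0 = no dependence
levels = [1, 2, 3]
alpha = 0.05 / len(levels) #Bonferroni-adjusted alpha, about 0.017 for 3 comparisons
for lvl in levels:
    pair = sub1[sub1['ALCABDEP12DX'].isin([0, lvl])]
    ct = pandas.crosstab(pair['Family'], pair['ALCABDEP12DX'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print('0 vs', lvl, 'p =', p, '-> reject' if p < alpha else '-> fail to reject')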
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'], errors='coerce')
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'], errors='coerce')
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
sub1 = data[['Family', 'ALCABDEP12DX']].dropna()
# contingency table
ct1=pandas.crosstab(sub1['ALCABDEP12DX'], sub1['Family'])
print (ct1)
# percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
sub1["ALCABDEP12DX"] = sub1["ALCABDEP12DX"].astype('category')
# new code for setting variables to numeric:
sub1['Family'] = pandas.to_numeric(sub1['Family'], errors='coerce')
# graph (factorplot was renamed catplot in newer seaborn releases)
seaborn.factorplot(x="ALCABDEP12DX", y="Family", data=sub1, kind="bar", ci=None)
plt.xlabel('Dependence in last 12 months')
plt.ylabel('Family Member')
recode1 = {0: 0, 1: 1}
sub1['COMP1v1']= sub1['ALCABDEP12DX'].map(recode1)
# contingency table 1
ct1=pandas.crosstab(sub1['Family'], sub1['COMP1v1'])
print (ct1)
# percentages 1
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
recode2 = {0: 0, 2: 2}
sub1['COMP1v2']= sub1['ALCABDEP12DX'].map(recode2)
# contingency table 2
ct2=pandas.crosstab(sub1['Family'], sub1['COMP1v2'])
print (ct2)
# percentages 2
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
recode3 = {0: 0, 3: 3}
sub1['COMP1v3']= sub1['ALCABDEP12DX'].map(recode3)
# contingency table 3
ct3=pandas.crosstab(sub1['Family'], sub1['COMP1v3'])
print (ct3)
# percentages 3
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Result:
Family 1.0 2.0 3.0 4.0
ALCABDEP12DX
0 7118 904 3249 818
1 481 68 154 42
2 168 27 42 9
3 357 47 63 21
Family 1.0 2.0 3.0 4.0
ALCABDEP12DX
0 0.876169 0.864245 0.926169 0.919101
1 0.059207 0.065010 0.043900 0.047191
2 0.020679 0.025813 0.011973 0.010112
3 0.043944 0.044933 0.017959 0.023596
chi-square value, p value, expected counts
(90.82581721383129, 1.1115172651424374e-15, 9, array([[7238.43130896, 931.9792158 , 3125.60524764, 792.98422759],
[ 446.07753538, 57.43440448, 192.61939858, 48.86866156],
[ 147.29540094, 18.96491745, 63.60318396, 16.13649764],
[ 292.19575472, 37.62146226, 126.17216981, 32.01061321]]))
COMP1v1 0.0 1.0
Family
1.0 7118 481
2.0 904 68
3.0 3249 154
4.0 818 42
COMP1v1 0.0 1.0
Family
1.0 0.588800 0.645638
2.0 0.074779 0.091275
3.0 0.268757 0.206711
4.0 0.067665 0.056376
chi-square value, p value, expected counts
(17.873277986272203, 0.00046712719994895926, 3, array([[7157.88616176, 441.11383824],
[ 915.57643759, 56.42356241],
[3205.45948262, 197.54051738],
[ 810.07791803, 49.92208197]]))
COMP1v2 0.0 2.0
Family
1.0 7118 168
2.0 904 27
3.0 3249 42
4.0 818 9
COMP1v2 0.0 2.0
Family
1.0 0.588800 0.682927
2.0 0.074779 0.109756
3.0 0.268757 0.170732
4.0 0.067665 0.036585
chi-square value, p value, expected counts
(19.680798192577477, 0.00019765989591122253, 3, array([[7140.69347385, 145.30652615],
[ 912.4328334 , 18.5671666 ],
[3225.36676125, 65.63323875],
[ 810.5069315 , 16.4930685 ]]))
COMP1v3 0.0 3.0
Family
1.0 7118 357
2.0 904 47
3.0 3249 63
4.0 818 21
COMP1v3 0.0 3.0
Family
1.0 0.588800 0.731557
2.0 0.074779 0.096311
3.0 0.268757 0.129098
4.0 0.067665 0.043033
chi-square value, p value, expected counts
(57.96858821503683, 1.5963364821312778e-12, 3, array([[7184.9626302 , 290.0373698 ],
[ 914.10026238, 36.89973762],
[3183.49113461, 128.50886539],
[ 806.44597281, 32.55402719]]))
Text
Below is an analysis of variance for assignment 1. In this analysis I return to my original research question, family history and alcoholism. I chose Family, the categorical variable I created listing father, mother, brother, and sister, as the response, and CONSUMER, which states whether the person is a current drinker, an ex-drinker, or a lifetime abstainer, as the explanatory variable. Since the p value in the OLS output is very low, I can reject the null hypothesis. I also included the means and standard deviations by group. One caveat: the Tukey HSD comparison was run on columns that still contain missing values, which is why its table shows NaN mean differences; a corrected sketch follows the output.
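The OLS summary below reports the overall F test directly; the same test can also be pulled out as a compact ANOVA table (a small sketch, assuming results1 has been fit as in the code that follows):
import statsmodels.api as sm
#type 2 ANOVA table: sum of squares, F statistic and p value for C(CONSUMER)
print(sm.stats.anova_lm(results1, typ=2))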
Code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
model1 = smf.ols(formula='Family ~ C(CONSUMER)', data=data)
results1 = model1.fit()
print (results1.summary())
sub1 = data[['Family', 'CONSUMER']].dropna()
print ('means for Alcohol consumption')
mean= sub1.groupby('CONSUMER').mean()
print (mean)
print ('standard deviations for Alcohol consumption')
sd1 = sub1.groupby('CONSUMER').std()
print (sd1)
# note: these columns still contain NaNs, which is why the Tukey table in the
# output shows NaN mean differences; a corrected sketch follows the output
mc1 = multi.MultiComparison(data['Family'], data['CONSUMER'])
result = mc1.tukeyhsd()
print(result.summary())
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Family R-squared: 0.008
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 54.00
Date: Fri, 27 Nov 2020 Prob (F-statistic): 4.36e-24
Time: 18:32:36 Log-Likelihood: -19695.
No. Observations: 13568 AIC: 3.940e+04
Df Residuals: 13565 BIC: 3.942e+04
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.7350 0.011 158.687 0.000 1.714 1.756
C(CONSUMER)[T.2] 0.0963 0.022 4.328 0.000 0.053 0.140
C(CONSUMER)[T.3] 0.2706 0.027 10.125 0.000 0.218 0.323
==============================================================================
Omnibus: 3369.290 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1845.921
Skew: 0.771 Prob(JB): 0.00
Kurtosis: 2.057 Cond. No. 3.30
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for Alcohol consumption
Family
CONSUMER
1 1.734998
2 1.831338
3 2.005568
standard deviations for Alcohol consumption
Family
CONSUMER
1 1.014713
2 1.047685
3 1.099999
Multiple Comparison of Means - Tukey HSD, FWER=0.05
================================================
group1 group2 meandiff p-adj lower upper reject
------------------------------------------------
1 2 nan 0.5566 nan nan False
1 3 nan 0.5566 nan nan False
2 3 nan 0.5566 nan nan False
------------------------------------------------
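The NaN rows above appear because MultiComparison was handed the unfiltered columns, which still contain missing values. A corrected sketch using the NaN-free subset built in the program (not re-run here, so no output is shown):
#run Tukey HSD on the dropna subset so the mean differences are computable
mc1 = multi.MultiComparison(sub1['Family'], sub1['CONSUMER'])
result = mc1.tukeyhsd()
print(result.summary())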
Text
Below is the program where I did my data management. Each of my selected variables contains unknown codes, which are unusable for my research and add complexity the analysis does not need, so I removed all of the unknown codes from each of my selections. To further assist my research I created a new variable that combines the per-family-member flags. With this I was able to output the counts of parents and siblings who are alcoholics, and I also used the new variable to display the percentages of parents and siblings who are alcoholics.
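As a quick sanity check that the unknown codes really became missing values, the per-column NaN counts can be printed right after the replace() calls (a small sketch against the variables used below):
checks = ['S2DQ1', 'S2DQ2', 'S2DQ3C2', 'S2DQ4C2', 'S2DQ5C1', 'S2DQ6C1']
#each count should match the number of 9/99 unknown responses that were recoded
print(data[checks].isna().sum())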
PROGRAM:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
#counts and percentages with titles and drop NAN data
print ('counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)')
c1 = data['S2DQ1'].value_counts(sort=False, dropna=True)
print (c1)
print ('percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)')
p1 = data['S2DQ1'].value_counts(sort=False, normalize=True, dropna=True)
print (p1)
print ('counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)')
c2 = data['S2DQ2'].value_counts(sort=False, dropna=True)
print(c2)
print ('percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)')
p2 = data['S2DQ2'].value_counts(sort=False, normalize=True, dropna=True)
print (p2)
print ('counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)')
c3 = data['S2DQ3C2'].value_counts(sort=False, dropna=True)
print(c3)
print ('percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)')
p3 = data['S2DQ3C2'].value_counts(sort=False, normalize=True)
print (p3)
print ('counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)')
c4 = data['S2DQ4C2'].value_counts(sort=False, dropna=True)
print(c4)
print ('percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)')
p4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False, normalize=True) #dropna=False here, so the NaN share appears in the output below
print (p4)
print ('counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
c5 = data['S2DQ5C1'].value_counts(sort=True, dropna=True)
print(c5)
print ('percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
p5 = data['S2DQ5C1'].value_counts(sort=True, dropna=True, normalize=True)
print (p5)
print ('counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
c6 = data['S2DQ6C1'].value_counts(sort=True, dropna=True)
print(c6)
print ('percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
p6 = data['S2DQ6C1'].value_counts(sort=True, dropna=True, normalize=True)
print (p6)
#counts and percentage for family member variable created above
print ('Family Counts with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister')
c7 = data['Family'].value_counts(sort=True, dropna=True)
print (c7)
print ('Family Percentage with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister')
p7 = data['Family'].value_counts(sort=True, dropna=True, normalize=True)
print (p7)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
OUTPUT:
counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)
1.000000 8124
2.000000 32445
Name: S2DQ1, dtype: int64
percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)
1.000000 0.200251
2.000000 0.799749
Name: S2DQ1, dtype: float64
counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)
2.000000 39553
1.000000 2311
Name: S2DQ2, dtype: int64
percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)
2.000000 0.944797
1.000000 0.055203
Name: S2DQ2, dtype: float64
counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)
1.000000 6213
2.000000 35766
Name: S2DQ3C2, dtype: int64
percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)
1.000000 0.148003
2.000000 0.851997
Name: S2DQ3C2, dtype: float64
counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)
2.000000 39381
1.000000 2707
Name: S2DQ4C2, dtype: int64
percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)
2.000000 0.913861
1.000000 0.062818
nan 0.023322
Name: S2DQ4C2, dtype: float64
counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0.000000 40412
1.000000 1286
2.000000 406
3.000000 37
4.000000 11
6.000000 4
5.000000 2
7.000000 1
9.000000 1
Name: S2DQ5C1, dtype: int64
percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0.000000 0.958539
1.000000 0.030503
2.000000 0.009630
3.000000 0.000878
4.000000 0.000261
6.000000 0.000095
5.000000 0.000047
7.000000 0.000024
9.000000 0.000024
Name: S2DQ5C1, dtype: float64
counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0.000000 41347
1.000000 555
2.000000 316
3.000000 8
4.000000 4
9.000000 2
5.000000 1
Name: S2DQ6C1, dtype: int64
percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0.000000 0.979021
1.000000 0.013141
2.000000 0.007482
3.000000 0.000189
4.000000 0.000095
9.000000 0.000047
5.000000 0.000024
Name: S2DQ6C1, dtype: float64
Family Counts with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister
1.000000 8124
3.000000 3508
2.000000 1046
4.000000 890
Name: Family, dtype: int64
Family Percentage with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister
1.000000 0.598762
3.000000 0.258550
2.000000 0.077093
4.000000 0.065596
Name: Family, dtype: float64
Text
Below is the program I wrote to start my research on family history and alcoholism. In this initial program I calculated both the counts and percentages of survey responses indicating an alcoholic or problem drinker, separated by natural mother, father, brother, and sister. I also produced counts and percentages for the number of sons and daughters having issues with alcohol. Since there is the possibility of missing data on the number of children, I display missing data by passing dropna=False.
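To illustrate what dropna=False changes, here is its behavior on a small hypothetical Series (made-up values, not NESARC data):
import pandas
import numpy
s = pandas.Series([1, 2, 2, numpy.nan])
print(s.value_counts())              #default dropna=True: the NaN is excluded
print(s.value_counts(dropna=False))  #NaN is shown as its own category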
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#counts and percentages with titles and not dropping missing data
print ('counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)')
c1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (c1)
print ('percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)')
p1 = data['S2DQ1'].value_counts(sort=False, normalize=True, dropna=False)
print (p1)
print ('counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)')
c2 = data['S2DQ2'].value_counts(sort=False, dropna=False)
print(c2)
print ('percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)')
p2 = data['S2DQ2'].value_counts(sort=False, normalize=True, dropna=False)
print (p2)
print ('counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)')
c3 = data['S2DQ3C2'].value_counts(sort=False, dropna=False)
print(c3)
print ('percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)')
p3 = data['S2DQ3C2'].value_counts(sort=False, normalize=True)
print (p3)
print ('counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)')
c4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False)
print(c4)
print ('percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)')
p4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False, normalize=True)
print (p4)
print ('counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
c5 = data['S2DQ5C1'].value_counts(sort=False, dropna=False)
print(c5)
print ('percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
p5 = data['S2DQ5C1'].value_counts(sort=False, dropna=False, normalize=True)
print (p5)
print ('counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
c6 = data['S2DQ6C1'].value_counts(sort=False, dropna=False)
print(c6)
print ('percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
p6 = data['S2DQ6C1'].value_counts(sort=False, dropna=False, normalize=True)
print (p6)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)
1 8124
2 32445
9 2524
Name: S2DQ1, dtype: int64
percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)
1 0.188522
2 0.752907
9 0.058571
Name: S2DQ1, dtype: float64
counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)
1 2311
2 39553
9 1229
Name: S2DQ2, dtype: int64
percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)
1 0.053628
2 0.917852
9 0.028520
Name: S2DQ2, dtype: float64
counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)
1 6213
2 35766
9 1114
Name: S2DQ3C2, dtype: int64
percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)
1 0.144177
2 0.829972
9 0.025851
Name: S2DQ3C2, dtype: float64
counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)
1 2707
2 39381
9 1005
Name: S2DQ4C2, dtype: int64
percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)
1 0.062818
2 0.913861
9 0.023322
Name: S2DQ4C2, dtype: float64
counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0 40412
1 1286
2 406
3 37
99 933
4 11
5 2
6 4
7 1
9 1
Name: S2DQ5C1, dtype: int64
percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0 0.937786
1 0.029842
2 0.009421
3 0.000859
99 0.021651
4 0.000255
5 0.000046
6 0.000093
7 0.000023
9 0.000023
Name: S2DQ5C1, dtype: float64
counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0 41347
1 555
2 316
3 8
99 860
4 4
5 1
9 2
Name: S2DQ6C1, dtype: int64
percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0 0.959483
1 0.012879
2 0.007333
3 0.000186
99 0.019957
4 0.000093
5 0.000023
9 0.000046
Name: S2DQ6C1, dtype: float64