k-means cluster analysis
Below is the k-means cluster analysis I completed on family history of alcoholism using data from the NESARC dataset. I chose the following clustering variables: alcoholic father, mother, brother, sister, son, daughter, married to an alcoholic, and currently living with an alcoholic. The first part is the k-means cluster analysis run over 1-9 clusters (range(1, 10)) using the elbow method, and the plot below displays this output. Looking at the plot, the curve appears to level off between 2 and 4 clusters, and then there is another bend between 5 and 7 clusters where it could be leveling off as well.
The next part of the analysis was a canonical discriminant analysis; in this plot only 3 clusters are displayed (KMeans(n_clusters=3)), using the first 2 canonical variables, which are ordered by the proportion of variance they account for. The dark blue cluster is for the most part packed low on the canonical variables but has a few spread-out points, the yellow cluster is higher and not as tightly packed, and the green cluster is the highest of all, though it contains only 2 points. A quick check of that variance ordering is sketched below.
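The proportion of variance each canonical variable accounts for can be printed from the PCA object. A minimal sketch, assuming the standardized training data (clus_train) built in the code below:

from sklearn.decomposition import PCA

# Fit the same 2-component PCA used for the scatterplot and print the
# share of variance each canonical (principal) component accounts for;
# the first value should be the largest.
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
print(pca_2.explained_variance_ratio_)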
I then used analysis of variance to check whether there are significant differences between clusters, with drinking in the last 12 months as the response variable and cluster assignment as the explanatory variable. Looking at the analysis of variance table below, the clusters did not differ significantly on drinking in the last 12 months.
Clustering variable means by cluster
                index     S2DQ1     S2DQ2  ...  S2DQ6C2    S2DQ19    S2DQ21
cluster                                    ...
0        22322.458333  3.798367  0.754710  ...      0.0 -0.249312  0.062684
1        20744.190332 -0.229293 -0.109190  ...      0.0  0.104510  0.014298
2         9929.000000  3.798367  5.894922  ...      0.0 -1.391897  0.062684
[3 rows x 9 columns]
                            OLS Regression Results
==============================================================================
Dep. Variable:                  S2AQ3   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.020
Date:                Thu, 11 Mar 2021   Prob (F-statistic):              0.362
Time:                        09:09:47   Log-Likelihood:                -58.230
No. Observations:                 357   AIC:                             122.5
Df Residuals:                     354   BIC:                             134.1
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.1667      0.058     19.981      0.000       1.052       1.281
C(cluster)[T.1]    -0.0821      0.060     -1.357      0.176      -0.201       0.037
C(cluster)[T.2]    -0.1667      0.211     -0.792      0.429      -0.581       0.247
==============================================================================
Omnibus:                      219.273   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1055.340
Skew:                           2.850   Prob(JB):                    6.85e-230
Kurtosis:                       9.201   Cond. No.                         19.3
==============================================================================
means for Drink in last 12 months by cluster
            S2AQ3
cluster
0        1.166667
1        1.084592
2        1.000000
standard deviations for Drink in last 12 months by cluster
            S2AQ3
cluster
0        0.380693
1        0.278695
2        0.000000
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower   upper  reject
---------------------------------------------------
     0      1  -0.0821 0.3657 -0.2244  0.0603  False
     0      2  -0.1667 0.6912 -0.6622  0.3288  False
     1      2  -0.0846 0.9    -0.5621  0.3929  False
---------------------------------------------------
Code:
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 9 14:03:32 2021

@author: Justin
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

"""
Data Management
"""
data = pd.read_csv("nesarc_pds.csv")
#upper-case all DataFrame column names data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
cluster=data_clean[['S2DQ1', 'S2DQ2','S2DQ3C2', 'S2DQ4C2', 'S2DQ5C2','S2DQ6C2','S2DQ19', 'S2DQ21' ]]
cluster.describe()

clustervar = cluster.copy()
clustervar['S2DQ1'] = preprocessing.scale(clustervar['S2DQ1'].astype('float64'))
clustervar['S2DQ2'] = preprocessing.scale(clustervar['S2DQ2'].astype('float64'))
clustervar['S2DQ3C2'] = preprocessing.scale(clustervar['S2DQ3C2'].astype('float64'))
clustervar['S2DQ4C2'] = preprocessing.scale(clustervar['S2DQ4C2'].astype('float64'))
clustervar['S2DQ5C2'] = preprocessing.scale(clustervar['S2DQ5C2'].astype('float64'))
clustervar['S2DQ6C2'] = preprocessing.scale(clustervar['S2DQ6C2'].astype('float64'))
clustervar['S2DQ19'] = preprocessing.scale(clustervar['S2DQ19'].astype('float64'))
clustervar['S2DQ21'] = preprocessing.scale(clustervar['S2DQ21'].astype('float64'))
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model3.labels_)
newlist = dict(zip(cluslist, labels))
newlist
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']
newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
drink_data=data_clean['S2AQ3']
drink_train, drink_test = train_test_split(drink_data, test_size=.3, random_state=123)
drink_train1 = pd.DataFrame(drink_train)
drink_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(drink_train1, merged_train, on='index')
sub1 = merged_train_all[['S2AQ3', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
drinkmod = smf.ols(formula='S2AQ3 ~ C(cluster)', data=sub1).fit()
print (drinkmod.summary())
print ('means for Drink in last 12 months by cluster')
means1 = sub1.groupby('cluster').mean()
print(means1)
print ('standard deviations for Drink in last 12 months by cluster')
means2 = sub1.groupby('cluster').std()
print(means2)
mc1 = multi.MultiComparison(sub1['S2AQ3'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Lasso Regression
Below is the lasso regression I ran on family history and alcoholism. I chose adoptive father as the target, and for the predictor variables I chose natural family members (mother, father, brother, sister, grandmother, and grandfather). The code uses least angle regression with k-fold cross validation (k=10), so 10 random folds are used to choose the statistical model. Predictors whose regression coefficients shrank to 0 under the lasso penalty were removed. Out of the 8 variables, only 2 were selected in the final model and 6 were removed. The variables selected were natural mother and full sister.
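As a quick illustration of that shrinkage step, here is a minimal sketch for splitting the predictors by whether their coefficient survived the penalty, assuming the model and predictors objects from the script at the end of this post:

# Pair each predictor with its lasso coefficient, then separate the
# ones shrunk all the way to zero from the ones that were kept.
coefs = dict(zip(predictors.columns, model.coef_))
selected = [name for name, c in coefs.items() if c != 0]
removed = [name for name, c in coefs.items() if c == 0]
print('selected:', selected)   # per the output below: S2DQ2, S2DQ4C2
print('removed:', removed)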
S2DQ1      0.0
S2DQ3C2    0.0
S2DQ5C2    0.0
S2DQ6C2    0.0
S2DQ19     0.0
S2DQ21     0.0
S2DQ2     -0.004855304992687545
S2DQ4C2   -0.052779550407669074
None of the variables have a strong positive association with adoptive father, but natural mother and full sister are negatively associated with it. Natural mother has a small negative coefficient (-0.005) and full sister (-0.053). All of this is displayed on the Regression Coefficients Progression for Lasso Paths plot, where you can also see when each variable entered the model, with red being natural mother and orange being full sister.
The next plot shows the mean squared error on each fold. There is an initial change in the mean squared error, and the folds do not all follow the same pattern: some quickly increase while others quickly decrease. The plot levels off quickly, and adding more predictors has little further effect on the mean squared error.
The training mean squared error is 0.008 and the test mean squared error is 0.37. The test error is considerably larger than the training error, which suggests the model may not be very stable when applied to new data.
training data MSE
0.007921015697569204
test data MSE
0.3671288141656346
The training data R-square value is 0.29 and the test R-square value is 0.14, meaning the model explains 29% of the variance in the target for the training data and 14% for the test data.
training data R-square
0.2850385767425656
test data R-square
0.13606823548436742
Code:
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
#os.chdir("C:\Trees")
#Data and Cleaning
data = pd.read_csv("nesarc_pds.csv")
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()
target = data_clean['S2DQ14A']

#select predictor variables and target variable as separate data sets
predvar = data_clean[['S2DQ1', 'S2DQ2', 'S2DQ3C2', 'S2DQ4C2', 'S2DQ5C2', 'S2DQ6C2', 'S2DQ19', 'S2DQ21']]

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
predictors['S2DQ1'] = preprocessing.scale(predictors['S2DQ1'].astype('float64'))
predictors['S2DQ2'] = preprocessing.scale(predictors['S2DQ2'].astype('float64'))
predictors['S2DQ3C2'] = preprocessing.scale(predictors['S2DQ3C2'].astype('float64'))
predictors['S2DQ4C2'] = preprocessing.scale(predictors['S2DQ4C2'].astype('float64'))
predictors['S2DQ5C2'] = preprocessing.scale(predictors['S2DQ5C2'].astype('float64'))
predictors['S2DQ6C2'] = preprocessing.scale(predictors['S2DQ6C2'].astype('float64'))
predictors['S2DQ19'] = preprocessing.scale(predictors['S2DQ19'].astype('float64'))
predictors['S2DQ21'] = preprocessing.scale(predictors['S2DQ21'].astype('float64'))
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

res_dict = dict(zip(predictors.columns, model.coef_))
pred_dict = dict(sorted(res_dict.items(), key=lambda x: abs(x[1])))
print(pred_dict)

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Classification Tree
Summary:
Below is the classification tree I created using graphviz and pydotplus. As discussed in the class video, because large trees are hard to understand and cannot be pruned, I used only 2 explanatory variables in this assignment; my original tree was too large. The training sample has 25,855 rows and 2 explanatory variables, and the test sample has 17,238 rows and 2 explanatory variables. The target is S2AQ3 ("had a drink in the last 12 months"). The tree starts with a split on the 1st explanatory variable (x0, alcoholic father). If x0 <= 1.5 is true, you move left to the 2nd explanatory variable (x1, alcoholic mother); this branch covers 4,952 of the 25,855 samples, and you can continue down the tree from there. Going back to the first split, if x0 <= 1.5 is false (x0 > 1.5), you move right to the 2nd explanatory variable (x1, alcoholic mother), which covers 20,903 of the 25,855 samples. Following the same procedure, we can continue down this path of the decision tree.
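As an aside, sklearn can also print a fitted tree as plain text, which can be easier to follow than the x0/x1 labels in the graphviz image. A minimal sketch, assuming the classifier fit in the code below (the readable feature labels are my own):

from sklearn.tree import export_text

# Render the fitted tree with names for the two predictors:
# S2DQ1 (alcoholic father) and S2DQ2 (alcoholic mother).
tree_rules = export_text(classifier, feature_names=['S2DQ1_father', 'S2DQ2_mother'])
print(tree_rules)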
Code:
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\Trees")
#Data and Cleaning
data = pd.read_csv("nesarc_pds.csv")
data_clean = data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2']]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)
#Build model
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image, display
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
print(out.getvalue())
graph=pydotplus.graph_from_dot_data(out.getvalue())
# Image
Img = Image(graph.create_png())
display(Img)
Output:
(25855, 2)
(17238, 2)
(25855,)
(17238,)
Random Forest ** Classification tree assignment is in previous post below**
I performed a random forest analysis below. Like the decision tree, I used the explanatory variables alcoholic father, alcoholic mother, alcoholic brother, and alcoholic sister, with whether the individual had drinks in the last 12 months as the target. Alcoholic father had the highest importance score at 0.478, and alcoholic brother had the lowest at 0.14. The accuracy of the random forest with 15 trees is 0.907, or roughly 91%. Adding trees beyond the first couple increased accuracy only a small amount (approximately 2%). Below is a short sketch for reading the importance scores, followed by the code used to generate the random forest.
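A minimal sketch for matching each predictor to its importance score, assuming the fitted ExtraTreesClassifier (model) and the predictors frame from the script below:

# Print each predictor next to its importance score so the raw
# feature_importances_ array is easier to read.
for name, score in zip(predictors.columns, model.feature_importances_):
    print(name, round(score, 3))
# S2DQ1 (alcoholic father) should come out highest, around 0.478.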
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
#os.chdir("C:\Trees")
#Data and Cleaning
AH_data = pd.read_csv("nesarc_pds.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2','S2DQ3C2','S2DQ4C2' ]]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=15)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
trees = range(1, 16)
accuracy = np.zeros(15)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=trees[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
Classification Tree
Below is the classification tree I created using graphviz and pydotplus. The predictors used are mother, father, brother, and sister with alcohol issues, and the target is having had a drink in the last 12 months. Due to the complexity of the tree, I removed brother and sister to better explain it, and I posted that smaller tree below as well. Looking at the smaller tree, there were 25,855 samples. If the split value (the predictor for mother or father having alcohol issues) is less than 1.5, you move to the left (true); since 1 is yes and 2 is no, this means they have an issue. This branch has 4,958 samples. If your value is greater than 1.5, you move to the right (false); that branch has 20,897 samples. You can continue to follow the tree the same way.
Below is the code used for the larger tree (for the smaller one I just removed brother and sister).
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\Trees")
#Data and Cleaning
AH_data = pd.read_csv("nesarc_pds.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['S2DQ1','S2DQ2','S2DQ3C2','S2DQ4C2' ]]
targets = data_clean['S2AQ3']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image, display
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
print(out.getvalue())
graph=pydotplus.graph_from_dot_data(out.getvalue())
# Image
Img = Image(graph.create_png())
display(Img)
Output:
digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 1.5\ngini = 0.468\nsamples = 25855\nvalue = [16219, 9617, 19]"] ;
1 [label="X[2] <= 1.5\ngini = 0.434\nsamples = 4904\nvalue = [3343, 1560, 1]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[3] <= 1.5\ngini = 0.457\nsamples = 1488\nvalue = [962, 526, 0]"] ;
1 -> 2 ;
3 [label="X[1] <= 5.5\ngini = 0.475\nsamples = 334\nvalue = [204, 130, 0]"] ;
2 -> 3 ;
4 [label="X[1] <= 1.5\ngini = 0.474\nsamples = 331\nvalue = [203, 128, 0]"] ;
3 -> 4 ;
5 [label="gini = 0.476\nsamples = 105\nvalue = [64, 41, 0]"] ;
4 -> 5 ;
6 [label="gini = 0.474\nsamples = 226\nvalue = [139, 87, 0]"] ;
4 -> 6 ;
7 [label="gini = 0.444\nsamples = 3\nvalue = [1, 2, 0]"] ;
3 -> 7 ;
8 [label="X[3] <= 5.5\ngini = 0.451\nsamples = 1154\nvalue = [758, 396, 0]"] ;
2 -> 8 ;
9 [label="X[1] <= 1.5\ngini = 0.45\nsamples = 1145\nvalue = [754, 391, 0]"] ;
8 -> 9 ;
10 [label="gini = 0.434\nsamples = 195\nvalue = [133, 62, 0]"] ;
9 -> 10 ;
11 [label="X[1] <= 5.5\ngini = 0.453\nsamples = 950\nvalue = [621, 329, 0]"] ;
9 -> 11 ;
12 [label="gini = 0.453\nsamples = 941\nvalue = [615, 326, 0]"] ;
11 -> 12 ;
13 [label="gini = 0.444\nsamples = 9\nvalue = [6, 3, 0]"] ;
11 -> 13 ;
14 [label="X[1] <= 5.5\ngini = 0.494\nsamples = 9\nvalue = [4, 5, 0]"] ;
8 -> 14 ;
15 [label="gini = 0.469\nsamples = 8\nvalue = [3, 5, 0]"] ;
14 -> 15 ;
16 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
14 -> 16 ;
17 [label="X[2] <= 5.5\ngini = 0.423\nsamples = 3416\nvalue = [2381, 1034, 1]"] ;
1 -> 17 ;
18 [label="X[1] <= 5.5\ngini = 0.421\nsamples = 3366\nvalue = [2352, 1013, 1]"] ;
17 -> 18 ;
19 [label="X[3] <= 1.5\ngini = 0.422\nsamples = 3331\nvalue = [2326, 1004, 1]"] ;
18 -> 19 ;
20 [label="X[1] <= 1.5\ngini = 0.433\nsamples = 316\nvalue = [216, 100, 0]"] ;
19 -> 20 ;
21 [label="gini = 0.426\nsamples = 78\nvalue = [54, 24, 0]"] ;
20 -> 21 ;
22 [label="gini = 0.435\nsamples = 238\nvalue = [162, 76, 0]"] ;
20 -> 22 ;
23 [label="X[1] <= 1.5\ngini = 0.42\nsamples = 3015\nvalue = [2110, 904, 1]"] ;
19 -> 23 ;
24 [label="X[3] <= 5.5\ngini = 0.416\nsamples = 385\nvalue = [272, 112, 1]"] ;
23 -> 24 ;
25 [label="gini = 0.418\nsamples = 383\nvalue = [270, 112, 1]"] ;
24 -> 25 ;
26 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0, 0]"] ;
24 -> 26 ;
27 [label="X[3] <= 5.5\ngini = 0.421\nsamples = 2630\nvalue = [1838, 792, 0]"] ;
23 -> 27 ;
28 [label="gini = 0.421\nsamples = 2622\nvalue = [1833, 789, 0]"] ;
27 -> 28 ;
29 [label="gini = 0.469\nsamples = 8\nvalue = [5, 3, 0]"] ;
27 -> 29 ;
30 [label="X[3] <= 1.5\ngini = 0.382\nsamples = 35\nvalue = [26, 9, 0]"] ;
18 -> 30 ;
31 [label="gini = 0.0\nsamples = 3\nvalue = [3, 0, 0]"] ;
30 -> 31 ;
32 [label="gini = 0.404\nsamples = 32\nvalue = [23, 9, 0]"] ;
30 -> 32 ;
33 [label="X[3] <= 1.5\ngini = 0.487\nsamples = 50\nvalue = [29, 21, 0]"] ;
17 -> 33 ;
34 [label="X[1] <= 1.5\ngini = 0.278\nsamples = 6\nvalue = [5, 1, 0]"] ;
33 -> 34 ;
35 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
34 -> 35 ;
36 [label="gini = 0.32\nsamples = 5\nvalue = [4, 1, 0]"] ;
34 -> 36 ;
37 [label="X[1] <= 1.5\ngini = 0.496\nsamples = 44\nvalue = [24, 20, 0]"] ;
33 -> 37 ;
38 [label="X[3] <= 5.5\ngini = 0.42\nsamples = 10\nvalue = [7, 3, 0]"] ;
37 -> 38 ;
39 [label="gini = 0.278\nsamples = 6\nvalue = [5, 1, 0]"] ;
38 -> 39 ;
40 [label="gini = 0.5\nsamples = 4\nvalue = [2, 2, 0]"] ;
38 -> 40 ;
41 [label="X[1] <= 5.5\ngini = 0.5\nsamples = 34\nvalue = [17, 17, 0]"] ;
37 -> 41 ;
42 [label="X[3] <= 5.5\ngini = 0.499\nsamples = 31\nvalue = [16, 15, 0]"] ;
41 -> 42 ;
43 [label="gini = 0.497\nsamples = 24\nvalue = [13, 11, 0]"] ;
42 -> 43 ;
44 [label="gini = 0.49\nsamples = 7\nvalue = [3, 4, 0]"] ;
42 -> 44 ;
45 [label="gini = 0.444\nsamples = 3\nvalue = [1, 2, 0]"] ;
41 -> 45 ;
46 [label="X[1] <= 1.5\ngini = 0.474\nsamples = 20951\nvalue = [12876, 8057, 18]"] ;
0 -> 46 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
47 [label="X[2] <= 1.5\ngini = 0.381\nsamples = 621\nvalue = [462, 159, 0]"] ;
46 -> 47 ;
48 [label="X[0] <= 5.5\ngini = 0.426\nsamples = 133\nvalue = [92, 41, 0]"] ;
47 -> 48 ;
49 [label="X[3] <= 5.5\ngini = 0.394\nsamples = 115\nvalue = [84, 31, 0]"] ;
48 -> 49 ;
50 [label="X[3] <= 1.5\ngini = 0.388\nsamples = 114\nvalue = [84, 30, 0]"] ;
49 -> 50 ;
51 [label="gini = 0.397\nsamples = 33\nvalue = [24, 9, 0]"] ;
50 -> 51 ;
52 [label="gini = 0.384\nsamples = 81\nvalue = [60, 21, 0]"] ;
50 -> 52 ;
53 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
49 -> 53 ;
54 [label="X[3] <= 1.5\ngini = 0.494\nsamples = 18\nvalue = [8, 10, 0]"] ;
48 -> 54 ;
55 [label="gini = 0.0\nsamples = 4\nvalue = [0, 4, 0]"] ;
54 -> 55 ;
56 [label="gini = 0.49\nsamples = 14\nvalue = [8, 6, 0]"] ;
54 -> 56 ;
57 [label="X[2] <= 5.5\ngini = 0.367\nsamples = 488\nvalue = [370, 118, 0]"] ;
47 -> 57 ;
58 [label="X[0] <= 5.5\ngini = 0.368\nsamples = 485\nvalue = [367, 118, 0]"] ;
57 -> 58 ;
59 [label="X[3] <= 1.5\ngini = 0.363\nsamples = 391\nvalue = [298, 93, 0]"] ;
58 -> 59 ;
60 [label="gini = 0.334\nsamples = 52\nvalue = [41, 11, 0]"] ;
59 -> 60 ;
61 [label="gini = 0.367\nsamples = 339\nvalue = [257, 82, 0]"] ;
59 -> 61 ;
62 [label="X[3] <= 1.5\ngini = 0.39\nsamples = 94\nvalue = [69, 25, 0]"] ;
58 -> 62 ;
63 [label="gini = 0.48\nsamples = 10\nvalue = [6, 4, 0]"] ;
62 -> 63 ;
64 [label="X[3] <= 5.5\ngini = 0.375\nsamples = 84\nvalue = [63, 21, 0]"] ;
62 -> 64 ;
65 [label="gini = 0.378\nsamples = 83\nvalue = [62, 21, 0]"] ;
64 -> 65 ;
66 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
64 -> 66 ;
67 [label="gini = 0.0\nsamples = 3\nvalue = [3, 0, 0]"] ;
57 -> 67 ;
68 [label="X[1] <= 5.5\ngini = 0.476\nsamples = 20330\nvalue = [12414, 7898, 18]"] ;
46 -> 68 ;
69 [label="X[2] <= 5.5\ngini = 0.474\nsamples = 19625\nvalue = [12065, 7550, 10]"] ;
68 -> 69 ;
70 [label="X[2] <= 1.5\ngini = 0.474\nsamples = 19443\nvalue = [11972, 7463, 8]"] ;
69 -> 70 ;
71 [label="X[3] <= 1.5\ngini = 0.485\nsamples = 2044\nvalue = [1200, 843, 1]"] ;
70 -> 71 ;
72 [label="X[0] <= 5.5\ngini = 0.479\nsamples = 285\nvalue = [172, 113, 0]"] ;
71 -> 72 ;
73 [label="gini = 0.482\nsamples = 261\nvalue = [155, 106, 0]"] ;
72 -> 73 ;
74 [label="gini = 0.413\nsamples = 24\nvalue = [17, 7, 0]"] ;
72 -> 74 ;
75 [label="X[0] <= 5.5\ngini = 0.486\nsamples = 1759\nvalue = [1028, 730, 1]"] ;
71 -> 75 ;
76 [label="X[3] <= 5.5\ngini = 0.486\nsamples = 1686\nvalue = [988, 697, 1]"] ;
75 -> 76 ;
77 [label="gini = 0.486\nsamples = 1679\nvalue = [984, 694, 1]"] ;
76 -> 77 ;
78 [label="gini = 0.49\nsamples = 7\nvalue = [4, 3, 0]"] ;
76 -> 78 ;
79 [label="gini = 0.495\nsamples = 73\nvalue = [40, 33, 0]"] ;
75 -> 79 ;
80 [label="X[3] <= 5.5\ngini = 0.472\nsamples = 17399\nvalue = [10772, 6620, 7]"] ;
70 -> 80 ;
81 [label="X[0] <= 5.5\ngini = 0.472\nsamples = 17361\nvalue = [10745, 6609, 7]"] ;
80 -> 81 ;
82 [label="X[3] <= 1.5\ngini = 0.472\nsamples = 16719\nvalue = [10337, 6377, 5]"] ;
81 -> 82 ;
83 [label="gini = 0.478\nsamples = 505\nvalue = [306, 199, 0]"] ;
82 -> 83 ;
84 [label="gini = 0.472\nsamples = 16214\nvalue = [10031, 6178, 5]"] ;
82 -> 84 ;
85 [label="X[3] <= 1.5\ngini = 0.466\nsamples = 642\nvalue = [408, 232, 2]"] ;
81 -> 85 ;
86 [label="gini = 0.476\nsamples = 23\nvalue = [14, 9, 0]"] ;
85 -> 86 ;
87 [label="gini = 0.465\nsamples = 619\nvalue = [394, 223, 2]"] ;
85 -> 87 ;
88 [label="X[0] <= 5.5\ngini = 0.411\nsamples = 38\nvalue = [27, 11, 0]"] ;
80 -> 88 ;
89 [label="gini = 0.4\nsamples = 29\nvalue = [21, 8, 0]"] ;
88 -> 89 ;
90 [label="gini = 0.444\nsamples = 9\nvalue = [6, 3, 0]"] ;
88 -> 90 ;
91 [label="X[0] <= 5.5\ngini = 0.51\nsamples = 182\nvalue = [93, 87, 2]"] ;
69 -> 91 ;
92 [label="X[3] <= 1.5\ngini = 0.511\nsamples = 155\nvalue = [81, 72, 2]"] ;
91 -> 92 ;
93 [label="gini = 0.408\nsamples = 7\nvalue = [5, 2, 0]"] ;
92 -> 93 ;
94 [label="X[3] <= 5.5\ngini = 0.512\nsamples = 148\nvalue = [76, 70, 2]"] ;
92 -> 94 ;
95 [label="gini = 0.5\nsamples = 68\nvalue = [35, 33, 0]"] ;
94 -> 95 ;
96 [label="gini = 0.523\nsamples = 80\nvalue = [41, 37, 2]"] ;
94 -> 96 ;
97 [label="X[3] <= 1.5\ngini = 0.494\nsamples = 27\nvalue = [12, 15, 0]"] ;
91 -> 97 ;
98 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
97 -> 98 ;
99 [label="X[3] <= 5.5\ngini = 0.497\nsamples = 26\nvalue = [12, 14, 0]"] ;
97 -> 99 ;
100 [label="gini = 0.48\nsamples = 10\nvalue = [6, 4, 0]"] ;
99 -> 100 ;
101 [label="gini = 0.469\nsamples = 16\nvalue = [6, 10, 0]"] ;
99 -> 101 ;
102 [label="X[0] <= 5.5\ngini = 0.511\nsamples = 705\nvalue = [349, 348, 8]"] ;
68 -> 102 ;
103 [label="X[3] <= 5.5\ngini = 0.483\nsamples = 76\nvalue = [45, 31, 0]"] ;
102 -> 103 ;
104 [label="X[2] <= 5.5\ngini = 0.479\nsamples = 68\nvalue = [41, 27, 0]"] ;
103 -> 104 ;
105 [label="X[2] <= 1.5\ngini = 0.481\nsamples = 67\nvalue = [40, 27, 0]"] ;
104 -> 105 ;
106 [label="X[3] <= 1.5\ngini = 0.496\nsamples = 11\nvalue = [6, 5, 0]"] ;
105 -> 106 ;
107 [label="gini = 0.444\nsamples = 3\nvalue = [2, 1, 0]"] ;
106 -> 107 ;
108 [label="gini = 0.5\nsamples = 8\nvalue = [4, 4, 0]"] ;
106 -> 108 ;
109 [label="X[3] <= 1.5\ngini = 0.477\nsamples = 56\nvalue = [34, 22, 0]"] ;
105 -> 109 ;
110 [label="gini = 0.444\nsamples = 3\nvalue = [2, 1, 0]"] ;
109 -> 110 ;
111 [label="gini = 0.478\nsamples = 53\nvalue = [32, 21, 0]"] ;
109 -> 111 ;
112 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
104 -> 112 ;
113 [label="X[2] <= 5.5\ngini = 0.5\nsamples = 8\nvalue = [4, 4, 0]"] ;
103 -> 113 ;
114 [label="gini = 0.5\nsamples = 2\nvalue = [1, 1, 0]"] ;
113 -> 114 ;
115 [label="gini = 0.5\nsamples = 6\nvalue = [3, 3, 0]"] ;
113 -> 115 ;
116 [label="X[3] <= 1.5\ngini = 0.512\nsamples = 629\nvalue = [304, 317, 8]"] ;
102 -> 116 ;
117 [label="X[2] <= 5.5\ngini = 0.469\nsamples = 8\nvalue = [3, 5, 0]"] ;
116 -> 117 ;
118 [label="X[2] <= 1.5\ngini = 0.408\nsamples = 7\nvalue = [2, 5, 0]"] ;
117 -> 118 ;
119 [label="gini = 0.5\nsamples = 4\nvalue = [2, 2, 0]"] ;
118 -> 119 ;
120 [label="gini = 0.0\nsamples = 3\nvalue = [0, 3, 0]"] ;
118 -> 120 ;
121 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]"] ;
117 -> 121 ;
122 [label="X[2] <= 1.5\ngini = 0.512\nsamples = 621\nvalue = [301, 312, 8]"] ;
116 -> 122 ;
123 [label="X[3] <= 5.5\ngini = 0.49\nsamples = 14\nvalue = [8, 6, 0]"] ;
122 -> 123 ;
124 [label="gini = 0.486\nsamples = 12\nvalue = [7, 5, 0]"] ;
123 -> 124 ;
125 [label="gini = 0.5\nsamples = 2\nvalue = [1, 1, 0]"] ;
123 -> 125 ;
126 [label="X[3] <= 5.5\ngini = 0.513\nsamples = 607\nvalue = [293, 306, 8]"] ;
122 -> 126 ;
127 [label="X[2] <= 5.5\ngini = 0.499\nsamples = 174\nvalue = [83, 91, 0]"] ;
126 -> 127 ;
128 [label="gini = 0.5\nsamples = 156\nvalue = [77, 79, 0]"] ;
127 -> 128 ;
129 [label="gini = 0.444\nsamples = 18\nvalue = [6, 12, 0]"] ;
127 -> 129 ;
130 [label="X[2] <= 5.5\ngini = 0.518\nsamples = 433\nvalue = [210, 215, 8]"] ;
126 -> 130 ;
131 [label="gini = 0.494\nsamples = 9\nvalue = [5, 4, 0]"] ;
130 -> 131 ;
132 [label="gini = 0.518\nsamples = 424\nvalue = [205, 211, 8]"] ;
130 -> 132 ;
}
Full tree output above (mother, father, brother, sister):
Reduced tree used for the explanation (mother and father only; brother and sister removed):
Logistic Regression
Below is the code and output for the logistic regression testing I am doing for this project. Since I am looking at family history and alcoholism, I am looking at individuals whose fathers and grandfathers were alcoholics, whether the individual had a drink in the last 12 months, and whether the individual had school or work issues.
The first test is fathers and grandfathers who had drinking problems versus whether the individual had a drink in the last 12 months. This test had 32,588 observations. The regression is significant, with a p value of 9.120e-43. The coefficient of the intercept is -1.0617 and the coefficient of S2AQ3 (had a drink in the last 12 months) is 0.3495, so crit1 = -1.06 + 0.35 * S2AQ3. Looking at the odds ratio for this test, individuals who drank at least 1 drink in the last 12 months are 1.4 times more likely to have a father and grandfather with alcohol issues. There is 95% certainty that the true population odds ratio falls between 1.35 and 1.49 (OR=1.42, 95% CI = 1.35-1.49, p=.0001).
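The odds ratio and its confidence interval come directly from exponentiating the coefficient and its confidence bounds; a quick check of that arithmetic:

import numpy as np

# exp(coefficient) gives the odds ratio; exponentiating the 95%
# confidence bounds of the coefficient gives the CI for the OR.
print(np.exp(0.3495))                # ~1.42
print(np.exp(0.299), np.exp(0.400))  # ~1.35 to ~1.49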
The second test is fathers and grandfathers who had drinking problems versus whether the individual had issues with school or work. This test had 26,434 observations. The regression is significant, with a p value of 9.114e-91. The coefficient of the intercept is -0.7530 and the coefficient of S2BQ1A20 (had problems with school or work) is 1.4027, so crit1 = -0.7530 + 1.4027 * S2BQ1A20. Looking at the odds ratio for this test, individuals who had school or work issues are 4.07 times more likely to have a father and grandfather with alcohol issues. There is 95% certainty that the true population odds ratio falls between 3.53 and 4.68 (OR=4.07, 95% CI = 3.53-4.68, p=.0001).
In my third test I combined having at least 1 drink in the last 12 months and having school or work problems as predictors of having a father and grandfather with alcohol issues. This test had 26,432 observations and a significant p value of 2.124e-89. The coefficients are: intercept -0.7404, S2AQ3 -0.0157, and S2BQ1A20 1.4006. When both predictors are combined, you will notice the p value of S2AQ3 is no longer significant. Because of this, I adjust by interpreting only the significant odds ratio for S2BQ1A20: individuals with work or school issues are 4.06 times more likely to have a father and grandfather with alcohol issues, even after including at least 1 drink in the model (OR=4.06, 95% CI = 3.52-4.67, p=.0001).
I am able to reject the null hypothesis for the tests above due to the very low p values on each test.
There is confounding in the third test: with S2AQ3 and S2BQ1A20 combined, S2AQ3 is no longer significant (p=0.625), unlike when S2BQ1A20 is not included. With this being the case, as described above, I adjusted the odds interpretation and did not include S2AQ3. A sketch of how these models could be fit follows, then the output as run.
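A minimal sketch of fitting the three models with statsmodels, assuming data is the NESARC DataFrame with crit1, S2AQ3, and S2BQ1A20 recoded to 0/1 as in my earlier posts:

import numpy as np
import statsmodels.formula.api as smf

# Test 1: had a drink in the last 12 months
lreg1 = smf.logit('crit1 ~ S2AQ3', data=data).fit()
print(lreg1.summary())

# Test 2: school or work problems
lreg2 = smf.logit('crit1 ~ S2BQ1A20', data=data).fit()
print(lreg2.summary())

# Test 3: both predictors together
lreg3 = smf.logit('crit1 ~ S2AQ3 + S2BQ1A20', data=data).fit()
print(lreg3.summary())

# odds ratios with 95% confidence intervals for the combined model
conf = lreg3.conf_int()
conf['OR'] = lreg3.params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))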
Optimization terminated successfully.
Current function value: 0.610574
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 32588
Model: Logit Df Residuals: 32586
Method: MLE Df Model: 1
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.004700
Time: 15:49:18 Log-Likelihood: -19897.
converged: True LL-Null: -19991.
Covariance Type: nonrobust LLR p-value: 9.120e-43
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.0617 0.021 -50.315 0.000 -1.103 -1.020
S2AQ3 0.3495 0.026 13.577 0.000 0.299 0.400
==============================================================================
Odds Ratios
Intercept 0.345881
S2AQ3 1.418371
dtype: float64
Lower CI Upper CI OR
Intercept 0.331869 0.360485 0.345881
S2AQ3 1.348583 1.491771 1.418371
Optimization terminated successfully.
Current function value: 0.627545
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 26434
Model: Logit Df Residuals: 26432
Method: MLE Df Model: 1
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.01215
Time: 15:49:18 Log-Likelihood: -16589.
converged: True LL-Null: -16793.
Covariance Type: nonrobust LLR p-value: 9.114e-91
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.7530 0.013 -56.142 0.000 -0.779 -0.727
S2BQ1A20 1.4027 0.072 19.538 0.000 1.262 1.543
==============================================================================
Odds Ratios
Intercept 0.470974
S2BQ1A20 4.066108
dtype: float64
Lower CI Upper CI OR
Intercept 0.458756 0.483519 0.470974
S2BQ1A20 3.532385 4.680473 4.066108
Optimization terminated successfully.
Current function value: 0.627559
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: crit1 No. Observations: 26432
Model: Logit Df Residuals: 26429
Method: MLE Df Model: 2
Date: Thu, 28 Jan 2021 Pseudo R-squ.: 0.01216
Time: 15:49:18 Log-Likelihood: -16588.
converged: True LL-Null: -16792.
Covariance Type: nonrobust LLR p-value: 2.124e-89
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.7404 0.029 -25.800 0.000 -0.797 -0.684
S2AQ3 -0.0157 0.032 -0.488 0.625 -0.079 0.047
S2BQ1A20 1.4006 0.072 19.477 0.000 1.260 1.542
==============================================================================
Odds Ratios
Intercept 0.476902
S2AQ3 0.984436
S2BQ1A20 4.057551
dtype: float64
Lower CI Upper CI OR
Intercept 0.450818 0.504496 0.476902
S2AQ3 0.924348 1.048431 0.984436
S2BQ1A20 3.524161 4.671671 4.057551
Multiple Regression
In the tests below I am looking at natural family members and their association with drinking at least 1 drink in the last 12 months. I also included situations where the father's father and the mother's father also had drinking problems. I recoded the yes and no responses from 1 and 2 to 1 and 0 and removed values of 9. This test had 43,062 observations with Problem as the dependent variable. The p value is very low, at 8.33e-15. The coefficient of the intercept is 0.4124 and the coefficient of S2AQ3 (had at least 1 drink in the last year) is 0.0591, so the predicted Problem score for someone who drank is 0.4124 + 0.0591 = 0.4715. I can reject the null hypothesis and state there is a significant, positive association between family members with alcohol problems and drinking in the last 12 months.
The next test compared the same family members but looked at the number of drinks, after removing values of 99. This association had 7,891 observations, a low p value of 0.00567, an intercept of 0.5095, and a coefficient of 0.0145 for S2AQ4D (number of drinks), giving Problem = 0.5095 + 0.0145 * S2AQ4D. I can reject the null hypothesis and state there is a significant, positive association between family members with alcohol problems and number of drinks.
I then compared the same family members (Problem) against having work and school issues due to drinking problems. Here I again recoded the yes and no responses to 1 and 0 and removed values of 9. This test had 34,572 observations with Problem as the dependent variable. The p value is very low, at 2.85e-209. The intercept is 0.4663 and the coefficient of S2BQ1A20 (had work and school issues) is 0.7436, giving Problem = 0.4663 + 0.7436 * S2BQ1A20. I can reject the null hypothesis and state there is a significant association between family members with alcohol problems and work and school problems.
Finally, I combined all of the above into one test to check the family association with drinking at least 1 drink in the last 12 months, the number of drinks, and having school issues. This had 7,866 observations with a p value of 2.56e-26. The coefficients are: intercept 0.2517, S2AQ3 0.2517, S2AQ4D 0.0087, and S2BQ1A20 0.5379, giving Problem = 0.2517 + 0.2517 * S2AQ3 + 0.0087 * S2AQ4D + 0.5379 * S2BQ1A20. I can reject the null hypothesis in this case as well.
While I did not find evidence of a confounding variable, I suspect that for work or school issues with drinking (S2BQ1A20), a confounding variable would be the number of drinks (S2AQ4D): as the number of drinks increases, it would have a positive effect on issues at school or work. I feel this is the case because work and school issues have an association with family members with drinking problems, but simply saying that people with alcohol issues will have work and school problems is not correct, because someone who drinks a small number of drinks is not going to have the same issues as someone who drinks large amounts. This is just an assumption I am making.
Looking at the Q-Q plot, I would say the residuals are not normally distributed. While there are points near or on the line, there is a large number of points at approximately -0.75, and above 1 the points sit higher than the line, showing outcomes to the left of the line. This shows the underlying distribution is not normal, but it is not far off. Looking at the influence plot, the majority of points are within +/-1 standardized residuals, but a number of them are above this, so we have some outliers (2+) in the data and a few extreme outliers (3+). A short sketch of reading the regression coefficients follows, then the full code.
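To make the coefficient arithmetic above concrete, here is a minimal sketch of turning the first model's estimates into predicted group means, assuming the reg1 fit from the code below:

# The predicted Problem score for a group is the intercept plus the
# coefficient times the predictor value (0 or 1 after the recode).
b0 = reg1.params['Intercept']   # ~0.4124
b1 = reg1.params['S2AQ3']       # ~0.0591
print('no drink in last 12 months:', b0)      # ~0.412
print('drank in last 12 months:', b0 + b1)    # ~0.471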
Code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2AQ4D'] = pandas.to_numeric(data['S2AQ4D'], errors='coerce')
data['S2BQ1A20'] = pandas.to_numeric(data['S2BQ1A20'], errors='coerce')
#Family members with alcohol issues (recode 1,2 to 0,1 )
# after recoding 9s to missing
recode1 = {1: 1, 2: 0}
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ1']= data['S2DQ1'].map(recode1)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ2']= data['S2DQ2'].map(recode1)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ3C2']= data['S2DQ3C2'].map(recode1)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ4C2']= data['S2DQ4C2'].map(recode1)
data['S2AQ3']=data['S2AQ3'].replace(9, numpy.nan)
data['S2AQ3']= data['S2AQ3'].map(recode1)
data['S2BQ1A20']=data['S2BQ1A20'].replace(9, numpy.nan)
data['S2BQ1A20']= data['S2BQ1A20'].map(recode1)
data['S2AQ4D']=data['S2AQ4D'].replace(99, numpy.nan)
# check recode above
chk1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (chk1)
# sum Problem
data ['Problem'] = numpy.nansum([data['S2DQ1'], data['S2DQ2'], data['S2DQ3C2'], data['S2DQ4C2']], axis=0)
# check sum
chksum=data[['S2DQ1', 'S2DQ2', 'S2DQ3C2','S2DQ4C2', 'Problem']]
chksum.head(n=50)
chk2 = data['Problem'].value_counts(sort=False, dropna=False)
print (chk2)
#adjust counts if fathers father was Alcoholic
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ11'] = pandas.to_numeric(data['S2DQ11'], errors='coerce')
def crit1(row):
    if row['S2DQ1'] == 1 or row['S2DQ11'] == 1:
        return 1
    elif row['S2DQ1'] == 2 and row['S2DQ11'] == 2:
        return 0

data['crit1'] = data.apply(lambda row: crit1(row), axis=1)
chk3 = data['crit1'].value_counts(sort=False, dropna=False)
print (chk3)
#adjust counts if mothers father was alc
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ21'] = pandas.to_numeric(data['S2DQ21'], errors='coerce')
def crit2(row):
    if row['S2DQ2'] == 1 or row['S2DQ21'] == 1:
        return 1
    elif row['S2DQ2'] == 2 and row['S2DQ21'] == 2:
        return 0

data['crit2'] = data.apply(lambda row: crit2(row), axis=1)
chk4 = data['crit2'].value_counts(sort=False, dropna=False)
print (chk4)
#Linear Regression Model
reg1 = smf.ols('Problem ~ S2AQ3', data=data).fit()
print (reg1.summary())
# calculating means for regression observations
sub1 = data[['Problem', 'S2AQ3']].dropna()
# group means & std deviation
print ("Mean")
ds1 = sub1.groupby('S2AQ3').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('S2AQ3').std()
print (ds2)
#bivariate bar graph
seaborn.catplot(x="S2AQ3", y="Problem", data=sub1, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number drank atleast 1 in last 12 months')
#Part 2
#Linear Regression Model
reg2 = smf.ols('Problem ~ S2AQ4D', data=data).fit()
print (reg2.summary())
# calculating means for regression observations
sub2 = data[['Problem', 'S2AQ4D']].dropna()
# group means & std deviation
print ("Mean")
ds3 = sub2.groupby('S2AQ4D').mean()
print (ds3)
print ("Standard deviation")
ds4 = sub2.groupby('S2AQ4D').std()
print (ds4)
#bivariate bar graph
seaborn.catplot(x="S2AQ4D", y="Problem", data=sub2, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number dranks on dat drinking in last 12 months')
sub2['S2AQ4D_c'] = (sub2['S2AQ4D'] - sub2['S2AQ4D'].mean())
print (sub2['S2AQ4D_c'].mean())
#Part 3 S2BQ1A20
#Linear Regression Model
reg3 = smf.ols('Problem ~ S2BQ1A20', data=data).fit()
print (reg3.summary())
# calculating means for regression observations
sub3 = data[['Problem', 'S2BQ1A20']].dropna()
# group means & std deviation
print ("Mean")
ds5 = sub3.groupby('S2BQ1A20').mean()
print (ds5)
print ("Standard deviation")
ds6 = sub3.groupby('S2BQ1A20').std()
print (ds6)
#bivariate bar graph
seaborn.catplot(x="S2BQ1A20", y="Problem", data=sub3, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number have job or school problems')
#Linear Regression Model
reg4 = smf.ols('Problem ~ S2AQ3 + S2AQ4D + S2BQ1A20', data=data).fit()
print (reg4.summary())
# calculating means for regression observations
sub4 = data[['S2AQ3', 'S2AQ4D', 'S2BQ1A20']].dropna()
#Linear Regression Model
reg5 = smf.ols('Problem ~ S2AQ4D_c' , data=sub2).fit()
print (reg5.summary())
#Q-Q
fig4=sm.qqplot(reg5.resid, line='r')
# simple plot of residuals
stdres=pandas.DataFrame(reg5.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
# leverage plot
fig5=sm.graphics.influence_plot(reg5, size=8)
print(fig5)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
OLS Regression Results
Dep. Variable: Problem R-squared: 0.015
Model: OLS Adj. R-squared: 0.015
Method: Least Squares F-statistic: 59.37
Date: Fri, 22 Jan 2021 Prob (F-statistic): 2.56e-26
Time: 10:28:06 Log-Likelihood: -9480.7
No. Observations: 7866 AIC: 1.897e+04
Df Residuals: 7863 BIC: 1.899e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2517 0.006 39.035 0.000 0.239 0.264
S2AQ3 0.2517 0.006 39.035 0.000 0.239 0.264
S2AQ4D 0.0087 0.005 1.663 0.096 -0.002 0.019
S2BQ1A20 0.5379 0.051 10.540 0.000 0.438 0.638
Omnibus: 2032.059 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4313.010
Skew: 1.520 Prob(JB): 0.00
Kurtosis: 4.979 Cond. No. 1.69e+16
OLS Regression Results
Dep. Variable: Problem R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 7.658
Date: Fri, 22 Jan 2021 Prob (F-statistic): 0.00567
Time: 10:28:06 Log-Likelihood: -9559.0
No. Observations: 7891 AIC: 1.912e+04
Df Residuals: 7889 BIC: 1.914e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.5348 0.009 58.456 0.000 0.517 0.553
S2AQ4D_c 0.0145 0.005 2.767 0.006 0.004 0.025
Omnibus: 2090.081 Durbin-Watson: 2.022
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4509.428
Skew: 1.550 Prob(JB): 0.00
Kurtosis: 5.026 Cond. No. 1.75
Regression modeling - Linear Regression week 2:
Below is the program I wrote to test a basic linear regression. I am looking at natural family members and their association with drinking at least 1 drink in the last 12 months. I also included situations where the father's father and the mother's father had alcohol issues. In the code below you will see where I recoded the yes and no answers from 1 and 2 to 1 and 0, and removed the values of 9. The dependent variable in this situation is Problem, and there were 43,093 observations. The coefficient of the intercept is 0.5279 and the coefficient of S2AQ3 (whether the person had a drink in the last year) is -0.0571, so the predicted value for someone who drank is 0.5279 - 0.0571 = 0.4708. The p value is very small, so I can reject the null hypothesis and state there is a significant association between family member alcoholism and drinking in the last 12 months.
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
#Family members with alcohol issues (recode 1,2 to 0,1 )
# after recoding 9s to missing
recode1 = {1: 1, 2: 0}
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ1']= data['S2DQ1'].map(recode1)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ2']= data['S2DQ2'].map(recode1)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ3C2']= data['S2DQ3C2'].map(recode1)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ4C2']= data['S2DQ4C2'].map(recode1)
data['S2AQ3']=data['S2AQ3'].replace(9, numpy.nan)
data['S2AQ3']= data['S2AQ3'].map(recode1)
# check recode above
chk1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (chk1)
# sum Problem
data ['Problem'] = numpy.nansum([data['S2DQ1'], data['S2DQ2'], data['S2DQ3C2'], data['S2DQ4C2']], axis=0)
# check sum
chksum=data[['S2DQ1', 'S2DQ2', 'S2DQ3C2','S2DQ4C2', 'Problem']]
chksum.head(n=50)
chk2 = data['Problem'].value_counts(sort=False, dropna=False)
print (chk2)
#adjust counts if fathers father was Alcoholic
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ11'] = pandas.to_numeric(data['S2DQ11'], errors='coerce')
def crit1(row):
    if row['S2DQ1'] == 1 or row['S2DQ11'] == 1:
        return 1
    elif row['S2DQ1'] == 2 and row['S2DQ11'] == 2:
        return 0

data['crit1'] = data.apply(lambda row: crit1(row), axis=1)
chk3 = data['crit1'].value_counts(sort=False, dropna=False)
print (chk3)
#adjust counts if mothers father was alc
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ21'] = pandas.to_numeric(data['S2DQ21'], errors='coerce')
def crit2(row):
    if row['S2DQ2'] == 1 or row['S2DQ21'] == 1:
        return 1
    elif row['S2DQ2'] == 2 and row['S2DQ21'] == 2:
        return 0

data['crit2'] = data.apply(lambda row: crit2(row), axis=1)
chk4 = data['crit2'].value_counts(sort=False, dropna=False)
print (chk4)
#Linear Regression Model
reg1 = smf.ols('Problem ~ S2AQ3', data=data).fit()
print (reg1.summary())
# calculating means for regression observations
sub1 = data[['Problem', 'S2AQ3']].dropna()
# group means & std deviation
print ("Mean")
ds1 = sub1.groupby('S2AQ3').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('S2AQ3').std()
print (ds2)
#bivariate bar graph
seaborn.catplot(x="S2AQ3", y="Problem", data=sub1, kind="bar", ci=None)
plt.xlabel('Problem 0 yes 1 no')
plt.ylabel('Mean Number drank atleast 1 in last 12 months')
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
1.000000 8124
0.000000 32445
nan 2524
Name: S2DQ1, dtype: int64
2.000000 3413
1.000000 9059
0.000000 29525
4.000000 182
3.000000 914
Name: Problem, dtype: int64
1.000000 9878
nan 33215
Name: crit1, dtype: int64
nan 40762
1.000000 2331
Name: crit2, dtype: int64
OLS Regression Results
==============================================================================
Dep. Variable: Problem R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 60.30
Date: Fri, 15 Jan 2021 Prob (F-statistic): 8.33e-15
Time: 14:39:11 Log-Likelihood: -49491.
No. Observations: 43062 AIC: 9.899e+04
Df Residuals: 43060 BIC: 9.900e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4124 0.006 68.563 0.000 0.401 0.424
S2AQ3 0.0591 0.008 7.765 0.000 0.044 0.074
==============================================================================
Omnibus: 14467.736 Durbin-Watson: 2.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40856.637
Skew: 1.815 Prob(JB): 0.00
Kurtosis: 6.098 Cond. No. 3.03
==============================================================================
Mean
Problem
S2AQ3
0.000000 0.412447
1.000000 0.471499
Standard deviation
Problem
S2AQ3
0.000000 0.747297
1.000000 0.773293
Part 1:
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a study that included N=43,093 individuals. This study included research on family history of alcoholism. I am using this study data to determine if family history influences alcoholism. The study shows that the majority of natural fathers (approximately 76%) were not alcoholics or problem drinkers. An even larger share of natural mothers (approximately 94%) were not alcoholics or problem drinkers. Full brothers and sisters were also included in the study: approximately 88% of full brothers did not have alcohol issues and 94% of full sisters did not have alcohol issues. I combined the family members into a new grouping I called Family so I can better research how one member compares to the entire family. I am also looking into the quantity (1-98) of alcoholic drinks consumed compared to the size of the drinks based on the timeframe. The study looked at how often coolers were consumed over the last 12 months. The majority of individuals did not answer these questions, so I will need to take this into account. Most individuals who did consume coolers consumed only 1-2 in the last 12 months, with the next largest group consuming monthly. Since the majority of individuals did not consume coolers in the last 12 months, the size of cooler (1-47) consumed was left blank in the study; only 18% of individuals answered this question. I am going to research these metrics to determine if family history as well as consumption has an effect on alcoholism.
Part 2:
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a survey conducted face to face with N=43,093 individuals. The purpose of the survey was to gather detailed measurements of drug and alcohol issues in the US population. The study was conducted in 2 waves across the 50 US states. The first wave took place in 2001-2002. The second wave, a re-interview of N=34,653 of the original respondents, took place in 2004-2005.
The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) was a study that included N=43,093 individuals. This study included research on family history of alcoholism. The study was conducted in 2 waves across the 50 states. The specific target of the survey was U.S. civilian non-institutionalized individuals in households and selected group quarters. The first wave was conducted face to face with individuals aged 18 and over. Blacks, Hispanics, and adults aged 18-24 were oversampled, and this was taken into account. The second wave was a re-interview of N=34,653 of the individuals interviewed in wave 1. I am using this study data to determine if family history influences alcoholism. I am going to be looking into family members, specifically natural mothers, fathers, brothers, and sisters, to determine if it can be hereditary.
Text
Below is the program where I calculate the correlation between the number of coolers one drinks and the size of the coolers. As shown in the scatterplot there is a slight positive association between cooler size and the number consumed, but it appears to level off at larger sizes. The correlation coefficient is low (r ≈ 0.096) and the p value is high (p ≈ 0.13), so the positive association is very weak and not statistically significant.
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import scipy.stats
import seaborn
import scipy
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
#pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
#pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2AQ4CR'] = pandas.to_numeric(data['S2AQ4CR'], errors='coerce')
data['S2AQ4D'] = pandas.to_numeric(data['S2AQ4D'], errors='coerce')
#remove unknown data
data['S2AQ4CR']=data['S2AQ4CR'].replace(99, numpy.nan)
data['S2AQ4D']=data['S2AQ4D'].replace(99, numpy.nan)
scat1 = seaborn.regplot(x="S2AQ4CR", y="S2AQ4D", fit_reg=True, data=data)
plt.xlabel('Cooler Size')
plt.ylabel('Number of coolers consumed')
plt.title('Scatterplot for the association between cooler size and the number of coolers consumed')
#complete cases: dropna() over all columns; dropna(subset=['S2AQ4CR','S2AQ4D']) would retain more rows
data_clean=data.dropna()
print ('Association between cooler size and the number of coolers consumed')
print (scipy.stats.pearsonr(data_clean['S2AQ4CR'], data_clean['S2AQ4D']))
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Results:
Association between cooler size and the number of coolers consumed
(0.09628360010686936, 0.12971994878189647)
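To put that coefficient in context, squaring r gives the proportion of shared variance (value copied from the pearsonr output above):
r = 0.09628360010686936
print(r**2) #about 0.0093: cooler size explains less than 1% of the variance in count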
Text
Below is the program I wrote to compare which family member has a history of alcohol problems against whether the respondent suffered alcohol abuse or dependence within the last 12 months. The variable ALCABDEP12DX takes the values 0 = No Dependence, 1 = Alcohol Abuse Only, 2 = Alcohol Dependence Only, 3 = Alcohol Abuse and Dependence. The Family category is 1 = Father, 2 = Mother, 3 = Brother, 4 = Sister. Given the p value results of the tests below, I reject the null hypothesis that family member and abuse/dependence status are independent. Because the post hoc analysis makes three pairwise comparisons against the no-dependence group, a Bonferroni-adjusted alpha of 0.05/3 ≈ 0.017 applies, and all three p values fall below it.
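The program implements the post hoc comparisons by hand with three recode/crosstab blocks. The same procedure can be sketched as a loop with an explicit Bonferroni adjustment (a sketch only, assuming sub1 has been built as in the program and pandas and scipy.stats are imported):
#post hoc: compare each abuse/dependence level against 0 = no dependence
levels = [1, 2, 3]
alpha = 0.05 / len(levels) #Bonferroni-adjusted alpha, about 0.017 for 3 comparisons
for lvl in levels:
    pair = sub1[sub1['ALCABDEP12DX'].isin([0, lvl])]
    ct = pandas.crosstab(pair['Family'], pair['ALCABDEP12DX'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print('0 vs', lvl, 'p =', p, '-> reject' if p < alpha else '-> fail to reject')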
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'], errors='coerce')
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'], errors='coerce')
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
sub1 = data[['Family', 'ALCABDEP12DX']].dropna()
# contingency table
ct1=pandas.crosstab(sub1['ALCABDEP12DX'], sub1['Family'])
print (ct1)
# percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
sub1["ALCABDEP12DX"] = sub1["ALCABDEP12DX"].astype('category')
# new code for setting variables to numeric:
sub1['Family'] = pandas.to_numeric(sub1['Family'], errors='coerce')
# graph (factorplot was renamed catplot in newer seaborn releases)
seaborn.factorplot(x="ALCABDEP12DX", y="Family", data=sub1, kind="bar", ci=None)
plt.xlabel('Dependence in last 12 months')
plt.ylabel('Family Member')
recode1 = {0: 0, 1: 1}
sub1['COMP1v1']= sub1['ALCABDEP12DX'].map(recode1)
# contingency table 1
ct1=pandas.crosstab(sub1['Family'], sub1['COMP1v1'])
print (ct1)
# percentages 1
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
recode2 = {0: 0, 2: 2}
sub1['COMP1v2']= sub1['ALCABDEP12DX'].map(recode2)
# contingency table 2
ct2=pandas.crosstab(sub1['Family'], sub1['COMP1v2'])
print (ct2)
# percentages 2
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
recode3 = {0: 0, 3: 3}
sub1['COMP1v3']= sub1['ALCABDEP12DX'].map(recode3)
# contingency table 3
ct3=pandas.crosstab(sub1['Family'], sub1['COMP1v3'])
print (ct3)
# percentages 3
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Result:
Family 1.0 2.0 3.0 4.0
ALCABDEP12DX
0 7118 904 3249 818
1 481 68 154 42
2 168 27 42 9
3 357 47 63 21
Family 1.0 2.0 3.0 4.0
ALCABDEP12DX
0 0.876169 0.864245 0.926169 0.919101
1 0.059207 0.065010 0.043900 0.047191
2 0.020679 0.025813 0.011973 0.010112
3 0.043944 0.044933 0.017959 0.023596
chi-square value, p value, expected counts
(90.82581721383129, 1.1115172651424374e-15, 9, array([[7238.43130896, 931.9792158 , 3125.60524764, 792.98422759],
[ 446.07753538, 57.43440448, 192.61939858, 48.86866156],
[ 147.29540094, 18.96491745, 63.60318396, 16.13649764],
[ 292.19575472, 37.62146226, 126.17216981, 32.01061321]]))
COMP1v1 0.0 1.0
Family
1.0 7118 481
2.0 904 68
3.0 3249 154
4.0 818 42
COMP1v1 0.0 1.0
Family
1.0 0.588800 0.645638
2.0 0.074779 0.091275
3.0 0.268757 0.206711
4.0 0.067665 0.056376
chi-square value, p value, expected counts
(17.873277986272203, 0.00046712719994895926, 3, array([[7157.88616176, 441.11383824],
[ 915.57643759, 56.42356241],
[3205.45948262, 197.54051738],
[ 810.07791803, 49.92208197]]))
COMP1v2 0.0 2.0
Family
1.0 7118 168
2.0 904 27
3.0 3249 42
4.0 818 9
COMP1v2 0.0 2.0
Family
1.0 0.588800 0.682927
2.0 0.074779 0.109756
3.0 0.268757 0.170732
4.0 0.067665 0.036585
chi-square value, p value, expected counts
(19.680798192577477, 0.00019765989591122253, 3, array([[7140.69347385, 145.30652615],
[ 912.4328334 , 18.5671666 ],
[3225.36676125, 65.63323875],
[ 810.5069315 , 16.4930685 ]]))
COMP1v3 0.0 3.0
Family
1.0 7118 357
2.0 904 47
3.0 3249 63
4.0 818 21
COMP1v3 0.0 3.0
Family
1.0 0.588800 0.731557
2.0 0.074779 0.096311
3.0 0.268757 0.129098
4.0 0.067665 0.043033
chi-square value, p value, expected counts
(57.96858821503683, 1.5963364821312778e-12, 3, array([[7184.9626302 , 290.0373698 ],
[ 914.10026238, 36.89973762],
[3183.49113461, 128.50886539],
[ 806.44597281, 32.55402719]]))
Text
Below is an analysis of variance for assignment 1. In this analysis I return to my original research question, family history and alcoholism. I chose Family, the categorical variable I created listing father, mother, brother, and sister, as the response, and CONSUMER, which states whether the person is a current drinker, an ex-drinker, or a lifetime abstainer, as the explanatory variable. Since the p value in the OLS output is very low, I can reject the null hypothesis. I also included the means and standard deviations by group. One caveat: the Tukey HSD comparison was run on columns that still contain missing values, which is why its table shows NaN mean differences; a corrected sketch follows the output.
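The OLS summary below reports the overall F test directly; the same test can also be pulled out as a compact ANOVA table (a small sketch, assuming results1 has been fit as in the code that follows):
import statsmodels.api as sm
#type 2 ANOVA table: sum of squares, F statistic and p value for C(CONSUMER)
print(sm.stats.anova_lm(results1, typ=2))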
Code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
model1 = smf.ols(formula='Family ~ C(CONSUMER)', data=data)
results1 = model1.fit()
print (results1.summary())
sub1 = data[['Family', 'CONSUMER']].dropna()
print ('means for Alcohol consumption')
mean= sub1.groupby('CONSUMER').mean()
print (mean)
print ('standard deviations for Alcohol consumption')
sd1 = sub1.groupby('CONSUMER').std()
print (sd1)
# note: these columns still contain NaNs, which is why the Tukey table in the
# output shows NaN mean differences; a corrected sketch follows the output
mc1 = multi.MultiComparison(data['Family'], data['CONSUMER'])
result = mc1.tukeyhsd()
print(result.summary())
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Family R-squared: 0.008
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 54.00
Date: Fri, 27 Nov 2020 Prob (F-statistic): 4.36e-24
Time: 18:32:36 Log-Likelihood: -19695.
No. Observations: 13568 AIC: 3.940e+04
Df Residuals: 13565 BIC: 3.942e+04
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.7350 0.011 158.687 0.000 1.714 1.756
C(CONSUMER)[T.2] 0.0963 0.022 4.328 0.000 0.053 0.140
C(CONSUMER)[T.3] 0.2706 0.027 10.125 0.000 0.218 0.323
==============================================================================
Omnibus: 3369.290 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1845.921
Skew: 0.771 Prob(JB): 0.00
Kurtosis: 2.057 Cond. No. 3.30
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for Alcohol consumption
Family
CONSUMER
1 1.734998
2 1.831338
3 2.005568
standard deviations for Alcohol consumption
Family
CONSUMER
1 1.014713
2 1.047685
3 1.099999
Multiple Comparison of Means - Tukey HSD, FWER=0.05
================================================
group1 group2 meandiff p-adj lower upper reject
------------------------------------------------
1 2 nan 0.5566 nan nan False
1 3 nan 0.5566 nan nan False
2 3 nan 0.5566 nan nan False
------------------------------------------------
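The NaN rows above appear because MultiComparison was handed the unfiltered columns, which still contain missing values. A corrected sketch using the NaN-free subset built in the program (not re-run here, so no output is shown):
#run Tukey HSD on the dropna subset so the mean differences are computable
mc1 = multi.MultiComparison(sub1['Family'], sub1['CONSUMER'])
result = mc1.tukeyhsd()
print(result.summary())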
Text
Below is the program where I did my data management. Each of my selected variables contains unknown codes, which are unusable for my research and add complexity the analysis does not need, so I removed all of the unknown codes from each of my selections. To further assist my research I created a new variable that combines the per-family-member flags. With this I was able to output the counts of parents and siblings who are alcoholics, and I also used the new variable to display the percentages of parents and siblings who are alcoholics.
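As a quick sanity check that the unknown codes really became missing values, the per-column NaN counts can be printed right after the replace() calls (a small sketch against the variables used below):
checks = ['S2DQ1', 'S2DQ2', 'S2DQ3C2', 'S2DQ4C2', 'S2DQ5C1', 'S2DQ6C1']
#each count should match the number of 9/99 unknown responses that were recoded
print(data[checks].isna().sum())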
PROGRAM:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#remove unknown data
data['S2DQ1']=data['S2DQ1'].replace(9, numpy.nan)
data['S2DQ2']=data['S2DQ2'].replace(9, numpy.nan)
data['S2DQ3C2']=data['S2DQ3C2'].replace(9, numpy.nan)
data['S2DQ4C2']=data['S2DQ4C2'].replace(9, numpy.nan)
data['S2DQ5C1']=data['S2DQ5C1'].replace(99, numpy.nan)
data['S2DQ6C1']=data['S2DQ6C1'].replace(99, numpy.nan)
#family member Alcoholic variable, categorical 1 through 4
def Family(row):
    #first match wins: father takes precedence over mother, then brother, then sister
    if row['S2DQ1'] == 1:
        return 1
    if row['S2DQ2'] == 1:
        return 2
    if row['S2DQ3C2'] == 1:
        return 3
    if row['S2DQ4C2'] == 1:
        return 4
    #no alcoholic relative: returns None, which becomes NaN in the column
data['Family'] = data.apply(Family, axis=1)
#number of alcoholic parents (recode the 1=yes/2=no flags to 1/0 before summing)
data['NUMPARENT'] = (data['S2DQ1'] == 1).astype(int) + (data['S2DQ2'] == 1).astype(int)
#counts and percentages with titles and drop NAN data
print ('counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)')
c1 = data['S2DQ1'].value_counts(sort=False, dropna=True)
print (c1)
print ('percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)')
p1 = data['S2DQ1'].value_counts(sort=False, normalize=True, dropna=True)
print (p1)
print ('counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)')
c2 = data['S2DQ2'].value_counts(sort=False, dropna=True)
print(c2)
print ('percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)')
p2 = data['S2DQ2'].value_counts(sort=False, normalize=True, dropna=True)
print (p2)
print ('counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)')
c3 = data['S2DQ3C2'].value_counts(sort=False, dropna=True)
print(c3)
print ('percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)')
p3 = data['S2DQ3C2'].value_counts(sort=False, normalize=True)
print (p3)
print ('counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)')
c4 = data['S2DQ4C2'].value_counts(sort=False, dropna=True)
print(c4)
print ('percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)')
p4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False, normalize=True) #dropna=False here, so the NaN share appears in the output below
print (p4)
print ('counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
c5 = data['S2DQ5C1'].value_counts(sort=True, dropna=True)
print(c5)
print ('percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
p5 = data['S2DQ5C1'].value_counts(sort=True, dropna=True, normalize=True)
print (p5)
print ('counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
c6 = data['S2DQ6C1'].value_counts(sort=True, dropna=True)
print(c6)
print ('percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
p6 = data['S2DQ6C1'].value_counts(sort=True, dropna=True, normalize=True)
print (p6)
#counts and percentage for family member variable created above
print ('Family Counts with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister')
c7 = data['Family'].value_counts(sort=True, dropna=True)
print (c7)
print ('Family Percentage with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister')
p7 = data['Family'].value_counts(sort=True, dropna=True, normalize=True)
print (p7)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
OUTPUT:
counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)
1.000000 8124
2.000000 32445
Name: S2DQ1, dtype: int64
percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No (S2DQ1)
1.000000 0.200251
2.000000 0.799749
Name: S2DQ1, dtype: float64
counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)
2.000000 39553
1.000000 2311
Name: S2DQ2, dtype: int64
percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ2)
2.000000 0.944797
1.000000 0.055203
Name: S2DQ2, dtype: float64
counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)
1.000000 6213
2.000000 35766
Name: S2DQ3C2, dtype: int64
percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No (S2DQ3C2)
1.000000 0.148003
2.000000 0.851997
Name: S2DQ3C2, dtype: float64
counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)
2.000000 39381
1.000000 2707
Name: S2DQ4C2, dtype: int64
percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No (S2DQ4C2)
2.000000 0.913861
1.000000 0.062818
nan 0.023322
Name: S2DQ4C2, dtype: float64
counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0.000000 40412
1.000000 1286
2.000000 406
3.000000 37
4.000000 11
6.000000 4
5.000000 2
7.000000 1
9.000000 1
Name: S2DQ5C1, dtype: int64
percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0.000000 0.958539
1.000000 0.030503
2.000000 0.009630
3.000000 0.000878
4.000000 0.000261
6.000000 0.000095
5.000000 0.000047
7.000000 0.000024
9.000000 0.000024
Name: S2DQ5C1, dtype: float64
counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0.000000 41347
1.000000 555
2.000000 316
3.000000 8
4.000000 4
9.000000 2
5.000000 1
Name: S2DQ6C1, dtype: int64
percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0.000000 0.979021
1.000000 0.013141
2.000000 0.007482
3.000000 0.000189
4.000000 0.000095
9.000000 0.000047
5.000000 0.000024
Name: S2DQ6C1, dtype: float64
Family Counts with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister
1.000000 8124
3.000000 3508
2.000000 1046
4.000000 890
Name: Family, dtype: int64
Family Percentage with Alcoholic 1=Father 2=Mother 3=Brother 4=Sister
1.000000 0.598762
3.000000 0.258550
2.000000 0.077093
4.000000 0.065596
Name: Family, dtype: float64
Text
Below is the program I wrote to start my research on family history and alcoholism. In this initial program I calculated both the counts and percentages of survey responses indicating an alcoholic or problem drinker, separated by natural mother, father, brother, and sister. I also produced counts and percentages for the number of sons and daughters having issues with alcohol. Since there is the possibility of missing data on the number of children, I display missing data by passing dropna=False.
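To illustrate what dropna=False changes, here is its behavior on a small hypothetical Series (made-up values, not NESARC data):
import pandas
import numpy
s = pandas.Series([1, 2, 2, numpy.nan])
print(s.value_counts())              #default dropna=True: the NaN is excluded
print(s.value_counts(dropna=False))  #NaN is shown as its own category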
Program:
# -*- coding: utf-8 -*-
"""
Spyder Editor
"""
#Justin K
#NESARC Family History of Alcoholism
import pandas
import numpy
data = pandas.read_csv('nesarc_pds.csv', low_memory = False)
#setting to numeric
data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'])
data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'])
data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'])
data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'])
data['S2DQ5C1'] = pandas.to_numeric(data['S2DQ5C1'])
data['S2DQ6C1'] = pandas.to_numeric(data['S2DQ6C1'])
#counts and percentages with titles and not dropping missing data
print ('counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)')
c1 = data['S2DQ1'].value_counts(sort=False, dropna=False)
print (c1)
print ('percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)')
p1 = data['S2DQ1'].value_counts(sort=False, normalize=True, dropna=False)
print (p1)
print ('counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)')
c2 = data['S2DQ2'].value_counts(sort=False, dropna=False)
print(c2)
print ('percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)')
p2 = data['S2DQ2'].value_counts(sort=False, normalize=True, dropna=False)
print (p2)
print ('counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)')
c3 = data['S2DQ3C2'].value_counts(sort=False, dropna=False)
print(c3)
print ('percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)')
p3 = data['S2DQ3C2'].value_counts(sort=False, normalize=True)
print (p3)
print ('counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)')
c4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False)
print(c4)
print ('percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)')
p4 = data['S2DQ4C2'].value_counts(sort=False, dropna=False, normalize=True)
print (p4)
print ('counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
c5 = data['S2DQ5C1'].value_counts(sort=False, dropna=False)
print(c5)
print ('percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)')
p5 = data['S2DQ5C1'].value_counts(sort=False, dropna=False, normalize=True)
print (p5)
print ('counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
c6 = data['S2DQ6C1'].value_counts(sort=False, dropna=False)
print(c6)
print ('percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)')
p6 = data['S2DQ6C1'].value_counts(sort=False, dropna=False, normalize=True)
print (p6)
#upper-case all DataFrame column names
data.columns = list(map(str.upper, data.columns))
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Output:
counts of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)
1 8124
2 32445
9 2524
Name: S2DQ1, dtype: int64
percentages of natural fathers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ1)
1 0.188522
2 0.752907
9 0.058571
Name: S2DQ1, dtype: float64
counts of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)
1 2311
2 39553
9 1229
Name: S2DQ2, dtype: int64
percentages of natural mothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ2)
1 0.053628
2 0.917852
9 0.028520
Name: S2DQ2, dtype: float64
counts of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)
1 6213
2 35766
9 1114
Name: S2DQ3C2, dtype: int64
percentages of natural brothers ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ3C2)
1 0.144177
2 0.829972
9 0.025851
Name: S2DQ3C2, dtype: float64
counts of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)
1 2707
2 39381
9 1005
Name: S2DQ4C2, dtype: int64
percentages of natural sisters ever alcoholic or problem drinker 1=yes 2=No 9=Unknown (S2DQ4C2)
1 0.062818
2 0.913861
9 0.023322
Name: S2DQ4C2, dtype: float64
counts of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0 40412
1 1286
2 406
3 37
99 933
4 11
5 2
6 4
7 1
9 1
Name: S2DQ5C1, dtype: int64
percentages of natural sons ever alcoholic or problem drinker (S2DQ5C1)
0 0.937786
1 0.029842
2 0.009421
3 0.000859
99 0.021651
4 0.000255
5 0.000046
6 0.000093
7 0.000023
9 0.000023
Name: S2DQ5C1, dtype: float64
counts of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0 41347
1 555
2 316
3 8
99 860
4 4
5 1
9 2
Name: S2DQ6C1, dtype: int64
percentages of natural daughters ever alcoholic or problem drinker (S2DQ6C1)
0 0.959483
1 0.012879
2 0.007333
3 0.000186
99 0.019957
4 0.000093
5 0.000023
9 0.000046
Name: S2DQ6C1, dtype: float64