kpenlearnspython-blog
Hopefully learning python
Lesson 3 Week 4
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
Hypothesis for this lesson:
Those who drink more than 30 drinks a month are more likely to have a documented family history of alcoholism.
Summary:
I used a binary response variable for whether the individual consumed more than 30 drinks/month (yes=1, no=0), with explanatory variables of family history of alcoholism (yes=1, no=0), sex, centered income, and centered age.
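For reference, the centering itself was done roughly like this; the income line matches the code in my Lesson 3 Week 2 post below, while the AGE and AGE_c lines are my assumption of how the centered age variable was built (AGE comes from 2002 minus DOBY, as in the moderators post):

subaa1["ADJPERSINCOME"] = subaa1["S1Q10A"] - subaa1["S1Q10A"].mean()  #income centered around the mean
subaa1["AGE"] = 2002 - subaa1["DOBY"]  #age at the time of the survey
subaa1["AGE_c"] = subaa1["AGE"] - subaa1["AGE"].mean()  #age centered around the mean (my assumed construction)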
After adjusting for these factors, the odds of drinking more than 30 drinks/month were approximately the same for those with an alcoholic family history as for those without (OR=1.01, 95% CI: 1.000989-1.018870). This was technically statistically significant, at p=0.029, but it only became significant when the additional variables were factored in (considering only family history of alcoholism, the p-value is 0.964).
Income, sex, and age were also statistically significant, with p-values of approximately 0, but looking at the odds ratios, the differences between groups appear fairly slight. Income has an OR of essentially 1, sex an OR of 0.934 with a 95% CI of 0.928974-0.939662 (so the probability of drinking more than 30 drinks/month is lower in men than in women), and age also has an OR of roughly 1.
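Side note: the summaries below are from smf.ols (a linear probability model), and the odds ratios shown are exponentiated OLS coefficients. A true logistic regression could be fit with smf.logit instead. This is only a sketch of that alternative, using the same column names; it is not what I ran:

import numpy as np
import statsmodels.formula.api as smf

lreg1 = smf.logit("DAYDRINKING ~ ADJPERSINCOME + SEX + AGE_c + AAFAM", data=subaa1).fit()
print(lreg1.summary())
print(np.exp(lreg1.params))  #odds ratios
print(np.exp(lreg1.conf_int()))  #95% confidence intervals for the odds ratios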
Do the results support the hypothesis?
Technically, the data show a statistically significant difference, and that supports my hypothesis. However, they also show that the difference is very small.
Evidence of confounding?
Somewhat. Family history of alcoholism only has a p-value below 0.05 when sex and age are factored in; without them, it is not significant. That's more like the opposite of confounding, but I still think it's interesting. It makes me wonder whether the effect is real.
Regression output:
                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     165.2
Date:                Fri, 22 Mar 2019   Prob (F-statistic):          3.35e-140
Time:                        16:39:20   Log-Likelihood:                 138.17
No. Observations:               31250   AIC:                            -266.3
Df Residuals:                   31245   BIC:                            -224.6
Df Model:                           4
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.1062      0.002     45.255      0.000       0.102       0.111
ADJPERSINCOME  2.076e-07   3.62e-08      5.730      0.000    1.37e-07    2.79e-07
SEX              -0.0680      0.003    -23.287      0.000      -0.074      -0.062
AGE_c             0.0004   7.22e-05      4.858      0.000       0.000       0.000
AAFAM             0.0098      0.005      2.179      0.029       0.001       0.019
==============================================================================
Omnibus:                    21705.905   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           204857.060
Skew:                           3.476   Prob(JB):                         0.00
Kurtosis:                      13.441   Cond. No.                     1.28e+05
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.28e+05. This might indicate that there are strong multicollinearity or other numerical problems.

Odds Ratio
Intercept       1.112039
ADJPERSINCOME   1.000000
SEX             0.934303
AGE_c           1.000351
AAFAM           1.009890
dtype: float64

               Lower CI  Upper CI        OR
Intercept      1.106936  1.117165  1.112039
ADJPERSINCOME  1.000000  1.000000  1.000000
SEX            0.928974  0.939662  0.934303
AGE_c          1.000209  1.000492  1.000351
AAFAM          1.000989  1.018870  1.009890

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     209.1
Date:                Fri, 22 Mar 2019   Prob (F-statistic):          2.70e-134
Time:                        16:39:20   Log-Likelihood:                 121.76
No. Observations:               31250   AIC:                            -235.5
Df Residuals:                   31246   BIC:                            -202.1
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1082      0.002     46.653      0.000       0.104       0.113
AAFAM          0.0099      0.005      2.200      0.028       0.001       0.019
SEX           -0.0711      0.003    -24.823      0.000      -0.077      -0.066
AGE_c          0.0004   7.22e-05      5.033      0.000       0.000       0.001
==============================================================================
Omnibus:                    21737.572   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           205558.677
Skew:                           3.482   Prob(JB):                         0.00
Kurtosis:                      13.458   Cond. No.                         63.4
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     5.454
Date:                Fri, 22 Mar 2019   Prob (F-statistic):            0.00428
Time:                        16:39:20   Log-Likelihood:                -183.37
No. Observations:               31250   AIC:                             372.7
Df Residuals:                   31247   BIC:                             397.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0631      0.001     43.366      0.000       0.060       0.066
AGE_c          0.0002   7.27e-05      3.302      0.001    9.76e-05       0.000
AAFAM          0.0015      0.005      0.333      0.739      -0.007       0.010
==============================================================================
Omnibus:                    22290.028   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           220736.719
Skew:                           3.586   Prob(JB):                         0.00
Kurtosis:                      13.867   Cond. No.                         63.4
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                  0.002066
Date:                Fri, 22 Mar 2019   Prob (F-statistic):              0.964
Time:                        16:39:20   Log-Likelihood:                -188.82
No. Observations:               31250   AIC:                             381.6
Df Residuals:                   31248   BIC:                             398.3
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0633      0.001     43.512      0.000       0.060       0.066
AAFAM         -0.0002      0.005     -0.045      0.964      -0.009       0.009
==============================================================================
Omnibus:                    22300.061   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           221025.956
Skew:                           3.588   Prob(JB):                         0.00
Kurtosis:                      13.874   Cond. No.                         3.32
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Raw code:
parents = np.dstack((subaa1["S2DQ1"], subaa1["S2DQ2"])) #making variables of interest into a single array
print(parents.shape) #double checking
parents2 = np.nansum(parents, 2) #adding the two elements together on the correct dimension
print(parents2.T.shape) #checking that if I transpose it again it's now in a column once more
subaa1["AAFAMPAR2"] = parents2.T #adding the column into the dataset as a new variable
extfam = np.dstack((subaa1["S2DQ7C2"], subaa1["S2DQ8C2"], subaa1["S2DQ9C2"], subaa1["S2DQ10C2"],
                    subaa1["S2DQ11"], subaa1["S2DQ12"], subaa1["S2DQ13A"], subaa1["S2DQ13B"]))
print(extfam.shape)
extfam2 = np.nansum(extfam, 2)
print(extfam2.shape)
subaa1["AAFAMEXT2"] = extfam2.T

def AAFAM(row):
    if row["AAFAMPAR2"] >= 1 and row["AAFAMEXT2"] >= 1:
        return 1 #alcoholic family
    else:
        return 0 #no known alcoholic family history

subaa1["AAFAM"] = subaa1.apply(lambda row: AAFAM(row), axis=1)

def DAYDRINKING(row):
    if row["DRINKMO"] > 30:
        return 1 #drinks approximately every day or more
    else:
        return 0 #drinks less than every day

subaa1["DAYDRINKING"] = subaa1.apply(lambda row: DAYDRINKING(row), axis=1)

reg1 = smf.ols("DAYDRINKING ~ ADJPERSINCOME + SEX + AGE_c + AAFAM", data=subaa1).fit()
print(reg1.summary())
print("Odds Ratio")
print(np.exp(reg1.params))
#Can also get a confidence interval
params1 = reg1.params
conf1 = reg1.conf_int()
conf1["OR"] = params1
conf1.columns = ["Lower CI", "Upper CI", "OR"]
#The confidence intervals for the explanatory variables overlap. Cannot tell which one is more strongly associated.
print(np.exp(conf1))

reg2 = smf.ols("DAYDRINKING ~ AAFAM + SEX + AGE_c", data=subaa1).fit()
print(reg2.summary()) #AAFAM is still significant without income (p=0.028)

reg3 = smf.ols("DAYDRINKING ~ AGE_c + AAFAM", data=subaa1).fit()
print(reg3.summary()) #SEX confounds income and alcoholic family; AAFAM is no longer significant here (p=0.739)

reg4 = smf.ols("DAYDRINKING ~ AAFAM", data=subaa1).fit()
print(reg4.summary()) #AAFAM by itself is not significant (p=0.964)
Lesson 3 Week 3
Full disclosure, I’m sick this week, and I don’t have the energy to make this data look any better. In summary, as best I can tell, these things don’t correlate very well. If you have any advice for something obvious I did wrong that resulted in such poor fit, I’d appreciate the feedback.
Second disclosure: I just watched the first video for next week and learned that since ALCCHOICE is a categorical variable with more than two levels, the way I treated it here was incorrect. I missed this in the Week 3 lectures. So, while everything executed okay, the analysis itself was done incorrectly.
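For what it's worth, here is a minimal sketch of how I think the corrected version should look, wrapping ALCCHOICE in C() so the formula dummy-codes it rather than treating it as a number (this uses the subaa1_2 subset built in the raw code at the bottom of this post; I have not re-run the analysis this way):

modeltest3_cat = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c + C(ALCCHOICE)", data=subaa1_2).fit()
print(modeltest3_cat.summary())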
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
“DRINKMO” is my number of drinks/month category. subaa1 has already been segmented from the primary data to contain only those who have no reported alcohol abuse or dependence.
I decided to just stick with three explanatory variables (two quantitative, one categorical) to see what influence they would have. I initially tried to factor in more, but it was too much data and the last graph wouldn't run.
Summary:
I wanted to look at how the explanatory variables total household income, age, and alcohol choice (non-hard, hard, both, non-drinking) would affect the response variable of number of drinks consumed per month. Total household income was my primary explanatory variable.
When I initially looked only at the relationship between total household income centered around the mean and drinks/month, I got a very low p-value (~0) but a really small coefficient of 3.482e-05. The R-squared value is also very low at 0.005. So, while these things are correlated, the relationship doesn't seem very strong. I did try to adjust this by introducing a quadratic term, but it didn't really improve the model much.
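For clarity, the quadratic version looked roughly like this (a sketch of the approach, using I() so the squaring happens inside the formula; not necessarily my exact code):

modelquad = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + I(ADJPERSINCOME**2)", data=subaa1_2).fit()
print(modelquad.summary())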
I next incorporated age centered around the mean to see if it would change anything. With this, the p-values for both age and income were ~0. Age had a somewhat larger coefficient at -0.0272, while income remained low at 3.506e-05. These two variables do not appear to affect each other much, and the R-squared value remained the same.
When I added the additional variable of alcohol choice, age was no longer significant (suggesting that alcohol choice may be a confounding variable for age). While alcohol choice and total household income still had p-values of ~0, the age p-value changed to 0.451. The alcohol choice coefficient was -2.9505, making it the explanatory variable with what looks like the strongest influence. The income coefficient remained low at 2.659e-05. The R-squared improved slightly, to 0.044, so the model now explains about 4% of the variability.
Did the results support the hypothesis between the primary explanatory and response variable?
I have not found a confounding variable for the association between income and drinks/month, but the coefficients are very weak, and the Q-Q plot shows that the residuals are clearly not following a normal distribution. The standardized residuals also show that MANY values fall farther from the mean than expected.
Evidence of confounding?
There is evidence that the type of alcohol consumed acts as a confounding variable for age, as the p-value for age increases once the type of alcohol consumed is considered.
Regression Diagnostic plots:
[Figure: Q-Q plot of residuals, full model]
My QQ-plot with all of the variables considered clearly shows deviation at the higher theoretical quantiles. I tried switching things to quadratics, and that didn’t help. But clearly, there are other variables at play that we’re not accounting for here. This is sort of the general theme for all of this data...
[Figure: standardized residuals plot]
As for the standardized residuals, this plot is supposed to tell you whether the residuals fall within the expected number of standard deviations from the mean. It's pretty clear that more observations than acceptable fall outside an absolute value of 2.5. The current model isn't great, and I'm probably missing important variables or confounding variables.
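As a quick numerical check (not part of my original code, just a sketch), the share of observations beyond +/-2.5 standard deviations can be computed directly from resid_pearson on the full model (modeltest3 in the raw code at the end of this post); under normality only about 1% should fall that far out:

import numpy as np

stdres_full = modeltest3.resid_pearson  #standardized residuals from the full model
frac_extreme = np.mean(np.abs(stdres_full) > 2.5)  #proportion of observations beyond +/-2.5 standard deviations
print(round(100 * frac_extreme, 1), "% of standardized residuals exceed +/-2.5")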
[Figure: regression diagnostic plot]
This is my regression diagnostic for income. It's hard to get much from this since the fit is poor; the relationship does not look linear.
[Figure: regression diagnostic plot]
Once again, this has the same issue with really high residuals. Adding this variable to the rest of the data doesn’t seem to make it a linear relationship (though it does look better than the previous plot). Once again, I’m clearly missing additional explanatory variables.
[Figure: regression diagnostic plot]
There are some very high residuals. It almost looks like a linear relationship, but this is the one variable I know isn't statistically significant.
[Figure: influence plot]
This is sort of the icing on the cake in terms of demonstrating that this model is bad. There are so many things considered outliers that it’s really hard to evaluate anything.
Multiple regression output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     148.6
Date:                Sun, 17 Mar 2019   Prob (F-statistic):           4.15e-34
Time:                        12:01:35   Log-Likelihood:            -1.3601e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31050   BIC:                         2.720e+05
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.094      0.000       6.373       6.803
ADJPERSINCOME  3.482e-05   2.86e-06     12.191      0.000    2.92e-05    4.04e-05
==============================================================================
Omnibus:                    40363.894   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10592652.775
Skew:                           7.209   Prob(JB):                         0.00
Kurtosis:                      92.326   Cond. No.                     3.84e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     85.52
Date:                Sun, 17 Mar 2019   Prob (F-statistic):           9.11e-38
Time:                        12:05:05   Log-Likelihood:            -1.3600e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31049   BIC:                         2.720e+05
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.116      0.000       6.374       6.803
ADJPERSINCOME  3.506e-05   2.86e-06     12.278      0.000    2.95e-05    4.07e-05
AGE_c            -0.0272      0.006     -4.725      0.000      -0.038      -0.016
==============================================================================
Omnibus:                    40351.209   Durbin-Watson:                   1.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10568506.250
Skew:                           7.206   Prob(JB):                         0.00
Kurtosis:                      92.223   Cond. No.                     3.84e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     476.4
Date:                Sun, 17 Mar 2019   Prob (F-statistic):          1.12e-302
Time:                        12:07:12   Log-Likelihood:            -1.3538e+05
No. Observations:               31052   AIC:                         2.708e+05
Df Residuals:                   31048   BIC:                         2.708e+05
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        11.8562      0.184     64.563      0.000      11.496      12.216
ADJPERSINCOME  2.659e-05   2.81e-06      9.463      0.000    2.11e-05    3.21e-05
AGE_c             0.0043      0.006      0.753      0.451      -0.007       0.015
ALCCHOICE        -2.9505      0.083    -35.374      0.000      -3.114      -2.787
==============================================================================
Omnibus:                    40770.198   Durbin-Watson:                   1.991
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         11585537.749
Skew:                           7.316   Prob(JB):                         0.00
Kurtosis:                      96.490   Cond. No.                     7.02e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 7.02e+04. This might indicate that there are strong multicollinearity or other numerical problems.
All raw code:
subaa1_2=subaa1[["DRINKMO","ADJPERSINCOME","SEX","ETH", "AGE_c", "AAFAM", "ALCCHOICE"]].dropna()
modeltest1 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa1_2).fit()
print(modeltest1.summary())
fig1 = sm.qqplot(modeltest1.resid, line='r') #There is really high deviation at the upper end, somewhat at the lower end
stdres1 = pd.DataFrame(modeltest1.resid_pearson) #convert the array of standardized residuals from modeltest1 to a dataframe
fig1_1 = plt.plot(stdres1, 'o', ls='None') #plot the standardized residuals. 'o' uses dots; ls='None' tells python not to connect the markers
l = plt.axhline(y=0, color='r') #draws a horizontal line on the graph
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig1_1)
#Tons are falling out of the residual plot, so the current model is unacceptable. Does adding more help?

modeltest2 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c", data=subaa1_2).fit()
print(modeltest2.summary())
fig2 = sm.qqplot(modeltest2.resid, line='r')
stdres2 = pd.DataFrame(modeltest2.resid_pearson) #standardized residuals from modeltest2
fig2_1 = plt.plot(stdres2, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig2_1)

modeltest3 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c + ALCCHOICE", data=subaa1_2).fit()
print(modeltest3.summary())
fig3 = sm.qqplot(modeltest3.resid, line='r')
stdres3 = pd.DataFrame(modeltest3.resid_pearson) #standardized residuals from modeltest3
fig3_1 = plt.plot(stdres3, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig3_1)
#when I add ALCCHOICE in, age is no longer relevant.
fig3_2 = plt.figure(figsize=(12,8)) #numbers specify the size of the plot image in inches
fig3_2 = sm.graphics.plot_regress_exog(modeltest3, "AGE_c", fig=fig3_2) #pass the fitted model and the explanatory variable to plot
print(fig3_2)

fig3_3 = sm.graphics.influence_plot(modeltest3, size=8)
print(fig3_3)
Lesson 3 Week 2
Truthfully, I’m not sure if I understood this assignment since I thought we were going to learn how to deal with confounding variables, but this seems to just be asking for general linear regression. That I can do, but I feel like I’m missing something?
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
One confounding variable that I could see affecting number of drinks consumed per month vs family history is personal income. I wanted to look at the linear regression of number of drinks/month (response variable) vs the personal income (explanatory variable).
subaa1 is my main dataset (in previous posts) where I have filtered out people with a history of alcohol abuse/dependence.
I centered the personal income data around the mean.
subaa1["ADJPERSINCOME"]=(subaa1["S1Q10A"]-subaa1["S1Q10A"].mean())
I then just narrowed my dataset to the DRINKMO data (which I have previously derived as the average # of drinks per month for each participant) and ADJPERSINCOME. I made sure everything was numeric.
subaa14=subaa1[["DRINKMO","ADJPERSINCOME"]].dropna()
subaa14["DRINKMO"]=pd.to_numeric(subaa14["DRINKMO"])
subaa14["ADJPERSINCOME"]=pd.to_numeric(subaa14["ADJPERSINCOME"])
Then I ran the regression analysis.
modeldrinkincome = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa14).fit()
print(modeldrinkincome.summary())
sb.regplot(y="DRINKMO", x="ADJPERSINCOME", data=subaa14)
plt.xlabel("Income Adjusted Around Mean")
plt.ylabel("Number of drinks/month")
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     148.6
Date:                Sun, 10 Mar 2019   Prob (F-statistic):           4.15e-34
Time:                        12:18:56   Log-Likelihood:            -1.3601e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31050   BIC:                         2.720e+05
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.094      0.000       6.373       6.803
ADJPERSINCOME  3.482e-05   2.86e-06     12.191      0.000    2.92e-05    4.04e-05
==============================================================================
Omnibus:                    40363.894   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10592652.775
Skew:                           7.209   Prob(JB):                         0.00
Kurtosis:                      92.326   Cond. No.                     3.84e+04
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
[Figure: regression plot of drinks/month vs mean-centered income]
The overall p-value is 4.15e-34, so the model is statistically significant. The regression coefficient is 3.482e-5, which is the slope of the line. The p-value for the explanatory variable is reported as 0 (probably not exactly, but that's the closest it could give me), so it is well below 0.05. So, while income does seem to have a positive relationship with number of drinks/month, it's very weak. This is further emphasized by the R-squared value of 0.005, which is so far from 1 that it indicates the relationship just isn't very strong.
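To put that coefficient in perspective (my own back-of-the-envelope arithmetic): a slope of 3.482e-05 drinks per dollar means that a $10,000 difference in income predicts only about 3.482e-05 × 10,000 ≈ 0.35 additional drinks per month.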
I tried to adjust this data down to a certain income range and only for people that drink, as I worried that the higher values weren’t truly representative and were skewing the data. 
subaa14_1 = subaa14[(subaa14["ADJPERSINCOME"]<=50000) & (subaa14["DRINKMO"]>=1)]
subaa14_1["ADJPERSINCOME"].max()
subaa14_1["DRINKMO"].min()
modeldrinkincome = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa14_1).fit()
print(modeldrinkincome.summary())
sb.regplot(y="DRINKMO", x="ADJPERSINCOME", data=subaa14_1)
plt.xlabel("Income Adjusted Around Mean")
plt.ylabel("Number of drinks/month")
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.8429
Date:                Sun, 10 Mar 2019   Prob (F-statistic):              0.359
Time:                        12:33:21   Log-Likelihood:                -49018.
No. Observations:               10208   AIC:                         9.804e+04
Df Residuals:                   10206   BIC:                         9.805e+04
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        18.3495      0.292     62.930      0.000      17.778      18.921
ADJPERSINCOME -1.445e-05   1.57e-05     -0.918      0.359   -4.53e-05    1.64e-05
==============================================================================
Omnibus:                    10113.534   Durbin-Watson:                   1.947
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           701087.094
Skew:                           4.793   Prob(JB):                         0.00
Kurtosis:                      42.451   Cond. No.                     1.85e+04
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.85e+04. This might indicate that there are strong multicollinearity or other numerical problems.
[Figure: regression plot of drinks/month vs mean-centered income, restricted sample]
With this adjustment, the p-value is now 0.359, so it is not below 0.05. The slope is now negative, and in general there really seems to be no strong correlation, consistent with how weak the relationship looked before.
Based upon this data, I’m going to assume that income isn’t too much of a confounding variable from my initial dataset.
Writing About Your Data
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
1) My sample:
The NESARC data set was collected in 2001-2002 from people in the US who were 18 years or older and non-institutionalized. There were 43,093 respondents.
Specifically, I’m looking at a subset of this data for people with no history of alcohol abuse or dependence. 
The level of analysis is the group level (either those with vs. without a family history of alcohol abuse/dependence, or subcategories of family history of alcoholism).
For those with no history of alcohol abuse or dependence, there are 31250 observations.
2) The data collection procedure:
The NESARC data was gathered by a survey.
The data were collected in diagnostic interviews, with questions typically asked over about one hour. Diagnoses of disorders were made according to the DSM-IV. The survey did not allow skip-outs of questions, even when a subject had already answered enough questions to confirm a diagnosis. It asked about symptoms in the last 12 months or prior to the last 12 months to identify instances of full or partial remission. There was also a test-retest component, in which a subset (~400) of the respondents were asked to re-do the survey with a different interviewer.
The survey was designed this way because no one had done such an extensive survey before. In addition to the larger sample size, many previous studies had diagnosed people with certain disorders using outdated or questionable methods.
The data were collected from 2001-2002.
The respondents were located in the United States (including Alaska and Hawaii) and the District of Columbia.
3) My variables and how I managed them to address my research question:
My explanatory variable is family history of alcoholism. My response variable is how many drinks/month.
In terms of response scales, most of my variables are quantitative (running counts), yes/no (1 or 0), or categorical.
I’ve used “How often have you drank alcohol in the last 12 months” (S2AQ8A) and the “Number of drinks on days when drinking” (S2AQ8B) to make a drinks/month (DRINKMO) category to get an idea of how much alcohol is actually consumed.
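Schematically, the recode looked something like the sketch below. Note that the frequency-to-days mapping here is a made-up placeholder for illustration and not the actual NESARC codebook values:

#placeholder mapping from S2AQ8A frequency codes to approximate drinking days per month
#(illustrative values only; the real recode follows the NESARC codebook)
freq_to_days = {1: 30, 2: 25, 3: 15, 4: 8, 5: 4, 6: 2, 7: 1, 8: 0.5, 9: 0.25, 10: 0.1}
drink_days = pd.to_numeric(subaa1["S2AQ8A"], errors="coerce").map(freq_to_days)
subaa1["DRINKMO"] = drink_days * pd.to_numeric(subaa1["S2AQ8B"], errors="coerce")  #drinking days/month * drinks per drinking day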
I also eliminated people who have been previously diagnosed with alcohol abuse or dependence from my data set.
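The exclusion step was schematically something like the line below, where "nesarc", "ALCABUSE_DX", and "ALCDEP_DX" are hypothetical stand-ins for the actual dataframe and diagnosis variable names:

#hypothetical names -- stand-ins for the real NESARC abuse/dependence diagnosis columns
subaa1 = nesarc[(nesarc["ALCABUSE_DX"] == 0) & (nesarc["ALCDEP_DX"] == 0)].copy()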
I have tried to subset family relationships into parents and more distant relatives (grandparents, uncles, aunts), or some combination. I think it might be advisable to generalize this category a bit more in the future or do some different type of subsetting, but I haven’t decided how to do it yet.
I tried to look at how age affected these results in the previous lesson by dividing age into categories of younger than 25, between 25 and 50, and older than 50. This did not have much effect on the data.
It might also be interesting to divide my data into people who drink hard alcohol (some sort of liquor) vs. people who drink things with a lower alcohol content (like beer and wine). Obviously, these drinks affect you differently, so it would be interesting to see whether my trends from the general data set hold there (from my previous data, there really isn't a big difference between those with no family history of alcoholism and those with a family history, EXCEPT in the case of two alcoholic parents, though even there the error bar seems suspiciously large).
Other potential confounding factors could be ethnicity, income, and sex.
Week 4: Moderators
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
I want to see if age group could act as a moderator for the relationship between family history of alcoholism and drinks/month.
subaa1 is a subset of the data that I’ve previously made that contains only those who have not displayed any history of alcohol abuse or dependence.
All code will be in block text. Code is italicized, while the output is not.
subaa1["AGE"]=2002-subaa1["DOBY"]
subaa1["AGE"].head(n=10)
subaa13=subaa1[["DRINKMO", "AAFAM2", "AGE"]].dropna()
subaa13.head(n=25)
subaa1_25=subaa13[(subaa13["AGE"]<=25)]
subaa1_50=subaa13[(subaa13["AGE"]>25) & (subaa13["AGE"]<=50)]
subaa1_old=subaa13[(subaa13["AGE"]>50)]

#25 and under
print("association between family history of alcoholism and drinks/month for those 25 and under")
aafammod25 = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_25).fit()
print(aafammod25.summary())

mc_aafam25 = multi.MultiComparison(subaa1_25["DRINKMO"], subaa1_25["AAFAM2"])
res_aafam25 = mc_aafam25.tukeyhsd() #Request the test
print(res_aafam25.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_25, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_25.groupby("AAFAM2").mean())
print(subaa1_25.groupby("AAFAM2").std())
#26-50
print("association between family history of alcoholism and drinks/month for those from 26-50")
aafammod50 = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_50).fit()
print(aafammod50.summary())

mc_aafam50 = multi.MultiComparison(subaa1_50["DRINKMO"], subaa1_50["AAFAM2"])
res_aafam50 = mc_aafam50.tukeyhsd() #Request the test
print(res_aafam50.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_50, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_50.groupby("AAFAM2").mean())
print(subaa1_50.groupby("AAFAM2").std())
#50+
print("association between family history of alcoholism and drinks/month for those over 50")
aafammodold = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_old).fit()
print(aafammodold.summary())

mc_aafamold = multi.MultiComparison(subaa1_old["DRINKMO"], subaa1_old["AAFAM2"])
res_aafamold = mc_aafamold.tukeyhsd() #Request the test
print(res_aafamold.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_old, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_old.groupby("AAFAM2").mean())
print(subaa1_old.groupby("AAFAM2").std())
association between family history of alcoholism and drinks/month for those 25 and under

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     2.115
Date:                Thu, 21 Feb 2019   Prob (F-statistic):             0.0485
Time:                        21:51:49   Log-Likelihood:                -16800.
No. Observations:                3724   AIC:                         3.361e+04
Df Residuals:                    3717   BIC:                         3.366e+04
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      7.0355      1.604      4.387      0.000       3.891      10.180
C(AAFAM2)[T.2 Par]             2.8673      5.072      0.565      0.572      -7.077      12.811
C(AAFAM2)[T.1 ExtRel]         -1.7358      1.858     -0.934      0.350      -5.379       1.908
C(AAFAM2)[T.>1 ExtRel]         0.2085      1.916      0.109      0.913      -3.547       3.964
C(AAFAM2)[T.1 Par, ExtRel]     2.4260      1.919      1.264      0.206      -1.336       6.188
C(AAFAM2)[T.2 Par, ExtRel]     1.4316      2.995      0.478      0.633      -4.440       7.303
C(AAFAM2)[T.None known]       -1.1260      1.678     -0.671      0.502      -4.415       2.163
==============================================================================
Omnibus:                     5663.796   Durbin-Watson:                   1.950
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3077895.731
Skew:                           9.325   Prob(JB):                         0.00
Kurtosis:                     142.600   Cond. No.                         17.6
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            1.7358  -3.7459   7.2174  False
1 ExtRel        1 Par, ExtRel    4.1617  -0.0004   8.3239  False
1 ExtRel        2 Par            4.603   -9.8582  19.0642  False
1 ExtRel        2 Par, ExtRel    3.1674  -4.7907  11.1255  False
1 ExtRel        >1 ExtRel        1.9442  -2.2048   6.0933  False
1 ExtRel        None known       0.6098  -2.5166   3.7362  False
1 Par           1 Par, ExtRel    2.426   -3.2347   8.0867  False
1 Par           2 Par            2.8673 -12.0942  17.8288  False
1 Par           2 Par, ExtRel    1.4316  -7.4031  10.2663  False
1 Par           >1 ExtRel        0.2085  -5.4426   5.8596  False
1 Par           None known      -1.126   -6.0752   3.8233  False
1 Par, ExtRel   2 Par            0.4413 -14.0887  14.9713  False
1 Par, ExtRel   2 Par, ExtRel   -0.9944  -9.0768   7.0881  False
1 Par, ExtRel   >1 ExtRel       -2.2175  -6.6003   2.1653  False
1 Par, ExtRel   None known      -3.5519  -6.9826  -0.1213  True
2 Par           2 Par, ExtRel   -1.4357 -17.4709  14.5996  False
2 Par           >1 ExtRel       -2.6588 -17.1851  11.8675  False
2 Par           None known      -3.9932 -18.2611  10.2746  False
2 Par, ExtRel   >1 ExtRel       -1.2231  -9.2988   6.8526  False
2 Par, ExtRel   None known      -2.5576 -10.1587   5.0436  False
>1 ExtRel       None known      -1.3344  -4.7491   2.0803  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (25 and under):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               7.035494    18.178809  22.142857  2.157548
2 Par               9.902778    15.561367  21.523810  1.721019
1 ExtRel            5.299743    16.643207  22.070652  1.956999
>1 ExtRel           7.243980    24.958553  21.936795  1.949725
1 Par, ExtRel       9.461473    31.667958  22.086758  1.995821
2 Par, ExtRel       8.467105    19.649860  22.131579  2.022158
None known          5.909539    20.534125  21.992020  2.010932

association between family history of alcoholism and drinks/month for those from 26-50

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     3.404
Date:                Thu, 21 Feb 2019   Prob (F-statistic):            0.00234
Time:                        21:51:50   Log-Likelihood:                -61113.
No. Observations:               13914   AIC:                         1.222e+05
Df Residuals:                   13907   BIC:                         1.223e+05
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      8.3744      0.672     12.453      0.000       7.056       9.693
C(AAFAM2)[T.2 Par]             6.3325      2.214      2.860      0.004       1.993      10.672
C(AAFAM2)[T.1 ExtRel]         -0.9536      0.808     -1.180      0.238      -2.537       0.630
C(AAFAM2)[T.>1 ExtRel]        -1.9459      0.870     -2.237      0.025      -3.651      -0.241
C(AAFAM2)[T.1 Par, ExtRel]    -1.6592      0.838     -1.980      0.048      -3.302      -0.016
C(AAFAM2)[T.2 Par, ExtRel]    -1.7858      1.472     -1.213      0.225      -4.672       1.100
C(AAFAM2)[T.None known]       -1.5310      0.707     -2.166      0.030      -2.917      -0.145
==============================================================================
Omnibus:                    17507.488   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3536212.632
Skew:                           6.858   Prob(JB):                         0.00
Kurtosis:                      79.886   Cond. No.                         16.8
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            0.9536  -1.4286   3.3358  False
1 ExtRel        1 Par, ExtRel   -0.7055  -2.6847   1.2736  False
1 ExtRel        2 Par            7.2861   0.9281  13.6441  True
1 ExtRel        2 Par, ExtRel   -0.8322  -4.9139   3.2495  False
1 ExtRel        >1 ExtRel       -0.9923  -3.0868   1.1022  False
1 ExtRel        None known      -0.5774  -2.0455   0.8906  False
1 Par           1 Par, ExtRel   -1.6592  -4.1303   0.812   False
1 Par           2 Par            6.3325  -0.1955  12.8604  False
1 Par           2 Par, ExtRel   -1.7858  -6.1275   2.5558  False
1 Par           >1 ExtRel       -1.9459  -4.5104   0.6186  False
1 Par           None known      -1.531   -3.6155   0.5534  False
1 Par, ExtRel   2 Par            7.9916   1.5997  14.3835  True
1 Par, ExtRel   2 Par, ExtRel   -0.1267  -4.2609   4.0076  False
1 Par, ExtRel   >1 ExtRel       -0.2867  -2.4819   1.9084  False
1 Par, ExtRel   None known       0.1281  -1.4803   1.7365  False
2 Par           2 Par, ExtRel   -8.1183 -15.4395  -0.7971  True
2 Par           >1 ExtRel       -8.2784 -14.7069  -1.8498  True
2 Par           None known      -7.8635 -14.1161  -1.611   True
2 Par, ExtRel   >1 ExtRel       -0.16    -4.3508   4.0307  False
2 Par, ExtRel   None known       0.2548  -3.6606   4.1702  False
>1 ExtRel       None known       0.4148  -1.3336   2.1633  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (26-50):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               8.374409    24.112929  38.841608  6.921776
2 Par              14.706880    41.859383  38.313953  7.081478
1 ExtRel            7.420792    21.230342  37.931378  7.083139
>1 ExtRel           6.428524    15.040908  37.107313  7.103761
1 Par, ExtRel       6.715251    19.075674  37.884314  6.972548
2 Par, ExtRel       6.588565    20.678330  38.354260  6.942172
None known          6.843365    18.917645  38.387373  6.847600

association between family history of alcoholism and drinks/month for those over 50

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     3.216
Date:                Thu, 21 Feb 2019   Prob (F-statistic):            0.00371
Time:                        21:51:50   Log-Likelihood:                -57808.
No. Observations:               13358   AIC:                         1.156e+05
Df Residuals:                   13351   BIC:                         1.157e+05
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      7.6742      0.638     12.021      0.000       6.423       8.926
C(AAFAM2)[T.2 Par]             2.6966      2.471      1.091      0.275      -2.147       7.540
C(AAFAM2)[T.1 ExtRel]         -0.8301      0.777     -1.068      0.286      -2.354       0.694
C(AAFAM2)[T.>1 ExtRel]        -1.7252      0.940     -1.836      0.066      -3.567       0.117
C(AAFAM2)[T.1 Par, ExtRel]    -0.8896      0.897     -0.992      0.321      -2.648       0.869
C(AAFAM2)[T.2 Par, ExtRel]     2.0869      1.869      1.117      0.264      -1.576       5.750
C(AAFAM2)[T.None known]       -1.8438      0.667     -2.766      0.006      -3.151      -0.537
==============================================================================
Omnibus:                    16026.030   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2682859.110
Skew:                           6.338   Prob(JB):                         0.00
Kurtosis:                      71.261   Cond. No.                         19.9
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            0.8301  -1.4621   3.1223  False
1 ExtRel        1 Par, ExtRel   -0.0595  -2.3315   2.2125  False
1 ExtRel        2 Par            3.5267  -3.633   10.6864  False
1 ExtRel        2 Par, ExtRel    2.917   -2.4245   8.2585  False
1 ExtRel        >1 ExtRel       -0.8951  -3.3128   1.5227  False
1 ExtRel        None known      -1.0137  -2.4391   0.4117  False
1 Par           1 Par, ExtRel   -0.8896  -3.5344   1.7552  False
1 Par           2 Par            2.6966  -4.59     9.9831  False
1 Par           2 Par, ExtRel    2.0869  -3.4235   7.5973  False
1 Par           >1 ExtRel       -1.7252  -4.4962   1.0459  False
1 Par           None known      -1.8438  -3.8097   0.1221  False
1 Par, ExtRel   2 Par            3.5862  -3.6941  10.8664  False
1 Par, ExtRel   2 Par, ExtRel    2.9765  -2.5255   8.4785  False
1 Par, ExtRel   >1 ExtRel       -0.8356  -3.5899   1.9188  False
1 Par, ExtRel   None known      -0.9542  -2.8966   0.9882  False
2 Par           2 Par, ExtRel   -0.6097  -9.3487   8.1294  False
2 Par           >1 ExtRel       -4.4217 -11.7488   2.9053  False
2 Par           None known      -4.5404 -11.6024   2.5216  False
2 Par, ExtRel   >1 ExtRel       -3.8121  -9.3759   1.7517  False
2 Par, ExtRel   None known      -3.9307  -9.1405   1.2791  False
>1 ExtRel       None known      -0.1186  -2.2296   1.9923  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (over 50):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               7.674192    21.722699  65.704242 10.561760
2 Par              10.370763    29.720298  63.050847 11.193323
1 ExtRel            6.844085    18.454500  66.106495 10.870987
>1 ExtRel           5.949022    15.015176  63.831683 10.051776
1 Par, ExtRel       6.784583    20.623544  61.682409  8.663090
2 Par, ExtRel       9.761086    37.674648  60.706422  8.068346
None known          5.830390    17.524113  68.811470 11.430535
[Figure: bar plot of average drinks/month by family alcoholism category]
25 and under
[Figure: bar plot of average drinks/month by family alcoholism category]
26-50
[Figure: bar plot of average drinks/month by family alcoholism category]
51 and over
Subdividing the data by age does change the scenarios in which we can reject the null hypothesis that there is no relationship between family history of alcoholism and average drinks/month.
For those 25 and under, the only comparison where the null hypothesis is rejected is between those with one alcoholic parent plus alcoholic extended family and those with no known family history of alcoholism.
For those from 26-50, the null hypothesis is rejected between those with 2 alcoholic parents and every other group except those with 1 alcoholic parent.
For those over 50, despite the overall p-value being less than 0.05, the Tukey post hoc test does not find any pair of groups for which the null hypothesis can be rejected. This baffled me, but I found an explanation here.
To quote:
“It is possible that the overall mean of group A and group B combined differs significantly from the combined mean of groups C, D and E. Perhaps the mean of group A differs from the mean of groups B through E. Scheffe's post test detects differences like these (but this test is not offered by GraphPad InStat or Prism). If the overall ANOVA P value is less than 0.05, then Scheffe's test will definitely find a significant difference somewhere (if you look at the right comparison, also called contrast). The multiple comaprisons tests offered by GraphPad InStat and Prism only compare group means, and it is quite possible for the overall ANOVA to reject the null hypothesis that all group means are the same yet for the post test to find no significant difference among group means.”
This is likely what happened in this sample.
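As an aside, an alternative to splitting the data by age group is to test the moderation directly with an interaction term and look at whether the interaction coefficients are significant. This is not what I did above, just a sketch of that other approach:

import pandas as pd
import statsmodels.formula.api as smf

#sketch: age group as a moderator via an interaction instead of stratifying
subaa13["AGEGROUP"] = pd.cut(subaa13["AGE"], bins=[0, 25, 50, 200], labels=["25 and under", "26-50", "over 50"])
mod_int = smf.ols("DRINKMO ~ C(AAFAM2) * C(AGEGROUP)", data=subaa13).fit()
print(mod_int.summary())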
Correlation Coefficients
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
Truthfully, my data doesn't lend itself to two quantitative variables very well, so I'm going a bit off-topic and looking at the relationship between age and drinks/month.
All Python is blockquoted. Code is italicized. Output is not.
subaa1_dd=subaa1[["S2AQ8B", "DOBY", "DRINKMO"]].dropna()
max(subaa1_dd["DOBY"])
subaa1_dd["AGE"]=2002-subaa1_dd["DOBY"] #Study was published in 2002
subaa1_dd["AGE"].head(n=25)
plt.scatter(y="DRINKMO", x="AGE", data=subaa1_dd)
plt.xlabel("Age")
plt.ylabel("Drinks/month")
print("Association between number of number of drinks/month and age") print(sst.pearsonr(subaa1_dd["AGE"], subaa1_dd["DRINKMO"]))
Association between number of drinks/month and age
(-0.025620117177638197, 6.447570986056389e-06)
[Image: scatter plot of drinks/month vs. age]
The p-value (6.45e-6) is small enough to suggest that the relationship is statistically significant. The correlation coefficient (r=-0.0256) is very close to 0, indicating a very weak negative correlation: as age increases, the number of drinks/month decreases slightly, though the effect is almost negligible.
If I square r, I get r^2 = 0.000656, meaning age accounts for less than 0.1% of the variability in drinks/month, so neither variable is a useful predictor of the other.
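For completeness, a minimal sketch of the r-to-r^2 step, reusing subaa1_dd and the sst (scipy.stats) alias from the code above:

# Pearson r and the proportion of shared variance (r squared)
r, p_value = sst.pearsonr(subaa1_dd["AGE"], subaa1_dd["DRINKMO"])
r_squared = r ** 2  # fraction of variability in drinks/month associated with age

print("r =", round(r, 4), " p =", p_value, " r^2 =", round(r_squared, 6))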
0 notes
kpenlearnspython-blog · 6 years ago
Text
Chi Square Test
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
2/11/19 edits: I realized after posting yesterday that I had misunderstood the Bonferroni adjustment. I’m correcting it now; the conclusions in the original posting for this assignment may have been incorrect, and the previously wrong assessments are marked below.
I tried to blockquote all Python. Written code is italicized. Printed code is not.
To be honest, my data isn’t great for this sort of testing, so I went a little outside my hypothesis and looked at whether there was any trend toward increased abstinence from alcohol for those who have alcoholism in their family (though they have not themselves been diagnosed with any sort of alcohol abuse/dependence).
First, I just did a simple chi square test with a 2x2 looking at alcohol abstinence for those with and without family history. 
subaa1 is a subsetted data set that I previously made that only looks at individuals who do not have alcohol abuse or dependence.
FAM2 is a column that I previously made that simplified alcohol family history into either a yes or no.
subaa1["S2AQ1"]=subaa1["S2AQ1"].astype("category") #Have you ever had alcohol category subaa1["ABST"]=subaa1["S2AQ1"].cat.rename_categories(["Drinks", "Abstains"]) ct1=pd.crosstab(subaa1["ABST"], subaa1["FAM2"]) #categorical variables print(ct1) #get counts colsum=ct1.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct=ct1/colsum print(colpct)
print("chi-square value, p value, expected counts") cs1=sst.chi2_contingency(ct1) print(cs1)
FAM2      Family History  No Family History
ABST
Abstains            2380               5886
Drinks              9494              13490

FAM2      Family History  No Family History
ABST
Abstains        0.200438           0.303778
Drinks          0.799562           0.696222

chi-square value, p value, expected counts
(403.6040870866473, 9.04438165422264e-90, 1, array([[ 3140.815488,  5125.184512],
       [ 8733.184512, 14250.815488]]))
The chi-square value (403.6) is much greater than the critical value of 3.84, and the p-value (9.0e-90) is much less than 0.05, so I can reject the null hypothesis that there is no association between family history of alcoholism and whether a person drinks or abstains. From the table, those with a family history of alcoholism appear more likely to drink than expected (79.9% observed vs 73.5% expected).
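As a sanity check on those two percentages, here is a minimal sketch (reusing ct1 and cs1 from the code above, with pandas as pd) of how the observed and expected column percentages can be pulled out of the chi2_contingency result:

# Observed column percentages (same as colpct printed above)
observed_pct = ct1 / ct1.sum(axis=0)

# Expected counts under independence are the 4th element of the chi2_contingency result
expected = pd.DataFrame(cs1[3], index=ct1.index, columns=ct1.columns)
expected_pct = expected / expected.sum(axis=0)

print(observed_pct.loc["Drinks", "Family History"])   # ~0.80 observed
print(expected_pct.loc["Drinks", "Family History"])   # ~0.74 expected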
To do a post hoc test, I decided to look at my categories for family history with alcoholism (1 parent, 2 parents, 1 extended relative, >1 extended relative, 1 parent+extended relatives, 2 parents + extended relative, no alcoholic family known) in relation to drinking vs abstaining.
ct2=pd.crosstab(subaa1["ABST"], subaa1["AAFAM2"]) #categorical variables
print(ct2) #get counts
colsum2=ct2.sum(axis=0) #use counts from crosstab table. axis=0 says to sum all values in each column
#axis=0 means columns. axis=1 means rows
colpct2=ct2/colsum2
print(colpct2)
#7 degrees of freedom
print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2) print("Expected chi-square for 7 degrees of freedom is 14.07.") print("Corrected p-value for 20 comparisons") 0.05/20
AAFAM2    1 Par  2 Par  1 ExtRel  >1 ExtRel  1 Par, ExtRel  2 Par, ExtRel  \
ABST
Abstains    434     36       896        467            479             68
Drinks     1437    132      3285       1953           2345            342

AAFAM2    None known
ABST
Abstains        5886
Drinks         13490

AAFAM2      1 Par    2 Par  1 ExtRel  >1 ExtRel  1 Par, ExtRel  2 Par, ExtRel  \
ABST
Abstains 0.231962 0.214286  0.214303   0.192975       0.169618       0.165854
Drinks   0.768038 0.785714  0.785697   0.807025       0.830382       0.834146

AAFAM2    None known
ABST
Abstains    0.303778
Drinks      0.696222

chi-square value, p value, expected counts
(434.99112479242615, 8.331856012499931e-91, 6, array([[  494.901952,    44.438016,  1105.924672,   640.11904 ,
          746.981888,   108.44992 ,  5125.184512],
       [ 1376.098048,   123.561984,  3075.075328,  1779.88096 ,
         2077.018112,   301.55008 , 14250.815488]]))
Expected chi-square for 7 degrees of freedom is 14.07.
Corrected p-value for 20 comparisons
Out[39]: 0.0025
I had counted 7 degrees of freedom, but the chi2_contingency output actually reports 6 (a 2x7 table gives (2-1)x(7-1) = 6), for which the critical chi-square value at alpha = 0.05 is about 12.59 according to Google. Either way, since I got 434.99, I can safely reject the null hypothesis that there is no association between which family members had alcoholism and whether a person abstains from alcohol.
I then did post hoc analysis to find the groups with significant differences, comparing each pairwise chi-square p-value against the Bonferroni-adjusted threshold of 0.05/20 = 0.0025. (Strictly speaking, 7 groups give 21 pairwise comparisons, so the threshold would be 0.05/21, roughly 0.0024; that does not change any of the conclusions.)
This is about 20 comparisons, so I will summarize them here rather than walk through each one. For the pairs listed below, the uncorrected chi-square test was significant; the ones that do not survive the Bonferroni correction are marked in parentheses, and for the rest I can reject the null hypothesis that there is no difference in alcohol abstinence between the two family-history groups. (A more compact, loop-based way to run all of these pairwise tests is sketched after the list.)
1 Parent vs. >1 Extended Relative (p = 0.0021)
1 Parent vs. 1 Parent + Extended Relatives (p = 1.5e-07)
1 Parent vs. 2 Parents + Extended Relatives (p = 0.0042; significant at 0.05 but not after the Bonferroni correction)
1 Parent vs. None known (p = 1.0e-10)
2 Parents vs. None known (p = 0.015; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. >1 Extended Relative (p = 0.042; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. 1 Parent + Extended Relatives (p = 4.5e-06)
1 Extended Relative vs. 2 Parents + Extended Relatives (p = 0.025; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. None known (p = 5.9e-31)
1 Parent + Extended Relatives vs. >1 Extended Relative (p = 0.031; significant at 0.05 but not after the Bonferroni correction)
>1 Extended Relative vs. None known (p = 1.6e-29)
1 Parent + Extended Relatives vs. None known (p = 5.9e-49)
2 Parents + Extended Relatives vs. None known (p = 2.4e-09)
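As an aside, the 20-odd hand-written blocks under the Read More could be generated in a loop instead. This is just a sketch, assuming pandas as pd and scipy.stats as sst are already imported and reusing the subaa1, ABST, and AAFAM2 variables from above:

from itertools import combinations

groups = ["1 Par", "2 Par", "1 ExtRel", ">1 ExtRel",
          "1 Par, ExtRel", "2 Par, ExtRel", "None known"]
alpha_bonferroni = 0.05 / len(list(combinations(groups, 2)))  # 21 pairs, so about 0.0024

for g1, g2 in combinations(groups, 2):
    pair = subaa1[subaa1["AAFAM2"].isin([g1, g2])]            # keep only these two groups
    ct = pd.crosstab(pair["ABST"], pair["AAFAM2"].astype(str))
    chi2, p, dof, expected = sst.chi2_contingency(ct)
    flag = "significant" if p < alpha_bonferroni else "not significant"
    print(g1, "vs", g2, ": chi2 =", round(chi2, 2), ", p =", p, "(", flag, ")")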
FAMCOMPv1  1 Par  2 Par ABST                   Abstains     434     36 Drinks      1437    132 FAMCOMPv1    1 Par    2 Par ABST                       Abstains  0.231962 0.214286 Drinks    0.768038 0.785714 chi-square value, p value, expected counts (0.18103193164993958, 0.6704879106105417, 1, array([[ 431.27513487,   38.72486513],       [1439.72486513,  129.27513487]])) FAMCOMPv2  1 ExtRel  1 Par ABST                       Abstains        896    434 Drinks         3285   1437 FAMCOMPv2  1 ExtRel    1 Par ABST                         Abstains   0.214303 0.231962 Drinks     0.785697 0.768038 chi-square value, p value, expected counts (2.2488225420879204, 0.13371611345920698, 1, array([[ 918.82518176,  411.17481824],       [3262.17481824, 1459.82518176]])) FAMCOMPv3  1 Par  >1 ExtRel ABST                       Abstains     434        467 Drinks      1437       1953 FAMCOMPv3    1 Par  >1 ExtRel ABST                         Abstains  0.231962   0.192975 Drinks    0.768038   0.807025 chi-square value, p value, expected counts (9.434649397276742, 0.002129237910827844, 1, array([[ 392.86203682,  508.13796318],       [1478.13796318, 1911.86203682]])) FAMCOMPv4  1 Par  1 Par, ExtRel ABST                           Abstains     434            479 Drinks      1437           2345 FAMCOMPv4    1 Par  1 Par, ExtRel ABST                             Abstains  0.231962       0.169618 Drinks    0.768038       0.830382 chi-square value, p value, expected counts (27.52696632852945, 1.5491937328656087e-07, 1, array([[ 363.83876464,  549.16123536],       [1507.16123536, 2274.83876464]])) FAMCOMPv5  1 Par  2 Par, ExtRel ABST                           Abstains     434             68 Drinks      1437            342 FAMCOMPv5    1 Par  2 Par, ExtRel ABST                             Abstains  0.231962       0.165854 Drinks    0.768038       0.834146 chi-square value, p value, expected counts (8.181860997484213, 0.004231133032437819, 1, array([[ 411.76764577,   90.23235423],       [1459.23235423,  319.76764577]])) FAMCOMPv6  1 Par  None known ABST                         Abstains     434        5886 Drinks      1437       13490 FAMCOMPv6    1 Par  None known ABST                           Abstains  0.231962    0.303778 Drinks    0.768038    0.696222 chi-square value, p value, expected counts (41.76775439560775, 1.027846499574022e-10, 1, array([[  556.53598155,  5763.46401845],       [ 1314.46401845, 13612.53598155]])) FAMCOMPv7  2 Par  None known ABST                         Abstains      36        5886 Drinks       132       13490 FAMCOMPv7    2 Par  None known ABST                           Abstains  0.214286    0.303778 Drinks    0.785714    0.696222 chi-square value, p value, expected counts (5.899442601003834, 0.015145677174912012, 1, array([[   50.90544413,  5871.09455587],       [  117.09455587, 13504.90544413]])) FAMCOMPv8  2 Par  2 Par, ExtRel ABST                           Abstains      36             68 Drinks       132            342 FAMCOMPv8    2 Par  2 Par, ExtRel ABST                             Abstains  0.214286       0.165854 Drinks    0.785714       0.834146 chi-square value, p value, expected counts (1.5804032985363066, 0.20870261116515335, 1, array([[ 30.2283737,  73.7716263],       [137.7716263, 336.2283737]])) FAMCOMPv9  1 Par, ExtRel  2 Par ABST                           Abstains             479     36 Drinks              2345    132 FAMCOMPv9  1 Par, ExtRel    2 Par ABST                             Abstains        0.169618 0.214286 Drinks          0.830382 0.785714 chi-square value, p value, 
expected counts (1.9178315136173767, 0.16609591015709457, 1, array([[ 486.0828877,   28.9171123],       [2337.9171123,  139.0828877]])) FAMCOMPv10  2 Par  >1 ExtRel ABST                         Abstains       36        467 Drinks        132       1953 FAMCOMPv10    2 Par  >1 ExtRel ABST                           Abstains   0.214286   0.192975 Drinks     0.785714   0.807025 chi-square value, p value, expected counts (0.32968603816180103, 0.5658440186807167, 1, array([[  32.65224111,  470.34775889],       [ 135.34775889, 1949.65224111]])) FAMCOMPv11  1 ExtRel  2 Par ABST                       Abstains         896     36 Drinks          3285    132 FAMCOMPv11  1 ExtRel    2 Par ABST                         Abstains    0.214303 0.214286 Drinks      0.785697 0.785714 chi-square value, p value, expected counts (0.009091829850638621, 0.9240359656620418, 1, array([[ 895.99724074,   36.00275926],       [3285.00275926,  131.99724074]])) FAMCOMPv12  1 ExtRel  >1 ExtRel ABST                           Abstains         896        467 Drinks          3285       1953 FAMCOMPv12  1 ExtRel  >1 ExtRel ABST                           Abstains    0.214303   0.192975 Drinks      0.785697   0.807025 chi-square value, p value, expected counts (4.126102965534052, 0.04222648604697947, 1, array([[ 863.30904408,  499.69095592],       [3317.69095592, 1920.30904408]])) FAMCOMPv13  1 ExtRel  1 Par, ExtRel ABST                               Abstains         896            479 Drinks          3285           2345 FAMCOMPv13  1 ExtRel  1 Par, ExtRel ABST                               Abstains    0.214303       0.169618 Drinks      0.785697       0.830382 chi-square value, p value, expected counts (21.051577696848984, 4.470845718926547e-06, 1, array([[ 820.68165596,  554.31834404],       [3360.31834404, 2269.68165596]])) FAMCOMPv14  1 ExtRel  2 Par, ExtRel ABST                               Abstains         896             68 Drinks          3285            342 FAMCOMPv14  1 ExtRel  2 Par, ExtRel ABST                               Abstains    0.214303       0.165854 Drinks      0.785697       0.834146 chi-square value, p value, expected counts (4.995438947932991, 0.02541420669226183, 1, array([[ 877.90982357,   86.09017643],       [3303.09017643,  323.90982357]])) FAMCOMPv15  1 ExtRel  None known ABST                             Abstains         896        5886 Drinks          3285       13490 FAMCOMPv15  1 ExtRel  None known ABST                             Abstains    0.214303    0.303778 Drinks      0.785697    0.696222 chi-square value, p value, expected counts (133.85527554859885, 5.876686334055551e-31, 1, array([[ 1203.69919769,  5578.30080231],       [ 2977.30080231, 13797.69919769]])) FAMCOMPv15  1 Par, ExtRel  >1 ExtRel ABST                                 Abstains              479        467 Drinks               2345       1953 FAMCOMPv15  1 Par, ExtRel  >1 ExtRel ABST                                 Abstains         0.169618   0.192975 Drinks           0.830382   0.807025 chi-square value, p value, expected counts (4.652191417597543, 0.03101393014512997, 1, array([[ 509.44012204,  436.55987796],       [2314.55987796, 1983.44012204]])) FAMCOMPv16  2 Par, ExtRel  >1 ExtRel ABST                                 Abstains               68        467 Drinks                342       1953 FAMCOMPv16  2 Par, ExtRel  >1 ExtRel ABST                                 Abstains         0.165854   0.192975 Drinks           0.834146   0.807025 chi-square value, p value, expected counts (1.5099437664564455, 0.2191476673124011, 1, 
array([[  77.50883392,  457.49116608],       [ 332.49116608, 1962.50883392]])) FAMCOMPv17  >1 ExtRel  None known ABST                             Abstains          467        5886 Drinks           1953       13490 FAMCOMPv17  >1 ExtRel  None known ABST                             Abstains     0.192975    0.303778 Drinks       0.807025    0.696222 chi-square value, p value, expected counts (127.35685258994046, 1.5520087618161547e-29, 1, array([[  705.37071022,  5647.62928978],       [ 1714.62928978, 13728.37071022]])) FAMCOMPv18  1 Par, ExtRel  None known ABST                                 Abstains              479        5886 Drinks               2345       13490 FAMCOMPv18  1 Par, ExtRel  None known ABST                                 Abstains         0.169618    0.303778 Drinks           0.830382    0.696222 chi-square value, p value, expected counts (216.27137431045233, 5.884680839320336e-49, 1, array([[  809.67387387,  5555.32612613],       [ 2014.32612613, 13820.67387387]])) FAMCOMPv19  1 Par, ExtRel  2 Par, ExtRel ABST                                     Abstains              479             68 Drinks               2345            342 FAMCOMPv19  1 Par, ExtRel  2 Par, ExtRel ABST                                     Abstains         0.169618       0.165854 Drinks           0.830382       0.834146 chi-square value, p value, expected counts (0.014277578109831067, 0.9048880969104677, 1, array([[ 477.6524428,   69.3475572],       [2346.3475572,  340.6524428]])) FAMCOMPv20  2 Par, ExtRel  None known ABST                                 Abstains               68        5886 Drinks                342       13490 FAMCOMPv20  2 Par, ExtRel  None known ABST                                 Abstains         0.165854    0.303778 Drinks           0.834146    0.696222 chi-square value, p value, expected counts (35.65456040933068, 2.355957096167399e-09, 1, array([[  123.37713535,  5830.62286465],       [  286.62286465, 13545.37713535]]))
Full code below if you click “Read More”. It’s a lot and it’s repetitive.
recode2={"1 Par":"1 Par", "2 Par":"2 Par"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv1']=subaa1['AAFAM2'].map(recode2) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv1"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode3={"1 Par":"1 Par", "1 ExtRel":"1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv2']=subaa1['AAFAM2'].map(recode3) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv2"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode4={"1 Par":"1 Par", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv3']=subaa1['AAFAM2'].map(recode4) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv3"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode5={"1 Par":"1 Par", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv4']=subaa1['AAFAM2'].map(recode5) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv4"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode6={"1 Par":"1 Par", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv5']=subaa1['AAFAM2'].map(recode6) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv5"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode7={"1 Par":"1 Par", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv6']=subaa1['AAFAM2'].map(recode7) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv6"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode8={"2 Par":"2 Par", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv7']=subaa1['AAFAM2'].map(recode8) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv7"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode9={"2 Par":"2 Par", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv8']=subaa1['AAFAM2'].map(recode9) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv8"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode10={"2 Par":"2 Par", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv9']=subaa1['AAFAM2'].map(recode10) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv9"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode11={"2 Par":"2 Par", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv10']=subaa1['AAFAM2'].map(recode11) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv10"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode12={"2 Par":"2 Par", "1 ExtRel":"1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv11']=subaa1['AAFAM2'].map(recode12) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv11"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode13={"1 ExtRel":"1 ExtRel", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv12']=subaa1['AAFAM2'].map(recode13) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv12"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode14={"1 ExtRel":"1 ExtRel", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv13']=subaa1['AAFAM2'].map(recode14) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv13"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode15={"1 ExtRel":"1 ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv14']=subaa1['AAFAM2'].map(recode15) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv14"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode16={"1 ExtRel":"1 ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv15']=subaa1['AAFAM2'].map(recode16) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv15"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode16={">1 ExtRel":">1 ExtRel", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv15']=subaa1['AAFAM2'].map(recode16) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv15"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode17={">1 ExtRel":">1 ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv16']=subaa1['AAFAM2'].map(recode17) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv16"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode18={">1 ExtRel":">1 ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv17']=subaa1['AAFAM2'].map(recode18) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv17"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode19={"1 Par, ExtRel":"1 Par, ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv18']=subaa1['AAFAM2'].map(recode19) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv18"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode20={"1 Par, ExtRel":"1 Par, ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv19']=subaa1['AAFAM2'].map(recode20) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv19"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode21={"None known":"None known", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv20']=subaa1['AAFAM2'].map(recode21) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv20"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
0 notes
kpenlearnspython-blog · 6 years ago
Text
Data Analysis Tools: W1
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
I tried to blockquote all Python. Written code is italicized. Printed code is not.
I used the number of drinks per month and the category of family members with alcoholism for my data set. As you can see in the last post, it looked like people with 2 alcoholic parents had more drinks/month, but the variance was high. Previously, I subsetted my data to look only at people who have not displayed signs of alcohol abuse/dependence.
subaa12=subaa1[["DRINKMO", "AAFAM2"]].dropna()
aafammodel = smf.ols(formula="DRINKMO ~ C(AAFAM2)", data=subaa12)
aafamresults=aafammodel.fit()
print(aafamresults.summary())
#Prob (F-statistic) of 3.82e-06 (F = 5.88), so something is significantly different.
mc_aafam = multi.MultiComparison(subaa12["DRINKMO"], subaa12["AAFAM2"])
res_aafam = mc_aafam.tukeyhsd() #Request the test
print(res_aafam.summary())
                           OLS Regression Results                             ============================================================================== Dep. Variable:                DRINKMO   R-squared:                       0.001 Model:                            OLS   Adj. R-squared:                  0.001 Method:                 Least Squares   F-statistic:                     5.882 Date:                Sat, 02 Feb 2019   Prob (F-statistic):           3.82e-06 Time:                        20:30:58   Log-Likelihood:            -1.3584e+05 No. Observations:               30996   AIC:                         2.717e+05 Df Residuals:                   30989   BIC:                         2.718e+05 Df Model:                           6                                         Covariance Type:            nonrobust                                         ==============================================================================================                                 coef    std err          t      P>|t|      [0.025      0.975] ---------------------------------------------------------------------------------------------- Intercept                      7.9278      0.449     17.650      0.000       7.047       8.808 C(AAFAM2)[T.2 Par]             4.6302      1.569      2.951      0.003       1.555       7.706 C(AAFAM2)[T.1 ExtRel]         -1.0241      0.540     -1.896      0.058      -2.083       0.035 C(AAFAM2)[T.>1 ExtRel]        -1.4900      0.598     -2.492      0.013      -2.662      -0.318 C(AAFAM2)[T.1 Par, ExtRel]    -0.7644      0.579     -1.321      0.187      -1.899       0.370 C(AAFAM2)[T.2 Par, ExtRel]    -0.1417      1.059     -0.134      0.894      -2.217       1.934 C(AAFAM2)[T.None known]       -1.6631      0.470     -3.535      0.000      -2.585      -0.741 ============================================================================== Omnibus:                    40095.815   Durbin-Watson:                   1.986 Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10245364.757 Skew:                           7.153   Prob(JB):                         0.00 Kurtosis:                      90.911   Cond. No.                         18.0 ==============================================================================Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.     
Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================================
   group1        group2    meandiff  lower    upper  reject
------------------------------------------------------------
  1 ExtRel       1 Par      1.0241  -0.5684   2.6167 False
  1 ExtRel   1 Par, ExtRel  0.2597  -1.1335   1.653  False
  1 ExtRel       2 Par      5.6543   1.1339  10.1748  True
  1 ExtRel   2 Par, ExtRel  0.8824  -2.0804   3.8451 False
  1 ExtRel     >1 ExtRel   -0.4659  -1.9278   0.996  False
  1 ExtRel     None known   -0.639  -1.6149   0.337  False
   1 Par     1 Par, ExtRel -0.7644  -2.4711   0.9423 False
   1 Par         2 Par      4.6302   0.0036   9.2569  True
   1 Par     2 Par, ExtRel -0.1417  -3.2642   2.9807 False
   1 Par       >1 ExtRel    -1.49   -3.2532   0.2731 False
   1 Par       None known  -1.6631  -3.0502   -0.276  True
1 Par, ExtRel     2 Par      5.3946   0.8327   9.9565  True
1 Par, ExtRel 2 Par, ExtRel  0.6226   -2.403   3.6483 False
1 Par, ExtRel   >1 ExtRel   -0.7257  -2.3111   0.8598 False
1 Par, ExtRel   None known  -0.8987  -2.0516   0.2541 False
   2 Par     2 Par, ExtRel -4.7719   -10.03   0.4862 False
   2 Par       >1 ExtRel   -6.1202 -10.7035  -1.5369  True
   2 Par       None known  -6.2933 -10.7455  -1.8411  True
2 Par, ExtRel   >1 ExtRel   -1.3483  -4.4061   1.7096 False
2 Par, ExtRel   None known  -1.5213  -4.3789   1.3362 False
 >1 ExtRel     None known  -0.1731  -1.4079   1.0618 False
------------------------------------------------------------
Despite what looked like large error bars, the overall ANOVA gives a very small p-value (Prob (F-statistic) = 3.82e-06, with F = 5.88), indicating that at least one group mean differs from the others. Running the Tukey test to determine for which pairs I can reject the null hypothesis (that there is no difference in drinks/month between people with different alcoholic family histories), six comparisons are flagged as significant:
1 Extended Relative and 2 Parents
1 Parent and 2 Parents
1 Parent and None known
1 Parent + Extended Relatives and 2 Parents
2 Parents and Multiple Extended Relatives
2 Parents and None known
Noticeably, I only got this result when I used .dropna() and I haven’t figured out why yet. I’m going to do some googling and review the lesson.
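My working guess (not verified yet) is that missing values are the culprit: the group means that feed the Tukey table would pick up NaNs if rows with missing DRINKMO or AAFAM2 are left in. A quick diagnostic sketch, reusing subaa1 from the earlier posts:

# How much does dropna() actually remove, and from which column?
print(subaa1[["DRINKMO", "AAFAM2"]].isna().sum())      # NaN count per column

before = len(subaa1)
after = len(subaa1[["DRINKMO", "AAFAM2"]].dropna())
print(before, "rows before dropna,", after, "rows after")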
0 notes
kpenlearnspython-blog · 6 years ago
Text
Week 4: Graph Time
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
I tried to blockquote all Python. Written code is italicized. Printed code is not.
For my univariate graphs, I first wanted to look at the number of people with no reported alcohol problem that have family history of alcoholism.
import seaborn as sb
import matplotlib.pyplot as plt #seaborn is dependent upon this to create graphs

#Create univariate graphs to show center and spread.
#This is just general counts.
#Will do number of people with no reported alcohol problem that have family history of alcoholism.
subaa = nesarc[(nesarc['ALCABDEP12DX']==0) & (nesarc['ALCABDEPP12DX']==0)] #No history of alcohol abuse subgroup
subaa1=subaa.copy()
#Now need to make a single variable for those with family history.
subaa1['S2DQ1'] = pd.to_numeric(subaa1['S2DQ1']) #Father
subaa1['S2DQ2'] = pd.to_numeric(subaa1['S2DQ2']) #Mother
subaa1['S2DQ7C2'] = pd.to_numeric(subaa1['S2DQ7C2']) #Dad's bro
subaa1['S2DQ8C2'] = pd.to_numeric(subaa1['S2DQ8C2']) #Dad's sis
subaa1['S2DQ9C2'] = pd.to_numeric(subaa1['S2DQ9C2']) #Mom's bro
subaa1['S2DQ10C2'] = pd.to_numeric(subaa1['S2DQ10C2']) #Mom's sis
subaa1['S2DQ11'] = pd.to_numeric(subaa1['S2DQ11']) #Dad's pa
subaa1['S2DQ12'] = pd.to_numeric(subaa1['S2DQ12']) #Dad's ma
subaa1['S2DQ13A'] = pd.to_numeric(subaa1['S2DQ13A']) #Mom's pa
subaa1['S2DQ13B'] = pd.to_numeric(subaa1['S2DQ13B']) #Mom's ma

#Replace all "nos" to "0"s
subaa1['S2DQ1'] = subaa1['S2DQ1'].replace([2], 0) #Dad
subaa1['S2DQ2'] = subaa1['S2DQ2'].replace([2], 0) #Mom
subaa1['S2DQ7C2'] = subaa1['S2DQ7C2'].replace([2], 0) #D-bro
subaa1['S2DQ8C2'] = subaa1['S2DQ8C2'].replace([2], 0) #D-sis
subaa1['S2DQ9C2'] = subaa1['S2DQ9C2'].replace([2], 0) #M-bro
subaa1['S2DQ10C2'] = subaa1['S2DQ10C2'].replace([2], 0) #M-sis
subaa1['S2DQ11'] = subaa1['S2DQ11'].replace([2], 0) #D-pa
subaa1['S2DQ12'] = subaa1['S2DQ12'].replace([2], 0) #D-ma
subaa1['S2DQ13A'] = subaa1['S2DQ13A'].replace([2], 0) #M-pa
subaa1['S2DQ13B'] = subaa1['S2DQ13B'].replace([2], 0) #M-ma

subaa1['S2DQ1'] = subaa1['S2DQ1'].replace([9], np.nan) #Dad
subaa1['S2DQ2'] = subaa1['S2DQ2'].replace([9], np.nan) #Mom
subaa1['S2DQ7C2'] = subaa1['S2DQ7C2'].replace([9], np.nan) #D-bro
subaa1['S2DQ8C2'] = subaa1['S2DQ8C2'].replace([9], np.nan) #D-sis
subaa1['S2DQ9C2'] = subaa1['S2DQ9C2'].replace([9], np.nan) #M-bro
subaa1['S2DQ10C2'] = subaa1['S2DQ10C2'].replace([9], np.nan) #M-sis
subaa1['S2DQ11'] = subaa1['S2DQ11'].replace([9], np.nan) #D-pa
subaa1['S2DQ12'] = subaa1['S2DQ12'].replace([9], np.nan) #D-ma
subaa1['S2DQ13A'] = subaa1['S2DQ13A'].replace([9], np.nan) #M-pa
subaa1['S2DQ13B'] = subaa1['S2DQ13B'].replace([9], np.nan) #M-ma
def FAM(row):
    if row["S2DQ1"]>0 or row["S2DQ2"]>0 or row["S2DQ7C2"]>0 or row["S2DQ8C2"]>0 or row["S2DQ9C2"]>0 or row["S2DQ10C2"]>0 or row["S2DQ11"]>0 or row["S2DQ12"]>0 or row["S2DQ13A"]>0 or row["S2DQ13B"]>0:
        return 1
    else:
        return 0

subaa1["FAM"] = subaa1.apply(lambda row: FAM(row), axis=1)
subaa12 = subaa1[["FAM", "S2DQ1", "S2DQ2", "S2DQ7C2", "S2DQ8C2", "S2DQ9C2", "S2DQ10C2", "S2DQ11", "S2DQ12", "S2DQ13A", "S2DQ13B"]]
subaa12.head(n=25)
c11=subaa1["FAM"].value_counts(sort=False, dropna=False) p11=subaa1["FAM"].value_counts(sort=False, dropna=False, normalize=True) print("Family history of alcohol abuse or dependence--1 is yes, 0 is no") print(c11)
subaa1["FAM"] = subaa1["FAM"].astype('category') sb.countplot(x="FAM", data=subaa1) plt.xlabel("Presence of alcohol abuse or dependence in previous generation") plt.title("Alcoholism in family history for individuals with no personal history of alcohol abuse or dependence")
Family history of alcohol abuse or dependence--1 is yes, 0 is no
0    19376
1    11874
Name: FAM, dtype: int64
[Image: count plot of FAM (family history of alcoholism: 1 = yes, 0 = no)]
So, the majority of people who do not presently exhibit alcohol abuse or dependence have no family history of alcoholism.
I then wanted to look at histograms of drinks consumed by those not reported to exhibit alcohol abuse/dependence for those with and without family history. I used previously created sub-datasets to visualize this.
sb.distplot(subaafam2["DRINKMO"].dropna(), kde=False);
plt.xlabel("Number of Drinks Per Month")
plt.title("Estimated # of Drinks/Month for those with Family History of Alcoholism")
sb.distplot(subaafam4["DRINKMO"].dropna(), kde=False);
plt.xlabel("Number of Drinks Per Month")
plt.title("Estimated # of Drinks/Month for those with NO Family History of Alcoholism")
[Images: histograms of estimated drinks/month for those with and without a family history of alcoholism]
For both of these data sets, as I have seen in previous counts, the majority of people consume very few drinks, though there are some high outliers. This makes these histograms, on the whole, fairly uninformative.
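One thing I might try later to make the tails visible (just a sketch, not something I ran for this assignment) is putting the counts on a log scale:

# Same histogram as above, but with log-scaled counts so the sparse
# heavy-drinking tail shows up (reuses subaafam2 from the earlier posts).
sb.distplot(subaafam2["DRINKMO"].dropna(), kde=False)
plt.yscale("log")
plt.xlabel("Number of Drinks Per Month")
plt.title("Drinks/Month (log-scaled counts), Family History of Alcoholism")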
Next, I wanted to examine a bivariate relationship between the average number of drinks per month and family history of alcoholism.
subaa1['S2AQ8A'] = pd.to_numeric(subaa1['S2AQ8A'],errors='coerce').fillna(0).astype(int)
subaa1.loc[(subaa1["S2AQ3"]!=9) & (subaa1["S2AQ8A"]==0), "S2AQ8A"]=11
subaa1["USFREQ"]=subaa1["S2AQ8A"].map(recode1)
subaa1['S2AQ8B'] = pd.to_numeric(subaa1['S2AQ8B'],errors='coerce').fillna(0).astype(int)
subaa1['S2AQ8B']=subaa1['S2AQ8B'].replace(99, np.nan)
#print(subaafam4["S2AQ8B"].value_counts(sort=False, dropna=False)) #Just to make sure this worked
subaa1["DRINKMO"]=subaa1["S2AQ8B"]*subaa1["USFREQ"]/12
print(max(subaa1["DRINKMO"]))
subaa1['DRINKMO10']=pd.qcut(subaa1.DRINKMO, 10, duplicates="drop", labels=["1=50%tile", "2=60%tile", "3=70%tile", "4=80%tile", "5=90%tile", "6=100%tile"])
subaa1["DRINKMO10"].head(n=25)
subaa1["FAM2"]=subaa1["FAM"]
subaa1["FAM2"]=subaa1["FAM2"].cat.rename_categories(["No Family History", "Family History"])
sb.factorplot(x='FAM2', y='DRINKMO', data=subaa1, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Average of Drinks/Month')
[Image: bar chart of average drinks/month by family history of alcoholism]
Interestingly, the average number of drinks/month is slightly higher for those with family history of alcoholism (though I’m not sure if this is statistically significant).
Finally, I wanted to explore whether there was any relationship between the number of drinks/month and alcoholic family history; specifically, whether the degree of relationship correlates with drink consumption.
parents = np.dstack((subaa1["S2DQ1"],subaa1["S2DQ2"])) #making variables of interest into a single array print(parents.shape) #double checking (1, 31250, 2)
parents2 = np.nansum(parents,2) #adding the two elements together on the correct dimension print(parents2.T.shape) (31250, 1)
subaa1["AAFAMPAR2"]=parents2.T #adding the column into the dataset as a new variable extfam = np.dstack((subaa1["S2DQ7C2"],subaa1["S2DQ8C2"],subaa1["S2DQ9C2"],subaa1["S2DQ10C2"],subaa1["S2DQ11"],subaa1["S2DQ12"],subaa1["S2DQ13A"],subaa1["S2DQ13B"])) print(extfam.shape) (1, 31250, 8)
extfam2=np.nansum(extfam, 2) print(extfam2.shape) (1, 31250)
subaa1["AAFAMEXT2"]=extfam2.Tdef AAFAM2(row):    if row["AAFAMPAR2"]==1 and row["AAFAMEXT2"]==0:        return 1 #1 alcoholic parent    if row["AAFAMPAR2"]==2 and row["AAFAMEXT2"]==0:        return 2 #both alcoholic parents    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]==1:        return 3 #one alcoholic relative    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]>1:        return 4 #multiple alcoholic relatives    if row["AAFAMPAR2"]==1 and row["AAFAMEXT2"]>0:        return 5 #One alcoholic parent and at least 1 alcoholic relative    if row["AAFAMPAR2"]>1 and row["AAFAMEXT2"]>0:        return 6 #Both parents and at least 1 relative    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]==0:        return 7 #No known alcoholic family historyprint("For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)" + "\n"      "Number code:" + "\n"      "1. A single alcoholic parent" + "\n"      "2. Two alcoholic parents" + "\n"      "3. One alcoholic extended relative" + "\n"      "4. Multiple alcoholic extended relatives" + "\n"      "5. One alcoholic parents and at least one alcoholic relative" + "\n"      "6. Both parents and at least one extended relative" + "\n"      "7. No known alcoholic family history") subaa1["AAFAM2"] = subaa1.apply(lambda row: AAFAM2(row), axis=1) c13=subaa1["AAFAM2"].value_counts(sort=False, dropna=False) p13=subaa1["AAFAM2"].value_counts(sort=False, dropna=False, normalize=True) print("Percentages") print(p13.sort_index()) print("Counts") print(c13.sort_index()) For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family) Number code: 1. A single alcoholic parent 2. Two alcoholic parents 3. One alcoholic extended relative 4. Multiple alcoholic extended relatives 5. One alcoholic parents and at least one alcoholic relative 6. Both parents and at least one extended relative 7. No known alcoholic family history Percentages 1   0.059872 2   0.005376 3   0.133792 4   0.077440 5   0.090368 6   0.013120 7   0.620032 Name: AAFAM2, dtype: float64 Counts 1     1871 2      168 3     4181 4     2420 5     2824 6      410 7    19376 Name: AAFAM2, dtype: int64
subaa1["AAFAM2"]=subaa1["AAFAM2"].astype("category") subaa1["AAFAM2"]=subaa1["AAFAM2"].cat.rename_categories(["1 Par", "2 Par", "1 ExtRel", ">1 ExtRel", "1 Par, ExtRel", "2 Par, ExtRel", "None known"])
sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)
[Image: bar chart of average drinks/month by family-history category (AAFAM2)]
Interestingly, those with two alcoholic parents seem to have a slightly higher average number of drinks/month, though the variance on this is large, so it's hard to know if it's statistically significant. Worth noting, we don't see the same bump for those who have 2 parents and extended relatives with alcoholism.
Summary:
Among people with no personal history of alcohol abuse or dependence, more have no family history of alcohol abuse/dependence than have a family history.
When looking at counts of average drinks/month for those with and without family history, the vast majority of people in both data sets consume fairly few drinks.
Despite this, if you look at the average number of drinks/month for those with and without family history, those with family history have a slightly higher average. I'm not sure whether this is statistically significant or not.
Furthermore, if you look at the relationship of specific alcoholic family members to alcohol consumption for those with no history of alcohol abuse/dependence, you see a higher average consumption for those with 2 alcoholic parents. That said, the variance is high there, so it is unclear whether the difference is statistically significant.
0 notes
kpenlearnspython-blog · 6 years ago
Text
Week 3: Python Hell
Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
I changed my statistic to drinks/month in this post, and I derive that distribution below.
Secondary research question: Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
Full disclosure--I haven’t touched the present vs. absent thing yet.
This week I try to sort my data better, and, quite frankly, I'm not sure I did everything right. But I tried. I'm beginning to think I made things too complicated for myself. Too late now. Code is blockquoted, with raw code in italics and plain text being the output.
First, I need to re-do the grouping for the abstainers vs. did not answer. I realized in watching the lesson videos that those were not exactly the same.
#subaafam2 is for those that have an alcoholic family history. #subaafam4 is for those with no alcoholic family history print("This data is for individuals who, as reported by this study, have not experienced alcohol abuse/dependence.")
print("Key:" + "\n"    "0. Did not answer." +"\n"    "1. Every day" +"\n"    "2. Nearly every day" +"\n"    "3. 3 to 4 times a week" +"\n"    "4. 2 times a week" +"\n"    "5. Once a week" +"\n"    "6. 2 to 3 times a month" +"\n"    "7. Once a month" +"\n"    "8. 7 to 11 times in the last year" +"\n"    "9. 3 to 6 times in the last year" +"\n"    "10. 1 or 2 times in the last year" +"\n"    "11. Have no drinken in the past 12 months." + "\n"    "99. Unknown")
subaafam4['S2AQ8A'] = pd.to_numeric(subaafam4['S2AQ8A'],errors='coerce').fillna(0).astype(int) subaafam4.loc[(subaafam4["S2AQ3"]!=9) & (subaafam4["S2AQ8A"]==0), "S2AQ8A"]=11 print("Distribution of alcohohol consumption frequency for those with NO family history of alcoholism:") subaafam4['S2AQ8A'] = pd.to_numeric(subaafam4['S2AQ8A']) print("Percentages") p4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p4.sort_index()) print("Counts") c4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False) print(c4.sort_index())
subaafam2['S2AQ8A'] = pd.to_numeric(subaafam2['S2AQ8A'],errors='coerce').fillna(0).astype(int) subaafam2.loc[(subaafam2["S2AQ3"]!=9) & (subaafam2["S2AQ8A"]==0), "S2AQ8A"]=11 print("Percetage of alcohohol consumption frequency for those with family history of alcoholism:") subaafam2['S2AQ8A'] = pd.to_numeric(subaafam2['S2AQ8A']) print("Percentages") p5 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p5.sort_index()) print("Counts") c5 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False) print(c5.sort_index())
This data is for individuals who, as reported by this study, have not experienced alcohol abuse/dependence. Key: 0. Did not answer. 1. Every day 2. Nearly every day 3. 3 to 4 times a week 4. 2 times a week 5. Once a week 6. 2 to 3 times a month 7. Once a month 8. 7 to 11 times in the last year 9. 3 to 6 times in the last year 10. 1 or 2 times in the last year 11. Have no drinken in the past 12 months. 99. Unknown Distribution of alcohohol consumption frequency for those with NO family history of alcoholism: Percentages 0    0.000086 1    0.030639 2    0.017079 3    0.035530 4    0.049176 5    0.064796 6    0.067456 7    0.061535 8    0.037590 9    0.069945 10   0.089341 11   0.471765 99   0.005064 Name: S2AQ8A, dtype: float64 Counts 0        1 1      357 2      199 3      414 4      573 5      755 6      786 7      717 8      438 9      815 10    1041 11    5497 99      59 Name: S2AQ8A, dtype: int64 Percetage of alcohohol consumption frequency for those with family history of alcoholism: Percentages 0    0.000253 1    0.029308 2    0.016928 3    0.043119 4    0.045393 5    0.064511 6    0.081270 7    0.064426 8    0.050109 9    0.104514 10   0.115968 11   0.382095 99   0.002105 Name: S2AQ8A, dtype: float64 Counts 0        3 1      348 2      201 3      512 4      539 5      766 6      965 7      765 8      595 9     1241 10    1377 11    4537 99      25 Name: S2AQ8A, dtype: int64
Second, I want an actual number of days on which alcohol was consumed, so I need to recode the S2AQ8A frequency categories into an approximate number of drinking days per year for each individual.
recode1={1:365, 2:300, 3:180, 4:104, 5:52, 6:30, 7:12, 8:9, 9:4.5, 10:1.5, 11:0} subaafam4["USFREQ"]=subaafam4["S2AQ8A"].map(recode1) print("Approximate number of days alcohol was consumed in the last 12 months for those with NO family history of alcoholism:") print("Percentages") p6 = subaafam4['USFREQ'].value_counts(sort=False, dropna=False, normalize=True) print(p6.sort_index()) print("Counts") c6 = subaafam4['USFREQ'].value_counts(sort=False, dropna=False) print(c6.sort_index())
print("Approximate number of days alcohol was consumed in the last 12 months for those with family history of alcoholism:") subaafam2["USFREQ"]=subaafam2["S2AQ8A"].map(recode1) print("Percentages") p7 = subaafam2['USFREQ'].value_counts(sort=False, dropna=False, normalize=True) print(p7.sort_index()) print("Counts") c7 = subaafam2['USFREQ'].value_counts(sort=False, dropna=False) print(c7.sort_index()) Approximate number of days alcohol was consumed in the last 12 months for those with NO family history of alcoholism: Percentages 0.000000     0.471765 1.500000     0.089341 4.500000     0.069945 9.000000     0.037590 12.000000    0.061535 30.000000    0.067456 52.000000    0.064796 104.000000   0.049176 180.000000   0.035530 300.000000   0.017079 365.000000   0.030639 nan          0.005149 Name: USFREQ, dtype: float64 Counts 0.000000      5497 1.500000      1041 4.500000       815 9.000000       438 12.000000      717 30.000000      786 52.000000      755 104.000000     573 180.000000     414 300.000000     199 365.000000     357 nan             60 Name: USFREQ, dtype: int64 Approximate number of days alcohol was consumed in the last 12 months for those with family history of alcoholism: Percentages 0.000000     0.382095 1.500000     0.115968 4.500000     0.104514 9.000000     0.050109 12.000000    0.064426 30.000000    0.081270 52.000000    0.064511 104.000000   0.045393 180.000000   0.043119 300.000000   0.016928 365.000000   0.029308 nan          0.002358 Name: USFREQ, dtype: float64 Counts 0.000000      4537 1.500000      1377 4.500000      1241 9.000000       595 12.000000      765 30.000000      965 52.000000      766 104.000000     539 180.000000     512 300.000000     201 365.000000     348 nan             28 Name: USFREQ, dtype: int64
Third, I need to multiply the result from USFREQ (number of days drinking) by S2AQ8B (number of drinks usually consumed on days when drinking) to get total number of drinks consumed in a year. Then, I’ll divide this by 12 to get # drinks consumed/month (on average).
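As a quick sanity check of that arithmetic (made-up values, not taken from the dataset):

# Hypothetical respondent: frequency code 6 ("2 to 3 times a month") maps to
# 30 drinking days/year in recode1, and they usually have 3 drinks per occasion.
days_per_year = 30
drinks_per_occasion = 3

drinks_per_month = days_per_year * drinks_per_occasion / 12
print(drinks_per_month)  # 7.5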
subaafam4['S2AQ8B'] = pd.to_numeric(subaafam4['S2AQ8B'],errors='coerce').fillna(0).astype(int) subaafam4['S2AQ8B']=subaafam4['S2AQ8B'].replace(99, np.nan) #print(subaafam4["S2AQ8B"].value_counts(sort=False, dropna=False))#Just to make sure this worked subaafam4["DRINKMO"]=subaafam4["S2AQ8B"]*subaafam4["USFREQ"]/12 print("Max drinks/month") print(max(subaafam4["DRINKMO"])) #Get the idea of what the highest number needs to be. print("Average drinks per month for those with NO family history of alcoholism:") print("NaN indicates the amout of drinks consumed was unknown") #Split into categories (0-0.5), (0.5+-1), (1+-3), (3+-7.5), (7.5+-15), (15+-30), (30+-60), (60+) subaafam4["DRINKMO"]=pd.cut(subaafam4.DRINKMO, [-1,0.5,1,3,7.5,15,30,60,520]) print("Percentages") p8=subaafam4["DRINKMO"].value_counts(sort=False, dropna=False, normalize=True) print(p8.sort_index()) print("Counts") c8=subaafam4["DRINKMO"].value_counts(sort=False, dropna=False) print(c8.sort_index())
subaafam2['S2AQ8B'] = pd.to_numeric(subaafam2['S2AQ8B'],errors='coerce').fillna(0).astype(int) subaafam2['S2AQ8B']=subaafam2['S2AQ8B'].replace(99, np.nan) #print(subaafam2["S2AQ8B"].value_counts(sort=False, dropna=False))#Just to make sure this worked subaafam2["DRINKMO"]=subaafam2["S2AQ8B"]*subaafam2["USFREQ"]/12 print("Max drinks/month") print(max(subaafam2["DRINKMO"])) #Get the idea of what the highest number needs to be. print("Average drinks per month for those with family history of alcoholism") #Split into categories (0-0.5), (0.5+-1), (1+-3), (3+-7.5), (7.5+-15), (15+-30), (30+-60), (60+) subaafam2["DRINKMO"]=pd.cut(subaafam2.DRINKMO, [-1,0.5,1,3,7.5,15,30,60,520]) print("Percentages") p9=subaafam2["DRINKMO"].value_counts(sort=False, dropna=False, normalize=True) print(p9.sort_index()) print("Counts") c9=subaafam2["DRINKMO"].value_counts(sort=False, dropna=False) print(c9.sort_index()) Max drinks/month 517.0833333333334 Average drinks per month for those with NO family history of alcoholism: NaN indicates the amout of drinks consumed was unknown Percentages NaN             0.005149 (-1.0, 0.5]     0.607878 (0.5, 1.0]      0.067199 (1.0, 3.0]      0.073721 (3.0, 7.5]      0.064624 (7.5, 15.0]     0.064281 (15.0, 30.0]    0.054926 (30.0, 60.0]    0.038363 (60.0, 520.0]   0.023859 Name: DRINKMO, dtype: float64 Counts NaN                60 (-1.0, 0.5]      7083 (0.5, 1.0]        783 (1.0, 3.0]        859 (3.0, 7.5]        753 (7.5, 15.0]       749 (15.0, 30.0]      640 (30.0, 60.0]      447 (60.0, 520.0]     278 Name: DRINKMO, dtype: int64 Max drinks/month 395.4166666666667 Average drinks per month for those with family history of alcoholism Percentages NaN             0.002358 (-1.0, 0.5]     0.562911 (0.5, 1.0]      0.084049 (1.0, 3.0]      0.088092 (3.0, 7.5]      0.072343 (7.5, 15.0]     0.067458 (15.0, 30.0]    0.057268 (30.0, 60.0]    0.037056 (60.0, 520.0]   0.028466 Name: DRINKMO, dtype: float64 Counts NaN                28 (-1.0, 0.5]      6684 (0.5, 1.0]        998 (1.0, 3.0]       1046 (3.0, 7.5]        859 (7.5, 15.0]       801 (15.0, 30.0]      680 (30.0, 60.0]      440 (60.0, 520.0]     338 Name: DRINKMO, dtype: int64
Fourth, I will try to re-divide the data by family relationship, with categories for: one parent only, both parents only, one extended relative only, more than one extended relative, one parent plus extended relative(s), and both parents plus extended relative(s).
subaafam2['S2DQ1'] = pd.to_numeric(subaafam2['S2DQ1']) #Father subaafam2['S2DQ2'] = pd.to_numeric(subaafam2['S2DQ2']) #Mother subaafam2['S2DQ7C2'] = pd.to_numeric(subaafam2['S2DQ7C2']) #Dad's bro subaafam2['S2DQ8C2'] = pd.to_numeric(subaafam2['S2DQ8C2']) #Dad's sis subaafam2['S2DQ9C2'] = pd.to_numeric(subaafam2['S2DQ9C2']) #Mom's bro subaafam2['S2DQ10C2'] = pd.to_numeric(subaafam2['S2DQ10C2']) #Mom's sis subaafam2['S2DQ11'] = pd.to_numeric(subaafam2['S2DQ11']) #Dad's pa subaafam2['S2DQ12'] = pd.to_numeric(subaafam2['S2DQ12']) #Dad's ma subaafam2['S2DQ13A'] = pd.to_numeric(subaafam2['S2DQ13A']) #Mom's pa subaafam2['S2DQ13B'] = pd.to_numeric(subaafam2['S2DQ13B']) #Mom's ma
#Replace all "nos" to "0"s subaafam2['S2DQ1'] = subaafam2['S2DQ1'].replace([2], 0) #Dad subaafam2['S2DQ2'] = subaafam2['S2DQ2'].replace([2], 0) #Mom subaafam2['S2DQ7C2'] = subaafam2['S2DQ7C2'].replace([2], 0) #D-bro subaafam2['S2DQ8C2'] = subaafam2['S2DQ8C2'].replace([2], 0) #D-sis subaafam2['S2DQ9C2'] = subaafam2['S2DQ9C2'].replace([2], 0) #M-bro subaafam2['S2DQ10C2'] = subaafam2['S2DQ10C2'].replace([2], 0) #M-sis subaafam2['S2DQ11'] = subaafam2['S2DQ11'].replace([2], 0) #D-pa subaafam2['S2DQ12'] = subaafam2['S2DQ12'].replace([2], 0) #D-ma subaafam2['S2DQ13A'] = subaafam2['S2DQ13A'].replace([2], 0) #M-pa subaafam2['S2DQ13B'] = subaafam2['S2DQ13B'].replace([2], 0) #M-ma
subaafam2['S2DQ1'] = subaafam2['S2DQ1'].replace([9], np.nan) #Dad subaafam2['S2DQ2'] = subaafam2['S2DQ2'].replace([9], np.nan) #Mom subaafam2['S2DQ7C2'] = subaafam2['S2DQ7C2'].replace([9], np.nan) #D-bro subaafam2['S2DQ8C2'] = subaafam2['S2DQ8C2'].replace([9], np.nan) #D-sis subaafam2['S2DQ9C2'] = subaafam2['S2DQ9C2'].replace([9], np.nan) #M-bro subaafam2['S2DQ10C2'] = subaafam2['S2DQ10C2'].replace([9], np.nan) #M-sis subaafam2['S2DQ11'] = subaafam2['S2DQ11'].replace([9], np.nan) #D-pa subaafam2['S2DQ12'] = subaafam2['S2DQ12'].replace([9], np.nan) #D-ma subaafam2['S2DQ13A'] = subaafam2['S2DQ13A'].replace([9], np.nan) #M-pa subaafam2['S2DQ13B'] = subaafam2['S2DQ13B'].replace([9], np.nan) #M-ma
subaafam2["AAFAMPAR"]=subaafam2["S2DQ1"]+subaafam2["S2DQ2"] subaafam2["AAFAMEXT"]=subaafam2["S2DQ7C2"]+subaafam2["S2DQ8C2"]+subaafam2["S2DQ9C2"]+subaafam2["S2DQ10C2"]+subaafam2["S2DQ11"]+subaafam2["S2DQ12"]+subaafam2["S2DQ13A"]+subaafam2["S2DQ13B"]
def AAFAM(row):    if row["AAFAMPAR"]==1 and row["AAFAMEXT"]==0:        return 1 #1 alcoholic parent    if row["AAFAMPAR"]==2 and row["AAFAMEXT"]==0:        return 2 #both alcoholic parents    if row["AAFAMPAR"]==0 and row["AAFAMEXT"]==1:        return 3 #one alcoholic relative    if row["AAFAMPAR"]==0 and row["AAFAMEXT"]>1:        return 4 #multiple alcoholic relatives    if row["AAFAMPAR"]==1 and row["AAFAMEXT"]>0:        return 5 #One alcoholic parent and at least 1 alcoholic relative    if row["AAFAMPAR"]>1 and row["AAFAMEXT"]>0:        return 6 #Both parents and at least 1 relative
print("For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)" + "\n"      "Number code:" + "\n"      "1. A single alcoholic parent" + "\n"      "2. Two alcoholic parents" + "\n"      "3. One alcoholic extended relative" + "\n"      "4. Multiple alcoholic extended relatives" + "\n"      "5. One alcoholic parents and at least one alcoholic relative" + "\n"      "6. Both parents and at least one extended relative" + "\n"      "nan A parent/relative alcohol abuse/dependence was unknown") subaafam2["AAFAM"] = subaafam2.apply(lambda row: AAFAM(row), axis=1) c10=subaafam2["AAFAM"].value_counts(sort=False, dropna=False) p10=subaafam2["AAFAM"].value_counts(sort=False, dropna=False, normalize=True) print("Percentages") print(p10.sort_index()) print("Counts") print(c10.sort_index())
subaafam21 = subaafam2[["AAFAM", "S2DQ1", "S2DQ2", "S2DQ7C2", "S2DQ8C2", "S2DQ9C2", "S2DQ10C2", "S2DQ11", "S2DQ12", "S2DQ13A", "S2DQ13B"]]
a = subaafam21.head(n=25)
print("Categorization confirmation:")
print(a)

For those with family history of alcohol abuse/dependence, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)
Number code:
1. A single alcoholic parent
2. Two alcoholic parents
3. One alcoholic extended relative
4. Multiple alcoholic extended relatives
5. One alcoholic parent and at least one alcoholic relative
6. Both parents and at least one extended relative
nan A parent/relative alcohol abuse/dependence was unknown
Percentages
1.000000   0.069227
2.000000   0.005727
3.000000   0.222335
4.000000   0.138538
5.000000   0.130537
6.000000   0.018107
nan        0.415530
Name: AAFAM, dtype: float64
Counts
1.000000     822
2.000000      68
3.000000    2640
4.000000    1645
5.000000    1550
6.000000     215
nan         4934
Name: AAFAM, dtype: int64
Categorization confirmation:
      AAFAM    S2DQ1    S2DQ2   ...      S2DQ12  S2DQ13A  S2DQ13B
0  1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
1  5.000000 1.000000 0.000000   ...    0.000000 1.000000 0.000000
6  5.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
14      nan 1.000000 0.000000   ...         nan      nan 0.000000
19 4.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
20 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
25 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
27      nan 0.000000 0.000000   ...         nan      nan      nan
28 4.000000 0.000000 0.000000   ...    0.000000 1.000000 0.000000
33      nan 0.000000 0.000000   ...    0.000000      nan 0.000000
37 5.000000 1.000000 0.000000   ...    0.000000 1.000000 0.000000
50 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
58      nan 1.000000 0.000000   ...         nan      nan      nan
60 5.000000 0.000000 1.000000   ...    0.000000 0.000000 0.000000
62      nan 1.000000 0.000000   ...         nan      nan      nan
63 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
64      nan 1.000000 0.000000   ...         nan      nan      nan
70 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
74      nan 0.000000 0.000000   ...    0.000000 1.000000 1.000000
75      nan 1.000000 0.000000   ...    0.000000      nan 0.000000
76 4.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
85      nan 0.000000 0.000000   ...    0.000000 1.000000 1.000000
87      nan 1.000000 0.000000   ...         nan 0.000000 0.000000
90 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
91      nan 1.000000 0.000000   ...         nan      nan      nan
[25 rows x 11 columns]
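Looking ahead to the secondary research question, these AAFAM categories could eventually be crossed against the binned drinks-per-month variable. A minimal sketch of what that comparison might look like (not run as part of this step; it assumes AAFAM and the binned DRINKMO from above both exist on subaafam2):

#Row-normalized crosstab: within each family-relationship category, the share of
#respondents falling into each drinks-per-month bin
ct = pd.crosstab(subaafam2["AAFAM"], subaafam2["DRINKMO"], normalize='index')
print(ct.round(3))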
Week 2
Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
I’m doing more than just three frequency tables, and I’ll explain the logic behind each one.
To answer my research question, I first need to verify that a substantial percentage of the total population has never had alcohol abuse/dependence, either in the last 12 months or before that. I check this here:
import pandas
import numpy
nesarc = pandas.read_csv('NESARC Data.csv', low_memory=False)
#To avoid runtime errors with the display format (prints floats in plain notation)
pandas.set_option('display.float_format', lambda x: '%f' % x)
print("0. No alcohol diagnosis" + "\n"      "1. Alcohol abuse only" + "\n"      "2. Alcohol dependence only" + "\n"      "3. Alcohol abuse and dependence") print("Percentage of Alcohol abuse/dependence in last 12 months") aa12_p = nesarc['ALCABDEP12DX'].value_counts(sort=False, normalize=True) print(aa12_p)
# Alcohol abuse/dependence prior to last 12 months / 3649-3649  ALCABDEPP12DX
print("Percentage of Alcohol abuse/dependence prior to last 12 months")
aa0_p = nesarc['ALCABDEPP12DX'].value_counts(sort=False, normalize=True)
print(aa0_p)
0. No alcohol diagnosis
1. Alcohol abuse only
2. Alcohol dependence only
3. Alcohol abuse and dependence
Percentage of Alcohol abuse/dependence in last 12 months
0   0.922795
1   0.042768
2   0.012833
3   0.021604
Name: ALCABDEP12DX, dtype: float64
Percentage of Alcohol abuse/dependence prior to last 12 months
0   0.735085
1   0.162300
2   0.013065
3   0.089551
Name: ALCABDEPP12DX, dtype: float64
These statistics confirm that a large percentage of the data set has no alcohol diagnosis (0), so the sample will still be plenty large after I exclude everyone with any history of alcohol abuse/dependence. To do this, I took a subset of the data.
subaa = nesarc[(nesarc['ALCABDEP12DX']==0) & (nesarc['ALCABDEPP12DX']==0)]
subaa1=subaa.copy()
Next, I want to look at people with a reported family history of alcohol abuse. I just need them to have answered yes (1) to at least one of the questions asking about an elder relative, so I strung everything together with “or” (|) statements. I then wanted to look at the percentage of people with no previous alcohol abuse/dependence who have an elder family member with alcohol abuse/dependence. For relative reference:
Alcoholic dad / 653-653   S2DQ1
Alcoholic mom / 654-654  S2DQ2
Alcoholic uncle / 679-679, 689-689  S2DQ7C2 S2DQ9C2
Alcoholic aunt / 684-684, 694-694  S2DQ8C2   S2DQ10C2
Alcoholic grandpa / 695-695, 697-697  S2DQ11 S2DQ13A
Alcoholic grandma / 696-696, 698-698  S2DQ12 S2DQ13B
subaafam20 = subaa1[(subaa1['S2DQ1']==1) | (subaa1['S2DQ2']==1) |
        (subaa1['S2DQ7C2']==1) | (subaa1['S2DQ9C2']==1) | (subaa1['S2DQ8C2']==1) |
        (subaa1['S2DQ10C2']==1) | (subaa1['S2DQ11']==1) | (subaa1['S2DQ13A']==1) |
        (subaa1['S2DQ12']==1) | (subaa1['S2DQ13B']==1)]
subaafam2=subaafam20.copy()
#Then, I will compare the length of this data set to the length of the
#non-diagnosed data set to get a percentage.
print("The percentage of those with no history of alcohol dependence/abuse with a known, elder relative with alcoholism:")
print(len(subaafam2)/len(subaa1)*100)

The percentage of those with no history of alcohol dependence/abuse with a known, elder relative with alcoholism:
37.9968
I will also need a separate subset for those with no known alcoholic family member (answered no, or 2, to every question about an alcoholic family member). I am excluding anyone for whom any relative’s status is unknown. This is an & subset, since the condition needs to apply to every family member.
subaafam40 = subaa1[(subaa1['S2DQ1']==2) & (subaa1['S2DQ2']==2) &
        (subaa1['S2DQ7C2']==2) & (subaa1['S2DQ9C2']==2) & (subaa1['S2DQ8C2']==2) &
        (subaa1['S2DQ10C2']==2) & (subaa1['S2DQ11']==2) & (subaa1['S2DQ13A']==2) &
        (subaa1['S2DQ12']==2) & (subaa1['S2DQ13B']==2)]
subaafam4=subaafam40.copy()
print("The percentage of those with no history of alcohol dependence/abuse with no family history of alcoholism:") print(len(subaafam4)/len(subaa1)*100) The percentage of those with no history of alcohol dependence/abuse with no family history of alcoholism: 37.2864
So, a similar percentage of respondents have a family history of alcohol abuse/dependence as don’t. Respondents for whom any family member’s alcohol abuse/dependence is unknown have been dropped from both subsets, which is why the two percentages don’t sum to 100%.
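To make that concrete, the share of the no-diagnosis group that was dropped because at least one relative’s status was unknown can be computed directly from the subsets above (a small sketch):

dropped = 1 - (len(subaafam2) + len(subaafam4)) / len(subaa1)
print("Percentage dropped due to unknown family history:")
print(dropped * 100) #roughly 100 - 38.0 - 37.3, i.e. about 24.7%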
Now I’ll use these subsets to look at the data.
First, I want to look at the percentage distribution of responses for the number of alcoholic drinks on days when drinking, for those with no personal history of alcohol abuse/dependence and no family history of alcohol abuse/dependence. The number in the first column is the number of drinks.
subaafam4['S2AQ8B'] = pandas.to_numeric(subaafam4['S2AQ8B'], errors='coerce').fillna(0).astype(int)
print("Percentage of # drinks on days when drinking for those with NO family history:")
p2 = subaafam4['S2AQ8B'].value_counts(sort=False, dropna=False, normalize=True)
print(p2.sort_index())

Percentage of # drinks on days when drinking for those with NO family history:
0    0.471850
1    0.263560
2    0.157140
3    0.057587
4    0.020512
5    0.006351
6    0.011929
7    0.001459
8    0.001459
9    0.000429
10   0.000944
12   0.002060
15   0.000172
17   0.000086
20   0.000086
24   0.000086
99   0.004291
Name: S2AQ8B, dtype: float64
The largest percentage of people in this data set abstain from drinking (0). 99 accounts for people where drinking is unknown. 
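As a side note, the 99 “unknown” responses currently sit in the table as if they were a drink count. A minimal sketch of how they could be treated as missing before computing the percentages (just an option, not what was run above; it relies on the numpy import at the top of this post):

#Recode "unknown" (99) to NaN so the percentages only cover known responses
known_drinks = subaafam4['S2AQ8B'].replace(99, numpy.nan)
p2_known = known_drinks.value_counts(normalize=True)
print(p2_known.sort_index())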
For those with family history:
subaafam2['S2AQ8B'] = pandas.to_numeric(subaafam2['S2AQ8B'],errors='coerce').fillna(0).astype(int)
print("Percentage of # drinks on days when drinking for those with family history:") p1 = subaafam2['S2AQ8B'].value_counts(sort=False, dropna=False, normalize=True) print(p1.sort_index()) Percentage of # drinks on days when drinking for those with family history: 0    0.382348 1    0.296109 2    0.182500 3    0.073859 4    0.026697 5    0.010190 6    0.015580 7    0.002358 8    0.002358 9    0.000421 10   0.001937 12   0.001600 13   0.000253 14   0.000084 15   0.000337 20   0.000084 24   0.000168 48   0.000084 99   0.003032 Name: S2AQ8B, dtype: float64
Perhaps interestingly, the group with family history of alcoholism has a somewhat lower percentage of people who abstain from drinking. There seems to be a slight increase in people who have 1-2 drinks. Overall, there’s no striking difference with the previous data set, and it’s hard to gauge without further analysis.
Finally, I wanted to look at the percent distribution of responses for how often these individuals reported drinking in the last 12 months.
print("Key:" + "\n"    "0. NA, former drinker or lifetime abstainer" +"\n"    "1. Every day" +"\n"    "2. Nearly every day" +"\n"    "3. 3 to 4 times a week" +"\n"    "4. 2 times a week" +"\n"    "5. Once a week" +"\n"    "6. 2 to 3 times a month" +"\n"    "7. Once a month" +"\n"    "8. 7 to 11 times in the last year" +"\n"    "9. 3 to 6 times in the last year" +"\n"    "10. 1 or 2 times in the last year" +"\n"    "99. Unknown")
subaafam2['S2AQ8A'] = pandas.to_numeric(subaafam2['S2AQ8A'],errors='coerce').fillna(0).astype(int)
print("Percentage of alcohol consumption frequency for those with family history of alcoholism:") p3 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p3.sort_index())
subaafam4['S2AQ8A'] = pandas.to_numeric(subaafam4['S2AQ8A'],errors='coerce').fillna(0).astype(int)
print("Percentage of alcohol consumption frequency for those with NO family history of alcoholism:") p4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p4.sort_index()) Key: 0. NA, former drinker or lifetime abstainer 1. Every day 2. Nearly every day 3. 3 to 4 times a week 4. 2 times a week 5. Once a week 6. 2 to 3 times a month 7. Once a month 8. 7 to 11 times in the last year 9. 3 to 6 times in the last year 10. 1 or 2 times in the last year 99. Unknown Percentage of alcohol consumption frequency for those with family history of alcoholism: 0    0.382348 1    0.029308 2    0.016928 3    0.043119 4    0.045393 5    0.064511 6    0.081270 7    0.064426 8    0.050109 9    0.104514 10   0.115968 99   0.002105 Name: S2AQ8A, dtype: float64 Percentage of alcohol consumption frequency for those with NO family history of alcoholism: 0    0.471850 1    0.030639 2    0.017079 3    0.035530 4    0.049176 5    0.064796 6    0.067456 7    0.061535 8    0.037590 9    0.069945 10   0.089341 99   0.005064 Name: S2AQ8A, dtype: float64
Once again, as we would expect given the previous data set, we see the same trend that the individuals with no family history of alcoholism have a higher percentage of those abstaining from alcohol than those with family history. On their face, the other percentages seem fairly similar.
In future data analysis, I might need to combine the number of drinks consumed with the alcohol consumption frequency to form a rate for each individual. That rate would give a better picture of each person’s typical alcohol consumption.
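A rough sketch of what that rate calculation might look like. The drinking-days-per-year values below are my own approximations for each S2AQ8A code, not an official recode, so the exact numbers would need to be revisited:

#Map each S2AQ8A frequency code to an approximate number of drinking days per year
days_per_year = {1: 365, 2: 286, 3: 182, 4: 104, 5: 52,
                 6: 30, 7: 12, 8: 9, 9: 4.5, 10: 1.5}

drinks = subaafam2['S2AQ8B'].replace(99, numpy.nan)   #treat "unknown" as missing
freq = subaafam2['S2AQ8A'].map(days_per_year)         #codes 0 and 99 drop out as NaN
drinks_per_month = drinks * freq / 12                 #approximate drinks per month
print(drinks_per_month.describe())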
Data Management and Visualization Week 1: Topic Selection
1) Data set: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)
2) Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Primary hypothesis: Despite having no history of alcohol abuse or dependence, having a familial history of alcoholism increases the rate of alcohol consumption.
Secondary research question: Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
Secondary hypothesis:  A close relationship with someone with alcohol dependence is more likely to result in an increased rate of alcohol consumption.
3) Literature Review:
Google Scholar Search: “family history alcoholism affecting alcohol consumption”
O’Malley, S. S., et al. Effects of Family Drinking History and Expectancies on Responses to Alcohol in Men. Journal of Studies on Alcohol. 1985. 46: 289-297.
24 men with history of parental alcoholism compared with 24 matched controls without parental alcoholism
Nonproblem drinkers
Those with parental alcoholism reported feeling less drunk than those without after drinking the same amount, though tolerance levels and blood alcohol levels (BAL) were similar.
Cotton, N. S. The familial incidence of alcoholism: a review. Journal of Studies on Alcohol. 1979. 1: 89-116.
Reviewed lots of other studies.
Across the reviewed studies, people from families with a history of alcoholism were more likely to become alcoholics themselves.
 Google Scholar Search: “alcoholic parent alcohol consumption”
 Dube, S. R., et al. Adverse childhood experiences and personal alcohol abuse as an adult. Addictive Behaviors. 2002. 27: 713-725.
Alcohol abuse linked to childhood abuse and family dysfunction
Not much information about adverse childhood experiences linked to later alcohol abuse of individual
Tested 8 different adverse childhood experiences. Experiencing these were all associated with higher incidence of alcohol abuse.
Compared to people with none of these experiences, the association with alcoholism was doubled for those who had one. For those with multiple, it increased four-fold.
Looked at verbal abuse, physical abuse, sexual abuse, battered mother, household substance abuse, mental illness in household, parental separation or divorce, incarcerated household members
Found that specifically for alcoholism, if you had one parent who was an alcoholic, you were much more likely to have an alcohol problem
Less prominent in women and men, but the trend held
Goodwin, D. W., et al. Alcohol problems in adoptees raised apart from alcoholic biological parents. Archives of General Psychiatry. 1973. 28: 238-243.
Only looked at 55 men
Children separated from alcoholic parent.
Still much more likely than the control to develop into alcoholics.
4) Literature Review Summary: 
All the literature indicates that people who had alcoholic parents are more likely to become alcoholics and tend to feel less drunk on the same amount of alcohol compared to a control group. Both men and women were considered in these studies. 
While I am going to try to center my study on those that do not have a reported alcohol dependence or abuse problem, this literature survey suggests that they will likely still be genetically predisposed to drink more.
Needed data outline:
Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
·       Hypothesis: Despite having no history or alcohol abuse or dependence, having a familial history with alcoholism increases rate of alcohol consumption.
·       Needed data:
o   Alcohol consumption rate
§  Have had at least 1 drink / 313-313
§  Drink at least 12 drinks in 12 months / 314-314
§  Drink at least 1 drink in 12 months / 315-315
§  Drinking status / 316-316
§  How often drank alcohol in last 12 months / 397-398
§  Number of drinks of alcohol on days when drinking / 399-400
§  Largest number of drinks 12 months / 401-402
§  How often drank large number / 403-404
§  How often drank 5+ / 405-406
§  How often drank 4+ / 407-408
o   Family member was an alcoholic
§  Only considering relatives of previous generation or before. Not siblings or children.
§  Alcoholic dad / 653-653
§  Alcoholic mom / 654-654
§  Alcoholic uncle / 679-679, 689-689
§  Alcoholic aunt / 684-684, 694-694
§  Alcoholic grandpa / 695-695, 697-697
§  Alcoholic grandma / 696-696, 698-698
o   Does this hold true for people who are not currently alcoholics?
§  Alcohol abuse/dependence in last 12 months / 3648-3648
§  Alcohol abuse/dependence prior to last 12 months / 3649-3649
·       Correcting factors:
o   Income (correct primary data set if possible)
§  179-185
§  186-187
§  188-188
§  189-195
§  196-197
§  198-198
§  199-205
§  206-207
§  208-208
o   Age (correct primary data set if possible)
§  71-72
§  73-74
§  75-78
o   Ethnicity (might need to account for ethnicities that drink less)
§  81-81 until 89-90
o   Sex (not sure if this matters)
§  79-79
·       Secondary question:
o   Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
§  Hypothesis: A close relationship with someone with alcohol dependence is more likely to result in an increased rate of alcohol consumption.
§  Focus on those with biological parents since only 3% didn’t live with a  biological parent
§  Needed data:
·       Parental association growing up (Whether one or both parents was present during upbringing):
o   Lived with at least 1 biological parent before age 18. / 94-94
o   Biological father ever lived in household before respondent was 18. / 95-95
o   Did biological or adoptive parents get divorced before 18? / 101-101
o   Age when biological parents stopped living together / 102-103
o   Ever lived with a step-parent before age 18 / 105-105
o   Age when started living with step parent / 106-107
The family-history variables listed earlier cover the rest of the needed data; a rough loading sketch follows below.
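To turn this outline into code, a rough starting point might be to load only the planned columns. The variable names below correspond to the codebook positions listed above (they are the names used in the analysis posts); the age, sex, and income variables would be added once their exact names are confirmed, so treat this as a hypothetical sketch:

import pandas

cols = ['ALCABDEP12DX', 'ALCABDEPP12DX',             #personal abuse/dependence history
        'S2AQ8A', 'S2AQ8B',                           #drinking frequency and usual quantity
        'S2DQ1', 'S2DQ2',                             #alcoholic father / mother
        'S2DQ7C2', 'S2DQ8C2', 'S2DQ9C2', 'S2DQ10C2',  #alcoholic uncles / aunts
        'S2DQ11', 'S2DQ12', 'S2DQ13A', 'S2DQ13B']     #alcoholic grandparents

nesarc = pandas.read_csv('NESARC Data.csv', usecols=cols)
print(nesarc.shape)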