kpenlearnspython-blog
Hopefully learning python
Lesson 3 Week 4
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
Hypothesis for this lesson:
Those who drink more than 30 drinks a month are more likely to have a documented family history of alcoholism.
Summary:
I used a binary response variable for whether the individual consumed more than 30 drinks/month (yes=1, no=0), with explanatory variables of family history of alcoholism (yes=1, no=0), sex, centered income, and centered age.
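For reference, the centering itself was done roughly like this; the income line matches the code in my Lesson 3 Week 2 post below, while the AGE and AGE_c lines are my assumption of how the centered age variable was built (AGE comes from 2002 minus DOBY, as in the moderators post):

subaa1["ADJPERSINCOME"] = subaa1["S1Q10A"] - subaa1["S1Q10A"].mean()  #income centered around the mean
subaa1["AGE"] = 2002 - subaa1["DOBY"]  #age at the time of the survey
subaa1["AGE_c"] = subaa1["AGE"] - subaa1["AGE"].mean()  #age centered around the mean (my assumed construction)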
After adjusting for these factors, the odds of drinking more than 30 drinks/month were approximately the same for those with an alcoholic family history as for those without (OR=1.01, 95% CI: 1.000989-1.018870). This was technically statistically significant, at p=0.029, but it only became significant when the additional variables were factored in (considering only family history of alcoholism, the p-value is 0.964).
Income, sex, and age were also statistically significant, with p-values of approximately 0, but looking at the odds ratios, the differences between groups appear fairly slight. Income has an OR of essentially 1, sex an OR of 0.934 with a 95% CI of 0.928974-0.939662 (so the probability of drinking more than 30 drinks/month is lower in men than in women), and age also has an OR of roughly 1.
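Side note: the summaries below are from smf.ols (a linear probability model), and the odds ratios shown are exponentiated OLS coefficients. A true logistic regression could be fit with smf.logit instead. This is only a sketch of that alternative, using the same column names; it is not what I ran:

import numpy as np
import statsmodels.formula.api as smf

lreg1 = smf.logit("DAYDRINKING ~ ADJPERSINCOME + SEX + AGE_c + AAFAM", data=subaa1).fit()
print(lreg1.summary())
print(np.exp(lreg1.params))  #odds ratios
print(np.exp(lreg1.conf_int()))  #95% confidence intervals for the odds ratios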
Do the results support the hypothesis?
Technically, the data show a statistically significant difference, and that supports my hypothesis. However, they also show that the difference is very small.
Evidence of confounding?
Somewhat. Family history of alcoholism only has a p-value below 0.05 when sex and age are factored in; without them, it is not significant. That's more like the opposite of confounding, but I still think it's interesting. It makes me wonder whether the effect is real.
Regression output:
                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     165.2
Date:                Fri, 22 Mar 2019   Prob (F-statistic):          3.35e-140
Time:                        16:39:20   Log-Likelihood:                 138.17
No. Observations:               31250   AIC:                            -266.3
Df Residuals:                   31245   BIC:                            -224.6
Df Model:                           4
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.1062      0.002     45.255      0.000       0.102       0.111
ADJPERSINCOME  2.076e-07   3.62e-08      5.730      0.000    1.37e-07    2.79e-07
SEX              -0.0680      0.003    -23.287      0.000      -0.074      -0.062
AGE_c             0.0004   7.22e-05      4.858      0.000       0.000       0.000
AAFAM             0.0098      0.005      2.179      0.029       0.001       0.019
==============================================================================
Omnibus:                    21705.905   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           204857.060
Skew:                           3.476   Prob(JB):                         0.00
Kurtosis:                      13.441   Cond. No.                     1.28e+05
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.28e+05. This might indicate that there are strong multicollinearity or other numerical problems.

Odds Ratio
Intercept       1.112039
ADJPERSINCOME   1.000000
SEX             0.934303
AGE_c           1.000351
AAFAM           1.009890
dtype: float64

               Lower CI  Upper CI        OR
Intercept      1.106936  1.117165  1.112039
ADJPERSINCOME  1.000000  1.000000  1.000000
SEX            0.928974  0.939662  0.934303
AGE_c          1.000209  1.000492  1.000351
AAFAM          1.000989  1.018870  1.009890

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.020
Model:                            OLS   Adj. R-squared:                  0.020
Method:                 Least Squares   F-statistic:                     209.1
Date:                Fri, 22 Mar 2019   Prob (F-statistic):          2.70e-134
Time:                        16:39:20   Log-Likelihood:                 121.76
No. Observations:               31250   AIC:                            -235.5
Df Residuals:                   31246   BIC:                            -202.1
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1082      0.002     46.653      0.000       0.104       0.113
AAFAM          0.0099      0.005      2.200      0.028       0.001       0.019
SEX           -0.0711      0.003    -24.823      0.000      -0.077      -0.066
AGE_c          0.0004   7.22e-05      5.033      0.000       0.000       0.001
==============================================================================
Omnibus:                    21737.572   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           205558.677
Skew:                           3.482   Prob(JB):                         0.00
Kurtosis:                      13.458   Cond. No.                         63.4
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     5.454
Date:                Fri, 22 Mar 2019   Prob (F-statistic):            0.00428
Time:                        16:39:20   Log-Likelihood:                -183.37
No. Observations:               31250   AIC:                             372.7
Df Residuals:                   31247   BIC:                             397.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0631      0.001     43.366      0.000       0.060       0.066
AGE_c          0.0002   7.27e-05      3.302      0.001    9.76e-05       0.000
AAFAM          0.0015      0.005      0.333      0.739      -0.007       0.010
==============================================================================
Omnibus:                    22290.028   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           220736.719
Skew:                           3.586   Prob(JB):                         0.00
Kurtosis:                      13.867   Cond. No.                         63.4
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:            DAYDRINKING   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                  0.002066
Date:                Fri, 22 Mar 2019   Prob (F-statistic):              0.964
Time:                        16:39:20   Log-Likelihood:                -188.82
No. Observations:               31250   AIC:                             381.6
Df Residuals:                   31248   BIC:                             398.3
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0633      0.001     43.512      0.000       0.060       0.066
AAFAM         -0.0002      0.005     -0.045      0.964      -0.009       0.009
==============================================================================
Omnibus:                    22300.061   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           221025.956
Skew:                           3.588   Prob(JB):                         0.00
Kurtosis:                      13.874   Cond. No.                         3.32
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Raw code:
parents = np.dstack((subaa1["S2DQ1"], subaa1["S2DQ2"])) #making variables of interest into a single array
print(parents.shape) #double checking
parents2 = np.nansum(parents, 2) #adding the two elements together on the correct dimension
print(parents2.T.shape) #checking that if I transpose it again it's now in a column once more
subaa1["AAFAMPAR2"] = parents2.T #adding the column into the dataset as a new variable
extfam = np.dstack((subaa1["S2DQ7C2"], subaa1["S2DQ8C2"], subaa1["S2DQ9C2"], subaa1["S2DQ10C2"],
                    subaa1["S2DQ11"], subaa1["S2DQ12"], subaa1["S2DQ13A"], subaa1["S2DQ13B"]))
print(extfam.shape)
extfam2 = np.nansum(extfam, 2)
print(extfam2.shape)
subaa1["AAFAMEXT2"] = extfam2.T

def AAFAM(row):
    if row["AAFAMPAR2"] >= 1 and row["AAFAMEXT2"] >= 1:
        return 1 #alcoholic family
    else:
        return 0 #no known alcoholic family history

subaa1["AAFAM"] = subaa1.apply(lambda row: AAFAM(row), axis=1)

def DAYDRINKING(row):
    if row["DRINKMO"] > 30:
        return 1 #drinks approximately every day or more
    else:
        return 0 #drinks less than every day

subaa1["DAYDRINKING"] = subaa1.apply(lambda row: DAYDRINKING(row), axis=1)

reg1 = smf.ols("DAYDRINKING ~ ADJPERSINCOME + SEX + AGE_c + AAFAM", data=subaa1).fit()
print(reg1.summary())
print("Odds Ratio")
print(np.exp(reg1.params))
#Can also get a confidence interval
params1 = reg1.params
conf1 = reg1.conf_int()
conf1["OR"] = params1
conf1.columns = ["Lower CI", "Upper CI", "OR"]
#The confidence intervals for the explanatory variables overlap. Cannot tell which one is more strongly associated.
print(np.exp(conf1))

reg2 = smf.ols("DAYDRINKING ~ AAFAM + SEX + AGE_c", data=subaa1).fit()
print(reg2.summary()) #AAFAM is still significant without income (p=0.028)

reg3 = smf.ols("DAYDRINKING ~ AGE_c + AAFAM", data=subaa1).fit()
print(reg3.summary()) #SEX confounds income and alcoholic family; AAFAM is no longer significant here (p=0.739)

reg4 = smf.ols("DAYDRINKING ~ AAFAM", data=subaa1).fit()
print(reg4.summary()) #AAFAM by itself is not significant (p=0.964)
Lesson 3 Week 3
Full disclosure, I’m sick this week, and I don’t have the energy to make this data look any better. In summary, as best I can tell, these things don’t correlate very well. If you have any advice for something obvious I did wrong that resulted in such poor fit, I’d appreciate the feedback.
Second disclosure: I just watched the first video for next week and learned that since ALCCHOICE is a categorical variable with more than two levels, the way I treated it here was incorrect. I missed this in the Week 3 lectures. So, while everything executed okay, the analysis itself was done incorrectly.
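For what it's worth, here is a minimal sketch of how I think the corrected version should look, wrapping ALCCHOICE in C() so the formula dummy-codes it rather than treating it as a number (this uses the subaa1_2 subset built in the raw code at the bottom of this post; I have not re-run the analysis this way):

modeltest3_cat = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c + C(ALCCHOICE)", data=subaa1_2).fit()
print(modeltest3_cat.summary())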
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
“DRINKMO” is my number of drinks/month category. subaa1 has already been segmented from the primary data to contain only those who have no reported alcohol abuse or dependence.
I decided to just stick with three explanatory variables (two quantitative, one categorical) to see what influence they would have. I initially tried to factor in more, but it was too much data and the last graph wouldn't run.
Summary:
I wanted to look at how the explanatory variables total household income, age, and alcohol choice (non-hard, hard, both, non-drinking) would affect the response variable of number of drinks consumed per month. Total household income was my primary explanatory variable.
When I initially looked only at the relationship between total household income centered around the mean and drinks/month, I got a very low p-value (~0) but a really small coefficient of 3.482e-05. The R-squared value is also very low at 0.005. So, while these things are correlated, the relationship doesn't seem very strong. I did try to adjust this by introducing a quadratic term, but it didn't really improve the model much.
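For clarity, the quadratic version looked roughly like this (a sketch of the approach, using I() so the squaring happens inside the formula; not necessarily my exact code):

modelquad = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + I(ADJPERSINCOME**2)", data=subaa1_2).fit()
print(modelquad.summary())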
I next incorporated age centered around the mean to see if it would change anything. With this, the p-values for both age and income were ~0. Age had a somewhat larger coefficient at -0.0272, while income remained low at 3.506e-05. These two variables do not appear to affect each other much, and the R-squared value remained the same.
When I added the additional variable of alcohol choice, age was no longer significant (suggesting that alcohol choice may be a confounding variable for age). While alcohol choice and total household income still had p-values of ~0, the age p-value changed to 0.451. The alcohol choice coefficient was -2.9505, making it the explanatory variable with what looks like the strongest influence. The income coefficient remained low at 2.659e-05. The R-squared improved slightly, to 0.044, so the model now explains about 4% of the variability.
Did the results support the hypothesis between the primary explanatory and response variable?
I have not found a confounding variable for the association between income and drinks/month, but the coefficients are very weak, and the Q-Q plot shows that the residuals are clearly not following a normal distribution. The standardized residuals also show that MANY values fall farther from the mean than expected.
Evidence of confounding?
There is evidence that the type of alcohol consumed acts as a confounding variable for age, as the p-value for age increases once the type of alcohol consumed is considered.
Regression Diagnostic plots:
[Figure: Q-Q plot of residuals, full model]
My QQ-plot with all of the variables considered clearly shows deviation at the higher theoretical quantiles. I tried switching things to quadratics, and that didn’t help. But clearly, there are other variables at play that we’re not accounting for here. This is sort of the general theme for all of this data...
[Figure: standardized residuals plot]
As for the standardized residuals, this plot is supposed to tell you whether the residuals fall within the expected number of standard deviations from the mean. It's pretty clear that more observations than acceptable fall outside an absolute value of 2.5. The current model isn't great, and I'm probably missing important variables or confounding variables.
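As a quick numerical check (not part of my original code, just a sketch), the share of observations beyond +/-2.5 standard deviations can be computed directly from resid_pearson on the full model (modeltest3 in the raw code at the end of this post); under normality only about 1% should fall that far out:

import numpy as np

stdres_full = modeltest3.resid_pearson  #standardized residuals from the full model
frac_extreme = np.mean(np.abs(stdres_full) > 2.5)  #proportion of observations beyond +/-2.5 standard deviations
print(round(100 * frac_extreme, 1), "% of standardized residuals exceed +/-2.5")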
[Figure: regression diagnostic plot]
This is my regression diagnostic for income. It's hard to get much from this since the fit is poor; the relationship does not look linear.
[Figure: regression diagnostic plot]
Once again, this has the same issue with really high residuals. Adding this variable to the rest of the data doesn’t seem to make it a linear relationship (though it does look better than the previous plot). Once again, I’m clearly missing additional explanatory variables.
[Figure: regression diagnostic plot]
There are some very high residuals. It almost looks like a linear relationship, but this is the one variable I know isn't statistically significant.
[Figure: influence plot]
This is sort of the icing on the cake in terms of demonstrating that this model is bad. There are so many things considered outliers that it’s really hard to evaluate anything.
Multiple regression output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     148.6
Date:                Sun, 17 Mar 2019   Prob (F-statistic):           4.15e-34
Time:                        12:01:35   Log-Likelihood:            -1.3601e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31050   BIC:                         2.720e+05
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.094      0.000       6.373       6.803
ADJPERSINCOME  3.482e-05   2.86e-06     12.191      0.000    2.92e-05    4.04e-05
==============================================================================
Omnibus:                    40363.894   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10592652.775
Skew:                           7.209   Prob(JB):                         0.00
Kurtosis:                      92.326   Cond. No.                     3.84e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     85.52
Date:                Sun, 17 Mar 2019   Prob (F-statistic):           9.11e-38
Time:                        12:05:05   Log-Likelihood:            -1.3600e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31049   BIC:                         2.720e+05
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.116      0.000       6.374       6.803
ADJPERSINCOME  3.506e-05   2.86e-06     12.278      0.000    2.95e-05    4.07e-05
AGE_c            -0.0272      0.006     -4.725      0.000      -0.038      -0.016
==============================================================================
Omnibus:                    40351.209   Durbin-Watson:                   1.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10568506.250
Skew:                           7.206   Prob(JB):                         0.00
Kurtosis:                      92.223   Cond. No.                     3.84e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     476.4
Date:                Sun, 17 Mar 2019   Prob (F-statistic):          1.12e-302
Time:                        12:07:12   Log-Likelihood:            -1.3538e+05
No. Observations:               31052   AIC:                         2.708e+05
Df Residuals:                   31048   BIC:                         2.708e+05
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        11.8562      0.184     64.563      0.000      11.496      12.216
ADJPERSINCOME  2.659e-05   2.81e-06      9.463      0.000    2.11e-05    3.21e-05
AGE_c             0.0043      0.006      0.753      0.451      -0.007       0.015
ALCCHOICE        -2.9505      0.083    -35.374      0.000      -3.114      -2.787
==============================================================================
Omnibus:                    40770.198   Durbin-Watson:                   1.991
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         11585537.749
Skew:                           7.316   Prob(JB):                         0.00
Kurtosis:                      96.490   Cond. No.                     7.02e+04
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 7.02e+04. This might indicate that there are strong multicollinearity or other numerical problems.
All raw code:
subaa1_2=subaa1[["DRINKMO","ADJPERSINCOME","SEX","ETH", "AGE_c", "AAFAM", "ALCCHOICE"]].dropna()
modeltest1 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa1_2).fit()
print(modeltest1.summary())
fig1 = sm.qqplot(modeltest1.resid, line='r') #There is really high deviation at the upper end, somewhat at the lower end
stdres1 = pd.DataFrame(modeltest1.resid_pearson) #convert the array of standardized residuals from modeltest1 to a dataframe
fig1_1 = plt.plot(stdres1, 'o', ls='None') #plot the standardized residuals. 'o' uses dots; ls='None' tells python not to connect the markers
l = plt.axhline(y=0, color='r') #draws a horizontal line on the graph
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig1_1)
#Tons are falling out of the residual plot, so the current model is unacceptable. Does adding more help?

modeltest2 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c", data=subaa1_2).fit()
print(modeltest2.summary())
fig2 = sm.qqplot(modeltest2.resid, line='r')
stdres2 = pd.DataFrame(modeltest2.resid_pearson) #standardized residuals from modeltest2
fig2_1 = plt.plot(stdres2, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig2_1)

modeltest3 = smf.ols(formula="DRINKMO ~ ADJPERSINCOME + AGE_c + ALCCHOICE", data=subaa1_2).fit()
print(modeltest3.summary())
fig3 = sm.qqplot(modeltest3.resid, line='r')
stdres3 = pd.DataFrame(modeltest3.resid_pearson) #standardized residuals from modeltest3
fig3_1 = plt.plot(stdres3, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig3_1)
#when I add ALCCHOICE in, age is no longer relevant.
fig3_2 = plt.figure(figsize=(12,8)) #numbers specify the size of the plot image in inches
fig3_2 = sm.graphics.plot_regress_exog(modeltest3, "AGE_c", fig=fig3_2) #pass the fitted model and the explanatory variable to plot
print(fig3_2)

fig3_3 = sm.graphics.influence_plot(modeltest3, size=8)
print(fig3_3)
Lesson 3 Week 2
Truthfully, I’m not sure if I understood this assignment since I thought we were going to learn how to deal with confounding variables, but this seems to just be asking for general linear regression. That I can do, but I feel like I’m missing something?
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
One confounding variable that I could see affecting number of drinks consumed per month vs family history is personal income. I wanted to look at the linear regression of number of drinks/month (response variable) vs the personal income (explanatory variable).
subaa1 is my main dataset (in previous posts) where I have filtered out people with a history of alcohol abuse/dependence.
I centered the personal income data around the mean.
subaa1["ADJPERSINCOME"]=(subaa1["S1Q10A"]-subaa1["S1Q10A"].mean())
I then just narrowed my dataset to the DRINKMO data (which I have previously derived as the average # of drinks per month for each participant) and ADJPERSINCOME. I made sure everything was numeric.
subaa14=subaa1[["DRINKMO","ADJPERSINCOME"]].dropna()
subaa14["DRINKMO"]=pd.to_numeric(subaa14["DRINKMO"])
subaa14["ADJPERSINCOME"]=pd.to_numeric(subaa14["ADJPERSINCOME"])
Then I ran the regression analysis.
modeldrinkincome = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa14).fit()
print(modeldrinkincome.summary())
sb.regplot(y="DRINKMO", x="ADJPERSINCOME", data=subaa14)
plt.xlabel("Income Adjusted Around Mean")
plt.ylabel("Number of drinks/month")
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     148.6
Date:                Sun, 10 Mar 2019   Prob (F-statistic):           4.15e-34
Time:                        12:18:56   Log-Likelihood:            -1.3601e+05
No. Observations:               31052   AIC:                         2.720e+05
Df Residuals:                   31050   BIC:                         2.720e+05
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         6.5884      0.110     60.094      0.000       6.373       6.803
ADJPERSINCOME  3.482e-05   2.86e-06     12.191      0.000    2.92e-05    4.04e-05
==============================================================================
Omnibus:                    40363.894   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10592652.775
Skew:                           7.209   Prob(JB):                         0.00
Kurtosis:                      92.326   Cond. No.                     3.84e+04
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
[Figure: regression plot of drinks/month vs mean-centered income]
The overall p-value is 4.15e-34, so the model is statistically significant. The regression coefficient is 3.482e-5, which is the slope of the line. The p-value for the explanatory variable is reported as 0 (probably not exactly, but that's the closest it could give me), so it is well below 0.05. So, while income does seem to have a positive relationship with number of drinks/month, it's very weak. This is further emphasized by the R-squared value of 0.005, which is so far from 1 that it indicates the relationship just isn't very strong.
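To put that coefficient in perspective (my own back-of-the-envelope arithmetic): a slope of 3.482e-05 drinks per dollar means that a $10,000 difference in income predicts only about 3.482e-05 × 10,000 ≈ 0.35 additional drinks per month.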
I tried to adjust this data down to a certain income range and only for people that drink, as I worried that the higher values weren’t truly representative and were skewing the data. 
subaa14_1 = subaa14[(subaa14["ADJPERSINCOME"]<=50000) & (subaa14["DRINKMO"]>=1)]
subaa14_1["ADJPERSINCOME"].max()
subaa14_1["DRINKMO"].min()
modeldrinkincome = smf.ols(formula="DRINKMO ~ ADJPERSINCOME", data=subaa14_1).fit()
print(modeldrinkincome.summary())
sb.regplot(y="DRINKMO", x="ADJPERSINCOME", data=subaa14_1)
plt.xlabel("Income Adjusted Around Mean")
plt.ylabel("Number of drinks/month")
                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.8429
Date:                Sun, 10 Mar 2019   Prob (F-statistic):              0.359
Time:                        12:33:21   Log-Likelihood:                -49018.
No. Observations:               10208   AIC:                         9.804e+04
Df Residuals:                   10206   BIC:                         9.805e+04
Df Model:                           1
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        18.3495      0.292     62.930      0.000      17.778      18.921
ADJPERSINCOME -1.445e-05   1.57e-05     -0.918      0.359   -4.53e-05    1.64e-05
==============================================================================
Omnibus:                    10113.534   Durbin-Watson:                   1.947
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           701087.094
Skew:                           4.793   Prob(JB):                         0.00
Kurtosis:                      42.451   Cond. No.                     1.85e+04
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.85e+04. This might indicate that there are strong multicollinearity or other numerical problems.
[Figure: regression plot of drinks/month vs mean-centered income, restricted sample]
With this adjustment, the p-value is now 0.359, so it is not below 0.05. The slope is now negative, and in general there really seems to be no strong correlation, consistent with how weak the relationship looked before.
Based upon this data, I’m going to assume that income isn’t too much of a confounding variable from my initial dataset.
Writing About Your Data
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
1) My sample:
The NESARC data set was collected in 2001-2002 from people in the US who were 18 years or older and non-institutionalized. There were 43,093 respondents.
Specifically, I’m looking at a subset of this data for people with no history of alcohol abuse or dependence. 
The level of analysis is the group level (either those with vs. without a family history of alcohol abuse/dependence, or subcategories of family history of alcoholism).
For those with no history of alcohol abuse or dependence, there are 31250 observations.
2) The data collection procedure:
The NESARC data was gathered by a survey.
The data were collected in diagnostic interviews, with questions typically asked over about one hour. Diagnoses of disorders were made according to the DSM-IV. The survey did not allow skip-outs of questions, even when a subject had already answered enough questions to confirm a diagnosis. It asked about symptoms in the last 12 months or prior to the last 12 months to identify instances of full or partial remission. There was also a test-retest component, in which a subset (~400) of the respondents were asked to re-do the survey with a different interviewer.
The survey was designed this way because no one had done such an extensive survey before. In addition to the larger sample size, many previous studies had diagnosed people with certain disorders using outdated or questionable methods.
The data were collected from 2001-2002.
The respondents were located in the United States (including Alaska and Hawaii) and the District of Columbia.
3) My variables and how I managed them to address my research question:
My explanatory variable is family history of alcoholism. My response variable is how many drinks/month.
In terms of response scales, most of my variables are quantitative (running counts), yes/no (1 or 0), or categorical.
I’ve used “How often have you drank alcohol in the last 12 months” (S2AQ8A) and the “Number of drinks on days when drinking” (S2AQ8B) to make a drinks/month (DRINKMO) category to get an idea of how much alcohol is actually consumed.
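Schematically, the recode looked something like the sketch below. Note that the frequency-to-days mapping here is a made-up placeholder for illustration and not the actual NESARC codebook values:

#placeholder mapping from S2AQ8A frequency codes to approximate drinking days per month
#(illustrative values only; the real recode follows the NESARC codebook)
freq_to_days = {1: 30, 2: 25, 3: 15, 4: 8, 5: 4, 6: 2, 7: 1, 8: 0.5, 9: 0.25, 10: 0.1}
drink_days = pd.to_numeric(subaa1["S2AQ8A"], errors="coerce").map(freq_to_days)
subaa1["DRINKMO"] = drink_days * pd.to_numeric(subaa1["S2AQ8B"], errors="coerce")  #drinking days/month * drinks per drinking day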
I also eliminated people who have been previously diagnosed with alcohol abuse or dependence from my data set.
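The exclusion step was schematically something like the line below, where "nesarc", "ALCABUSE_DX", and "ALCDEP_DX" are hypothetical stand-ins for the actual dataframe and diagnosis variable names:

#hypothetical names -- stand-ins for the real NESARC abuse/dependence diagnosis columns
subaa1 = nesarc[(nesarc["ALCABUSE_DX"] == 0) & (nesarc["ALCDEP_DX"] == 0)].copy()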
I have tried to subset family relationships into parents and more distant relatives (grandparents, uncles, aunts), or some combination. I think it might be advisable to generalize this category a bit more in the future or do some different type of subsetting, but I haven’t decided how to do it yet.
I tried to look at how age affected these results in the previous lesson by dividing age into categories of younger than 25, between 25 and 50, and older than 50. This did not have much effect on the data.
It might also be interesting to divide my data into people who drink hard alcohol (some sort of liquor) vs. people who drink things with a lower alcohol content (like beer and wine). Obviously, these drinks affect you differently, so it would be interesting to see whether my trends from the general data set hold there (from my previous data, there really isn't a big difference between those with no family history of alcoholism and those with a family history, EXCEPT in the case of two alcoholic parents, though even there the error bar seems suspiciously large).
Other potential confounding factors could be ethnicity, income, and sex.
Week 4: Moderators
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
I want to see if age group could act as a moderator for the relationship between family history of alcoholism and drinks/month.
subaa1 is a subset of the data that I’ve previously made that contains only those who have not displayed any history of alcohol abuse or dependence.
All code will be in block text. Code is italicized, while the output is not.
subaa1["AGE"]=2002-subaa1["DOBY"]
subaa1["AGE"].head(n=10)
subaa13=subaa1[["DRINKMO", "AAFAM2", "AGE"]].dropna()
subaa13.head(n=25)
subaa1_25=subaa13[(subaa13["AGE"]<=25)]
subaa1_50=subaa13[(subaa13["AGE"]>25) & (subaa13["AGE"]<=50)]
subaa1_old=subaa13[(subaa13["AGE"]>50)]

#25 and under
print("association between family history of alcoholism and drinks/month for those 25 and under")
aafammod25 = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_25).fit()
print(aafammod25.summary())

mc_aafam25 = multi.MultiComparison(subaa1_25["DRINKMO"], subaa1_25["AAFAM2"])
res_aafam25 = mc_aafam25.tukeyhsd() #Request the test
print(res_aafam25.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_25, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_25.groupby("AAFAM2").mean())
print(subaa1_25.groupby("AAFAM2").std())
#26-50
print("association between family history of alcoholism and drinks/month for those from 26-50")
aafammod50 = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_50).fit()
print(aafammod50.summary())

mc_aafam50 = multi.MultiComparison(subaa1_50["DRINKMO"], subaa1_50["AAFAM2"])
res_aafam50 = mc_aafam50.tukeyhsd() #Request the test
print(res_aafam50.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_50, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_50.groupby("AAFAM2").mean())
print(subaa1_50.groupby("AAFAM2").std())
#50+
print("association between family history of alcoholism and drinks/month for those over 50")
aafammodold = smf.ols(formula="DRINKMO~C(AAFAM2)", data=subaa1_old).fit()
print(aafammodold.summary())

mc_aafamold = multi.MultiComparison(subaa1_old["DRINKMO"], subaa1_old["AAFAM2"])
res_aafamold = mc_aafamold.tukeyhsd() #Request the test
print(res_aafamold.summary())

sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1_old, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)

print(subaa1_old.groupby("AAFAM2").mean())
print(subaa1_old.groupby("AAFAM2").std())
association between family history of alcoholism and drinks/month for those 25 and under

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     2.115
Date:                Thu, 21 Feb 2019   Prob (F-statistic):             0.0485
Time:                        21:51:49   Log-Likelihood:                -16800.
No. Observations:                3724   AIC:                         3.361e+04
Df Residuals:                    3717   BIC:                         3.366e+04
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      7.0355      1.604      4.387      0.000       3.891      10.180
C(AAFAM2)[T.2 Par]             2.8673      5.072      0.565      0.572      -7.077      12.811
C(AAFAM2)[T.1 ExtRel]         -1.7358      1.858     -0.934      0.350      -5.379       1.908
C(AAFAM2)[T.>1 ExtRel]         0.2085      1.916      0.109      0.913      -3.547       3.964
C(AAFAM2)[T.1 Par, ExtRel]     2.4260      1.919      1.264      0.206      -1.336       6.188
C(AAFAM2)[T.2 Par, ExtRel]     1.4316      2.995      0.478      0.633      -4.440       7.303
C(AAFAM2)[T.None known]       -1.1260      1.678     -0.671      0.502      -4.415       2.163
==============================================================================
Omnibus:                     5663.796   Durbin-Watson:                   1.950
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3077895.731
Skew:                           9.325   Prob(JB):                         0.00
Kurtosis:                     142.600   Cond. No.                         17.6
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            1.7358  -3.7459   7.2174  False
1 ExtRel        1 Par, ExtRel    4.1617  -0.0004   8.3239  False
1 ExtRel        2 Par            4.603   -9.8582  19.0642  False
1 ExtRel        2 Par, ExtRel    3.1674  -4.7907  11.1255  False
1 ExtRel        >1 ExtRel        1.9442  -2.2048   6.0933  False
1 ExtRel        None known       0.6098  -2.5166   3.7362  False
1 Par           1 Par, ExtRel    2.426   -3.2347   8.0867  False
1 Par           2 Par            2.8673 -12.0942  17.8288  False
1 Par           2 Par, ExtRel    1.4316  -7.4031  10.2663  False
1 Par           >1 ExtRel        0.2085  -5.4426   5.8596  False
1 Par           None known      -1.126   -6.0752   3.8233  False
1 Par, ExtRel   2 Par            0.4413 -14.0887  14.9713  False
1 Par, ExtRel   2 Par, ExtRel   -0.9944  -9.0768   7.0881  False
1 Par, ExtRel   >1 ExtRel       -2.2175  -6.6003   2.1653  False
1 Par, ExtRel   None known      -3.5519  -6.9826  -0.1213  True
2 Par           2 Par, ExtRel   -1.4357 -17.4709  14.5996  False
2 Par           >1 ExtRel       -2.6588 -17.1851  11.8675  False
2 Par           None known      -3.9932 -18.2611  10.2746  False
2 Par, ExtRel   >1 ExtRel       -1.2231  -9.2988   6.8526  False
2 Par, ExtRel   None known      -2.5576 -10.1587   5.0436  False
>1 ExtRel       None known      -1.3344  -4.7491   2.0803  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (25 and under):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               7.035494    18.178809  22.142857  2.157548
2 Par               9.902778    15.561367  21.523810  1.721019
1 ExtRel            5.299743    16.643207  22.070652  1.956999
>1 ExtRel           7.243980    24.958553  21.936795  1.949725
1 Par, ExtRel       9.461473    31.667958  22.086758  1.995821
2 Par, ExtRel       8.467105    19.649860  22.131579  2.022158
None known          5.909539    20.534125  21.992020  2.010932

association between family history of alcoholism and drinks/month for those from 26-50

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     3.404
Date:                Thu, 21 Feb 2019   Prob (F-statistic):            0.00234
Time:                        21:51:50   Log-Likelihood:                -61113.
No. Observations:               13914   AIC:                         1.222e+05
Df Residuals:                   13907   BIC:                         1.223e+05
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      8.3744      0.672     12.453      0.000       7.056       9.693
C(AAFAM2)[T.2 Par]             6.3325      2.214      2.860      0.004       1.993      10.672
C(AAFAM2)[T.1 ExtRel]         -0.9536      0.808     -1.180      0.238      -2.537       0.630
C(AAFAM2)[T.>1 ExtRel]        -1.9459      0.870     -2.237      0.025      -3.651      -0.241
C(AAFAM2)[T.1 Par, ExtRel]    -1.6592      0.838     -1.980      0.048      -3.302      -0.016
C(AAFAM2)[T.2 Par, ExtRel]    -1.7858      1.472     -1.213      0.225      -4.672       1.100
C(AAFAM2)[T.None known]       -1.5310      0.707     -2.166      0.030      -2.917      -0.145
==============================================================================
Omnibus:                    17507.488   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3536212.632
Skew:                           6.858   Prob(JB):                         0.00
Kurtosis:                      79.886   Cond. No.                         16.8
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            0.9536  -1.4286   3.3358  False
1 ExtRel        1 Par, ExtRel   -0.7055  -2.6847   1.2736  False
1 ExtRel        2 Par            7.2861   0.9281  13.6441  True
1 ExtRel        2 Par, ExtRel   -0.8322  -4.9139   3.2495  False
1 ExtRel        >1 ExtRel       -0.9923  -3.0868   1.1022  False
1 ExtRel        None known      -0.5774  -2.0455   0.8906  False
1 Par           1 Par, ExtRel   -1.6592  -4.1303   0.812   False
1 Par           2 Par            6.3325  -0.1955  12.8604  False
1 Par           2 Par, ExtRel   -1.7858  -6.1275   2.5558  False
1 Par           >1 ExtRel       -1.9459  -4.5104   0.6186  False
1 Par           None known      -1.531   -3.6155   0.5534  False
1 Par, ExtRel   2 Par            7.9916   1.5997  14.3835  True
1 Par, ExtRel   2 Par, ExtRel   -0.1267  -4.2609   4.0076  False
1 Par, ExtRel   >1 ExtRel       -0.2867  -2.4819   1.9084  False
1 Par, ExtRel   None known       0.1281  -1.4803   1.7365  False
2 Par           2 Par, ExtRel   -8.1183 -15.4395  -0.7971  True
2 Par           >1 ExtRel       -8.2784 -14.7069  -1.8498  True
2 Par           None known      -7.8635 -14.1161  -1.611   True
2 Par, ExtRel   >1 ExtRel       -0.16    -4.3508   4.0307  False
2 Par, ExtRel   None known       0.2548  -3.6606   4.1702  False
>1 ExtRel       None known       0.4148  -1.3336   2.1633  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (26-50):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               8.374409    24.112929  38.841608  6.921776
2 Par              14.706880    41.859383  38.313953  7.081478
1 ExtRel            7.420792    21.230342  37.931378  7.083139
>1 ExtRel           6.428524    15.040908  37.107313  7.103761
1 Par, ExtRel       6.715251    19.075674  37.884314  6.972548
2 Par, ExtRel       6.588565    20.678330  38.354260  6.942172
None known          6.843365    18.917645  38.387373  6.847600

association between family history of alcoholism and drinks/month for those over 50

                            OLS Regression Results
==============================================================================
Dep. Variable:                DRINKMO   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     3.216
Date:                Thu, 21 Feb 2019   Prob (F-statistic):            0.00371
Time:                        21:51:50   Log-Likelihood:                -57808.
No. Observations:               13358   AIC:                         1.156e+05
Df Residuals:                   13351   BIC:                         1.157e+05
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      7.6742      0.638     12.021      0.000       6.423       8.926
C(AAFAM2)[T.2 Par]             2.6966      2.471      1.091      0.275      -2.147       7.540
C(AAFAM2)[T.1 ExtRel]         -0.8301      0.777     -1.068      0.286      -2.354       0.694
C(AAFAM2)[T.>1 ExtRel]        -1.7252      0.940     -1.836      0.066      -3.567       0.117
C(AAFAM2)[T.1 Par, ExtRel]    -0.8896      0.897     -0.992      0.321      -2.648       0.869
C(AAFAM2)[T.2 Par, ExtRel]     2.0869      1.869      1.117      0.264      -1.576       5.750
C(AAFAM2)[T.None known]       -1.8438      0.667     -2.766      0.006      -3.151      -0.537
==============================================================================
Omnibus:                    16026.030   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2682859.110
Skew:                           6.338   Prob(JB):                         0.00
Kurtosis:                      71.261   Cond. No.                         19.9
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
---------------------------------------------------------------
group1          group2         meandiff    lower    upper  reject
---------------------------------------------------------------
1 ExtRel        1 Par            0.8301  -1.4621   3.1223  False
1 ExtRel        1 Par, ExtRel   -0.0595  -2.3315   2.2125  False
1 ExtRel        2 Par            3.5267  -3.633   10.6864  False
1 ExtRel        2 Par, ExtRel    2.917   -2.4245   8.2585  False
1 ExtRel        >1 ExtRel       -0.8951  -3.3128   1.5227  False
1 ExtRel        None known      -1.0137  -2.4391   0.4117  False
1 Par           1 Par, ExtRel   -0.8896  -3.5344   1.7552  False
1 Par           2 Par            2.6966  -4.59     9.9831  False
1 Par           2 Par, ExtRel    2.0869  -3.4235   7.5973  False
1 Par           >1 ExtRel       -1.7252  -4.4962   1.0459  False
1 Par           None known      -1.8438  -3.8097   0.1221  False
1 Par, ExtRel   2 Par            3.5862  -3.6941  10.8664  False
1 Par, ExtRel   2 Par, ExtRel    2.9765  -2.5255   8.4785  False
1 Par, ExtRel   >1 ExtRel       -0.8356  -3.5899   1.9188  False
1 Par, ExtRel   None known      -0.9542  -2.8966   0.9882  False
2 Par           2 Par, ExtRel   -0.6097  -9.3487   8.1294  False
2 Par           >1 ExtRel       -4.4217 -11.7488   2.9053  False
2 Par           None known      -4.5404 -11.6024   2.5216  False
2 Par, ExtRel   >1 ExtRel       -3.8121  -9.3759   1.7517  False
2 Par, ExtRel   None known      -3.9307  -9.1405   1.2791  False
>1 ExtRel       None known      -0.1186  -2.2296   1.9923  False
---------------------------------------------------------------

Group means and standard deviations by AAFAM2 (over 50):
AAFAM2          DRINKMO mean  DRINKMO std   AGE mean   AGE std
1 Par               7.674192    21.722699  65.704242 10.561760
2 Par              10.370763    29.720298  63.050847 11.193323
1 ExtRel            6.844085    18.454500  66.106495 10.870987
>1 ExtRel           5.949022    15.015176  63.831683 10.051776
1 Par, ExtRel       6.784583    20.623544  61.682409  8.663090
2 Par, ExtRel       9.761086    37.674648  60.706422  8.068346
None known          5.830390    17.524113  68.811470 11.430535
[Figure: bar plot of average drinks/month by family alcoholism category]
25 and under
[Figure: bar plot of average drinks/month by family alcoholism category]
26-50
[Figure: bar plot of average drinks/month by family alcoholism category]
51 and over
Subdividing the data by age does change the scenarios in which we can reject the null hypothesis that there is no relationship between family history of alcoholism and average drinks/month.
For those 25 and under, the only comparison where the null hypothesis is rejected is between those with one alcoholic parent plus alcoholic extended family and those with no known family history of alcoholism.
For those from 26-50, the null hypothesis is rejected between those with 2 alcoholic parents and every other group except those with 1 alcoholic parent.
For those over 50, despite the overall p-value being less than 0.05, the Tukey post hoc test does not find any pair of groups for which the null hypothesis can be rejected. This baffled me, but I found an explanation here.
To quote:
“It is possible that the overall mean of group A and group B combined differs significantly from the combined mean of groups C, D and E. Perhaps the mean of group A differs from the mean of groups B through E. Scheffe's post test detects differences like these (but this test is not offered by GraphPad InStat or Prism). If the overall ANOVA P value is less than 0.05, then Scheffe's test will definitely find a significant difference somewhere (if you look at the right comparison, also called contrast). The multiple comaprisons tests offered by GraphPad InStat and Prism only compare group means, and it is quite possible for the overall ANOVA to reject the null hypothesis that all group means are the same yet for the post test to find no significant difference among group means.”
This is likely what happened in this sample.
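As an aside, an alternative to splitting the data by age group is to test the moderation directly with an interaction term and look at whether the interaction coefficients are significant. This is not what I did above, just a sketch of that other approach:

import pandas as pd
import statsmodels.formula.api as smf

#sketch: age group as a moderator via an interaction instead of stratifying
subaa13["AGEGROUP"] = pd.cut(subaa13["AGE"], bins=[0, 25, 50, 200], labels=["25 and under", "26-50", "over 50"])
mod_int = smf.ols("DRINKMO ~ C(AAFAM2) * C(AGEGROUP)", data=subaa13).fit()
print(mod_int.summary())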
Correlation Coefficients
Primary Research Question: Is there an association between family history of alcoholism and the rate (# drinks/week) of alcohol consumption among people who have never exhibited alcohol abuse or dependence?
Secondary Research Question: Does the closeness of the relationship affect this association? (Parent vs. more distant relation)
Truthfully, my data doesn't lend itself to two quantitative variables very well, so I'm going a bit off-topic and looking at the relationship between age and drinks/month.
All Python is blockquoted. Code is italicized. Output is not.
subaa1_dd=subaa1[["S2AQ8B", "DOBY", "DRINKMO"]].dropna()
max(subaa1_dd["DOBY"])
subaa1_dd["AGE"]=2002-subaa1_dd["DOBY"] #Study was published in 2002
subaa1_dd["AGE"].head(n=25)
plt.scatter(y="DRINKMO", x="AGE", data=subaa1_dd)
plt.xlabel("Age")
plt.ylabel("Drinks/month")
print("Association between number of number of drinks/month and age") print(sst.pearsonr(subaa1_dd["AGE"], subaa1_dd["DRINKMO"]))
Association between number of drinks/month and age
(-0.025620117177638197, 6.447570986056389e-06)
[Image: scatter plot of drinks/month vs. age]
The p-value (6.45e-6) is small enough to suggest that the relationship is statistically significant. The correlation coefficient (r=-0.0256) is very close to 0, indicating a very weak negative correlation: as age increases, the number of drinks/month decreases slightly, though the effect is almost negligible.
If I square r, I get r^2 = 0.000656, meaning age accounts for less than 0.1% of the variability in drinks/month, so neither variable is a useful predictor of the other.
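For completeness, a minimal sketch of the r-to-r^2 step, reusing subaa1_dd and the sst (scipy.stats) alias from the code above:

# Pearson r and the proportion of shared variance (r squared)
r, p_value = sst.pearsonr(subaa1_dd["AGE"], subaa1_dd["DRINKMO"])
r_squared = r ** 2  # fraction of variability in drinks/month associated with age

print("r =", round(r, 4), " p =", p_value, " r^2 =", round(r_squared, 6))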
0 notes
kpenlearnspython-blog · 6 years ago
Text
Chi Square Test
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
2/11/19 edits: I realized after posting yesterday that I had misunderstood the Bonferroni adjustment. I’m correcting it now; the conclusions in the original posting for this assignment may have been incorrect, and the previously wrong assessments are marked below.
I tried to blockquote all Python. Written code is italicized. Printed code is not.
To be honest, my data isn’t great for this sort of testing, so I went a little outside my hypothesis and looked at whether there was any trend toward increased abstinence from alcohol for those who have alcoholism in their family (though they have not themselves been diagnosed with any sort of alcohol abuse/dependence).
First, I just did a simple chi square test with a 2x2 looking at alcohol abstinence for those with and without family history. 
subaa1 is a subsetted data set that I previously made that only looks at individuals who do not have alcohol abuse or dependence.
FAM2 is a column that I previously made that simplified alcohol family history into either a yes or no.
subaa1["S2AQ1"]=subaa1["S2AQ1"].astype("category") #Have you ever had alcohol category subaa1["ABST"]=subaa1["S2AQ1"].cat.rename_categories(["Drinks", "Abstains"]) ct1=pd.crosstab(subaa1["ABST"], subaa1["FAM2"]) #categorical variables print(ct1) #get counts colsum=ct1.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct=ct1/colsum print(colpct)
print("chi-square value, p value, expected counts") cs1=sst.chi2_contingency(ct1) print(cs1)
FAM2      Family History  No Family History
ABST
Abstains            2380               5886
Drinks              9494              13490

FAM2      Family History  No Family History
ABST
Abstains        0.200438           0.303778
Drinks          0.799562           0.696222

chi-square value, p value, expected counts
(403.6040870866473, 9.04438165422264e-90, 1, array([[ 3140.815488,  5125.184512],
       [ 8733.184512, 14250.815488]]))
The chi-square value (403.6) is much greater than the critical value of 3.84, and the p-value (9.0e-90) is much less than 0.05, so I can reject the null hypothesis that there is no association between family history of alcoholism and whether a person drinks or abstains. From the table, those with a family history of alcoholism appear more likely to drink than expected (79.9% observed vs 73.5% expected).
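As a sanity check on those two percentages, here is a minimal sketch (reusing ct1 and cs1 from the code above, with pandas as pd) of how the observed and expected column percentages can be pulled out of the chi2_contingency result:

# Observed column percentages (same as colpct printed above)
observed_pct = ct1 / ct1.sum(axis=0)

# Expected counts under independence are the 4th element of the chi2_contingency result
expected = pd.DataFrame(cs1[3], index=ct1.index, columns=ct1.columns)
expected_pct = expected / expected.sum(axis=0)

print(observed_pct.loc["Drinks", "Family History"])   # ~0.80 observed
print(expected_pct.loc["Drinks", "Family History"])   # ~0.74 expected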
To do a post hoc test, I decided to look at my categories for family history with alcoholism (1 parent, 2 parents, 1 extended relative, >1 extended relative, 1 parent+extended relatives, 2 parents + extended relative, no alcoholic family known) in relation to drinking vs abstaining.
ct2=pd.crosstab(subaa1["ABST"], subaa1["AAFAM2"]) #categorical variables
print(ct2) #get counts
colsum2=ct2.sum(axis=0) #use counts from crosstab table. axis=0 says to sum all values in each column
#axis=0 means columns. axis=1 means rows
colpct2=ct2/colsum2
print(colpct2)
#7 degrees of freedom
print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2) print("Expected chi-square for 7 degrees of freedom is 14.07.") print("Corrected p-value for 20 comparisons") 0.05/20
AAFAM2    1 Par  2 Par  1 ExtRel  >1 ExtRel  1 Par, ExtRel  2 Par, ExtRel  \
ABST
Abstains    434     36       896        467            479             68
Drinks     1437    132      3285       1953           2345            342

AAFAM2    None known
ABST
Abstains        5886
Drinks         13490

AAFAM2      1 Par    2 Par  1 ExtRel  >1 ExtRel  1 Par, ExtRel  2 Par, ExtRel  \
ABST
Abstains 0.231962 0.214286  0.214303   0.192975       0.169618       0.165854
Drinks   0.768038 0.785714  0.785697   0.807025       0.830382       0.834146

AAFAM2    None known
ABST
Abstains    0.303778
Drinks      0.696222

chi-square value, p value, expected counts
(434.99112479242615, 8.331856012499931e-91, 6, array([[  494.901952,    44.438016,  1105.924672,   640.11904 ,
          746.981888,   108.44992 ,  5125.184512],
       [ 1376.098048,   123.561984,  3075.075328,  1779.88096 ,
         2077.018112,   301.55008 , 14250.815488]]))
Expected chi-square for 7 degrees of freedom is 14.07.
Corrected p-value for 20 comparisons
Out[39]: 0.0025
I had counted 7 degrees of freedom, but the chi2_contingency output actually reports 6 (a 2x7 table gives (2-1)x(7-1) = 6), for which the critical chi-square value at alpha = 0.05 is about 12.59 according to Google. Either way, since I got 434.99, I can safely reject the null hypothesis that there is no association between which family members had alcoholism and whether a person abstains from alcohol.
I then did post hoc analysis to find the groups with significant differences, comparing each pairwise chi-square p-value against the Bonferroni-adjusted threshold of 0.05/20 = 0.0025. (Strictly speaking, 7 groups give 21 pairwise comparisons, so the threshold would be 0.05/21, roughly 0.0024; that does not change any of the conclusions.)
This is about 20 comparisons, so I will summarize them here rather than walk through each one. For the pairs listed below, the uncorrected chi-square test was significant; the ones that do not survive the Bonferroni correction are marked in parentheses, and for the rest I can reject the null hypothesis that there is no difference in alcohol abstinence between the two family-history groups. (A more compact, loop-based way to run all of these pairwise tests is sketched after the list.)
1 Parent vs. >1 Extended Relative (p = 0.0021)
1 Parent vs. 1 Parent + Extended Relatives (p = 1.5e-07)
1 Parent vs. 2 Parents + Extended Relatives (p = 0.0042; significant at 0.05 but not after the Bonferroni correction)
1 Parent vs. None known (p = 1.0e-10)
2 Parents vs. None known (p = 0.015; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. >1 Extended Relative (p = 0.042; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. 1 Parent + Extended Relatives (p = 4.5e-06)
1 Extended Relative vs. 2 Parents + Extended Relatives (p = 0.025; significant at 0.05 but not after the Bonferroni correction)
1 Extended Relative vs. None known (p = 5.9e-31)
1 Parent + Extended Relatives vs. >1 Extended Relative (p = 0.031; significant at 0.05 but not after the Bonferroni correction)
>1 Extended Relative vs. None known (p = 1.6e-29)
1 Parent + Extended Relatives vs. None known (p = 5.9e-49)
2 Parents + Extended Relatives vs. None known (p = 2.4e-09)
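As an aside, the 20-odd hand-written blocks under the Read More could be generated in a loop instead. This is just a sketch, assuming pandas as pd and scipy.stats as sst are already imported and reusing the subaa1, ABST, and AAFAM2 variables from above:

from itertools import combinations

groups = ["1 Par", "2 Par", "1 ExtRel", ">1 ExtRel",
          "1 Par, ExtRel", "2 Par, ExtRel", "None known"]
alpha_bonferroni = 0.05 / len(list(combinations(groups, 2)))  # 21 pairs, so about 0.0024

for g1, g2 in combinations(groups, 2):
    pair = subaa1[subaa1["AAFAM2"].isin([g1, g2])]            # keep only these two groups
    ct = pd.crosstab(pair["ABST"], pair["AAFAM2"].astype(str))
    chi2, p, dof, expected = sst.chi2_contingency(ct)
    flag = "significant" if p < alpha_bonferroni else "not significant"
    print(g1, "vs", g2, ": chi2 =", round(chi2, 2), ", p =", p, "(", flag, ")")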
FAMCOMPv1  1 Par  2 Par ABST                   Abstains     434     36 Drinks      1437    132 FAMCOMPv1    1 Par    2 Par ABST                       Abstains  0.231962 0.214286 Drinks    0.768038 0.785714 chi-square value, p value, expected counts (0.18103193164993958, 0.6704879106105417, 1, array([[ 431.27513487,   38.72486513],       [1439.72486513,  129.27513487]])) FAMCOMPv2  1 ExtRel  1 Par ABST                       Abstains        896    434 Drinks         3285   1437 FAMCOMPv2  1 ExtRel    1 Par ABST                         Abstains   0.214303 0.231962 Drinks     0.785697 0.768038 chi-square value, p value, expected counts (2.2488225420879204, 0.13371611345920698, 1, array([[ 918.82518176,  411.17481824],       [3262.17481824, 1459.82518176]])) FAMCOMPv3  1 Par  >1 ExtRel ABST                       Abstains     434        467 Drinks      1437       1953 FAMCOMPv3    1 Par  >1 ExtRel ABST                         Abstains  0.231962   0.192975 Drinks    0.768038   0.807025 chi-square value, p value, expected counts (9.434649397276742, 0.002129237910827844, 1, array([[ 392.86203682,  508.13796318],       [1478.13796318, 1911.86203682]])) FAMCOMPv4  1 Par  1 Par, ExtRel ABST                           Abstains     434            479 Drinks      1437           2345 FAMCOMPv4    1 Par  1 Par, ExtRel ABST                             Abstains  0.231962       0.169618 Drinks    0.768038       0.830382 chi-square value, p value, expected counts (27.52696632852945, 1.5491937328656087e-07, 1, array([[ 363.83876464,  549.16123536],       [1507.16123536, 2274.83876464]])) FAMCOMPv5  1 Par  2 Par, ExtRel ABST                           Abstains     434             68 Drinks      1437            342 FAMCOMPv5    1 Par  2 Par, ExtRel ABST                             Abstains  0.231962       0.165854 Drinks    0.768038       0.834146 chi-square value, p value, expected counts (8.181860997484213, 0.004231133032437819, 1, array([[ 411.76764577,   90.23235423],       [1459.23235423,  319.76764577]])) FAMCOMPv6  1 Par  None known ABST                         Abstains     434        5886 Drinks      1437       13490 FAMCOMPv6    1 Par  None known ABST                           Abstains  0.231962    0.303778 Drinks    0.768038    0.696222 chi-square value, p value, expected counts (41.76775439560775, 1.027846499574022e-10, 1, array([[  556.53598155,  5763.46401845],       [ 1314.46401845, 13612.53598155]])) FAMCOMPv7  2 Par  None known ABST                         Abstains      36        5886 Drinks       132       13490 FAMCOMPv7    2 Par  None known ABST                           Abstains  0.214286    0.303778 Drinks    0.785714    0.696222 chi-square value, p value, expected counts (5.899442601003834, 0.015145677174912012, 1, array([[   50.90544413,  5871.09455587],       [  117.09455587, 13504.90544413]])) FAMCOMPv8  2 Par  2 Par, ExtRel ABST                           Abstains      36             68 Drinks       132            342 FAMCOMPv8    2 Par  2 Par, ExtRel ABST                             Abstains  0.214286       0.165854 Drinks    0.785714       0.834146 chi-square value, p value, expected counts (1.5804032985363066, 0.20870261116515335, 1, array([[ 30.2283737,  73.7716263],       [137.7716263, 336.2283737]])) FAMCOMPv9  1 Par, ExtRel  2 Par ABST                           Abstains             479     36 Drinks              2345    132 FAMCOMPv9  1 Par, ExtRel    2 Par ABST                             Abstains        0.169618 0.214286 Drinks          0.830382 0.785714 chi-square value, p value, 
expected counts (1.9178315136173767, 0.16609591015709457, 1, array([[ 486.0828877,   28.9171123],       [2337.9171123,  139.0828877]])) FAMCOMPv10  2 Par  >1 ExtRel ABST                         Abstains       36        467 Drinks        132       1953 FAMCOMPv10    2 Par  >1 ExtRel ABST                           Abstains   0.214286   0.192975 Drinks     0.785714   0.807025 chi-square value, p value, expected counts (0.32968603816180103, 0.5658440186807167, 1, array([[  32.65224111,  470.34775889],       [ 135.34775889, 1949.65224111]])) FAMCOMPv11  1 ExtRel  2 Par ABST                       Abstains         896     36 Drinks          3285    132 FAMCOMPv11  1 ExtRel    2 Par ABST                         Abstains    0.214303 0.214286 Drinks      0.785697 0.785714 chi-square value, p value, expected counts (0.009091829850638621, 0.9240359656620418, 1, array([[ 895.99724074,   36.00275926],       [3285.00275926,  131.99724074]])) FAMCOMPv12  1 ExtRel  >1 ExtRel ABST                           Abstains         896        467 Drinks          3285       1953 FAMCOMPv12  1 ExtRel  >1 ExtRel ABST                           Abstains    0.214303   0.192975 Drinks      0.785697   0.807025 chi-square value, p value, expected counts (4.126102965534052, 0.04222648604697947, 1, array([[ 863.30904408,  499.69095592],       [3317.69095592, 1920.30904408]])) FAMCOMPv13  1 ExtRel  1 Par, ExtRel ABST                               Abstains         896            479 Drinks          3285           2345 FAMCOMPv13  1 ExtRel  1 Par, ExtRel ABST                               Abstains    0.214303       0.169618 Drinks      0.785697       0.830382 chi-square value, p value, expected counts (21.051577696848984, 4.470845718926547e-06, 1, array([[ 820.68165596,  554.31834404],       [3360.31834404, 2269.68165596]])) FAMCOMPv14  1 ExtRel  2 Par, ExtRel ABST                               Abstains         896             68 Drinks          3285            342 FAMCOMPv14  1 ExtRel  2 Par, ExtRel ABST                               Abstains    0.214303       0.165854 Drinks      0.785697       0.834146 chi-square value, p value, expected counts (4.995438947932991, 0.02541420669226183, 1, array([[ 877.90982357,   86.09017643],       [3303.09017643,  323.90982357]])) FAMCOMPv15  1 ExtRel  None known ABST                             Abstains         896        5886 Drinks          3285       13490 FAMCOMPv15  1 ExtRel  None known ABST                             Abstains    0.214303    0.303778 Drinks      0.785697    0.696222 chi-square value, p value, expected counts (133.85527554859885, 5.876686334055551e-31, 1, array([[ 1203.69919769,  5578.30080231],       [ 2977.30080231, 13797.69919769]])) FAMCOMPv15  1 Par, ExtRel  >1 ExtRel ABST                                 Abstains              479        467 Drinks               2345       1953 FAMCOMPv15  1 Par, ExtRel  >1 ExtRel ABST                                 Abstains         0.169618   0.192975 Drinks           0.830382   0.807025 chi-square value, p value, expected counts (4.652191417597543, 0.03101393014512997, 1, array([[ 509.44012204,  436.55987796],       [2314.55987796, 1983.44012204]])) FAMCOMPv16  2 Par, ExtRel  >1 ExtRel ABST                                 Abstains               68        467 Drinks                342       1953 FAMCOMPv16  2 Par, ExtRel  >1 ExtRel ABST                                 Abstains         0.165854   0.192975 Drinks           0.834146   0.807025 chi-square value, p value, expected counts (1.5099437664564455, 0.2191476673124011, 1, 
array([[  77.50883392,  457.49116608],       [ 332.49116608, 1962.50883392]])) FAMCOMPv17  >1 ExtRel  None known ABST                             Abstains          467        5886 Drinks           1953       13490 FAMCOMPv17  >1 ExtRel  None known ABST                             Abstains     0.192975    0.303778 Drinks       0.807025    0.696222 chi-square value, p value, expected counts (127.35685258994046, 1.5520087618161547e-29, 1, array([[  705.37071022,  5647.62928978],       [ 1714.62928978, 13728.37071022]])) FAMCOMPv18  1 Par, ExtRel  None known ABST                                 Abstains              479        5886 Drinks               2345       13490 FAMCOMPv18  1 Par, ExtRel  None known ABST                                 Abstains         0.169618    0.303778 Drinks           0.830382    0.696222 chi-square value, p value, expected counts (216.27137431045233, 5.884680839320336e-49, 1, array([[  809.67387387,  5555.32612613],       [ 2014.32612613, 13820.67387387]])) FAMCOMPv19  1 Par, ExtRel  2 Par, ExtRel ABST                                     Abstains              479             68 Drinks               2345            342 FAMCOMPv19  1 Par, ExtRel  2 Par, ExtRel ABST                                     Abstains         0.169618       0.165854 Drinks           0.830382       0.834146 chi-square value, p value, expected counts (0.014277578109831067, 0.9048880969104677, 1, array([[ 477.6524428,   69.3475572],       [2346.3475572,  340.6524428]])) FAMCOMPv20  2 Par, ExtRel  None known ABST                                 Abstains               68        5886 Drinks                342       13490 FAMCOMPv20  2 Par, ExtRel  None known ABST                                 Abstains         0.165854    0.303778 Drinks           0.834146    0.696222 chi-square value, p value, expected counts (35.65456040933068, 2.355957096167399e-09, 1, array([[  123.37713535,  5830.62286465],       [  286.62286465, 13545.37713535]]))
Full code below if you click “Read More”. It’s a lot and it’s repetitive.
recode2={"1 Par":"1 Par", "2 Par":"2 Par"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv1']=subaa1['AAFAM2'].map(recode2) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv1"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode3={"1 Par":"1 Par", "1 ExtRel":"1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv2']=subaa1['AAFAM2'].map(recode3) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv2"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode4={"1 Par":"1 Par", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv3']=subaa1['AAFAM2'].map(recode4) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv3"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode5={"1 Par":"1 Par", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv4']=subaa1['AAFAM2'].map(recode5) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv4"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode6={"1 Par":"1 Par", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv5']=subaa1['AAFAM2'].map(recode6) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv5"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode7={"1 Par":"1 Par", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv6']=subaa1['AAFAM2'].map(recode7) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv6"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode8={"2 Par":"2 Par", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv7']=subaa1['AAFAM2'].map(recode8) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv7"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode9={"2 Par":"2 Par", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv8']=subaa1['AAFAM2'].map(recode9) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv8"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode10={"2 Par":"2 Par", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv9']=subaa1['AAFAM2'].map(recode10) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv9"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode11={"2 Par":"2 Par", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv10']=subaa1['AAFAM2'].map(recode11) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv10"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode12={"2 Par":"2 Par", "1 ExtRel":"1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv11']=subaa1['AAFAM2'].map(recode12) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv11"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode13={"1 ExtRel":"1 ExtRel", ">1 ExtRel":">1 ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv12']=subaa1['AAFAM2'].map(recode13) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv12"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode14={"1 ExtRel":"1 ExtRel", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv13']=subaa1['AAFAM2'].map(recode14) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv13"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode15={"1 ExtRel":"1 ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv14']=subaa1['AAFAM2'].map(recode15) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv14"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode16={"1 ExtRel":"1 ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv15']=subaa1['AAFAM2'].map(recode16) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv15"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode16={">1 ExtRel":">1 ExtRel", "1 Par, ExtRel":"1 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv15']=subaa1['AAFAM2'].map(recode16) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv15"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode17={">1 ExtRel":">1 ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv16']=subaa1['AAFAM2'].map(recode17) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv16"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode18={">1 ExtRel":">1 ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv17']=subaa1['AAFAM2'].map(recode18) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv17"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode19={"1 Par, ExtRel":"1 Par, ExtRel", "None known":"None known"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv18']=subaa1['AAFAM2'].map(recode19) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv18"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode20={"1 Par, ExtRel":"1 Par, ExtRel", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv19']=subaa1['AAFAM2'].map(recode20) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv19"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
recode21={"None known":"None known", "2 Par, ExtRel":"2 Par, ExtRel"} #keeping 2 values but exclude other values in variable subaa1['FAMCOMPv20']=subaa1['AAFAM2'].map(recode21) ct2=pd.crosstab(subaa1["ABST"], subaa1["FAMCOMPv20"]) #categorical variables print(ct2) #get counts colsum2=ct2.sum(axis=0)#use counts from crosstab table. axis=0 says to sum all values in each column #axis=0 means columns. axis=1 means rows colpct2=ct2/colsum2 print(colpct2) print("chi-square value, p value, expected counts") cs2=sst.chi2_contingency(ct2) print(cs2)
0 notes
kpenlearnspython-blog · 6 years ago
Text
Data Analysis Tools: W1
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
I tried to blockquote all Python. Written code is italicized. Printed code is not.
I used the number of drinks per month and the category of family members with alcoholism for my data set. As you can see in the last post, it looked like people with 2 alcoholic parents had more drinks/month, but the variance was high. Previously, I subsetted my data to look only at people who have not displayed signs of alcohol abuse/dependence.
subaa12=subaa1[["DRINKMO", "AAFAM2"]].dropna()
aafammodel = smf.ols(formula="DRINKMO ~ C(AAFAM2)", data=subaa12)
aafamresults=aafammodel.fit()
print(aafamresults.summary())
#Prob (F-statistic) of 3.82e-06 (F = 5.88), so something is significantly different.
mc_aafam = multi.MultiComparison(subaa12["DRINKMO"], subaa12["AAFAM2"])
res_aafam = mc_aafam.tukeyhsd() #Request the test
print(res_aafam.summary())
                           OLS Regression Results                             ============================================================================== Dep. Variable:                DRINKMO   R-squared:                       0.001 Model:                            OLS   Adj. R-squared:                  0.001 Method:                 Least Squares   F-statistic:                     5.882 Date:                Sat, 02 Feb 2019   Prob (F-statistic):           3.82e-06 Time:                        20:30:58   Log-Likelihood:            -1.3584e+05 No. Observations:               30996   AIC:                         2.717e+05 Df Residuals:                   30989   BIC:                         2.718e+05 Df Model:                           6                                         Covariance Type:            nonrobust                                         ==============================================================================================                                 coef    std err          t      P>|t|      [0.025      0.975] ---------------------------------------------------------------------------------------------- Intercept                      7.9278      0.449     17.650      0.000       7.047       8.808 C(AAFAM2)[T.2 Par]             4.6302      1.569      2.951      0.003       1.555       7.706 C(AAFAM2)[T.1 ExtRel]         -1.0241      0.540     -1.896      0.058      -2.083       0.035 C(AAFAM2)[T.>1 ExtRel]        -1.4900      0.598     -2.492      0.013      -2.662      -0.318 C(AAFAM2)[T.1 Par, ExtRel]    -0.7644      0.579     -1.321      0.187      -1.899       0.370 C(AAFAM2)[T.2 Par, ExtRel]    -0.1417      1.059     -0.134      0.894      -2.217       1.934 C(AAFAM2)[T.None known]       -1.6631      0.470     -3.535      0.000      -2.585      -0.741 ============================================================================== Omnibus:                    40095.815   Durbin-Watson:                   1.986 Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10245364.757 Skew:                           7.153   Prob(JB):                         0.00 Kurtosis:                      90.911   Cond. No.                         18.0 ==============================================================================Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.     
Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================================
   group1        group2    meandiff  lower    upper  reject
------------------------------------------------------------
  1 ExtRel       1 Par      1.0241  -0.5684   2.6167 False
  1 ExtRel   1 Par, ExtRel  0.2597  -1.1335   1.653  False
  1 ExtRel       2 Par      5.6543   1.1339  10.1748  True
  1 ExtRel   2 Par, ExtRel  0.8824  -2.0804   3.8451 False
  1 ExtRel     >1 ExtRel   -0.4659  -1.9278   0.996  False
  1 ExtRel     None known   -0.639  -1.6149   0.337  False
   1 Par     1 Par, ExtRel -0.7644  -2.4711   0.9423 False
   1 Par         2 Par      4.6302   0.0036   9.2569  True
   1 Par     2 Par, ExtRel -0.1417  -3.2642   2.9807 False
   1 Par       >1 ExtRel    -1.49   -3.2532   0.2731 False
   1 Par       None known  -1.6631  -3.0502   -0.276  True
1 Par, ExtRel     2 Par      5.3946   0.8327   9.9565  True
1 Par, ExtRel 2 Par, ExtRel  0.6226   -2.403   3.6483 False
1 Par, ExtRel   >1 ExtRel   -0.7257  -2.3111   0.8598 False
1 Par, ExtRel   None known  -0.8987  -2.0516   0.2541 False
   2 Par     2 Par, ExtRel -4.7719   -10.03   0.4862 False
   2 Par       >1 ExtRel   -6.1202 -10.7035  -1.5369  True
   2 Par       None known  -6.2933 -10.7455  -1.8411  True
2 Par, ExtRel   >1 ExtRel   -1.3483  -4.4061   1.7096 False
2 Par, ExtRel   None known  -1.5213  -4.3789   1.3362 False
 >1 ExtRel     None known  -0.1731  -1.4079   1.0618 False
------------------------------------------------------------
Despite what looked like large error bars, the overall ANOVA gives a very small p-value (Prob (F-statistic) = 3.82e-06, with F = 5.88), indicating that at least one group mean differs from the others. Running the Tukey test to determine for which pairs I can reject the null hypothesis (that there is no difference in drinks/month between people with different alcoholic family histories), six comparisons are flagged as significant:
1 Extended Relative and 2 Parents
1 Parent and 2 Parents
1 Parent and None known
1 Parent + Extended Relatives and 2 Parents
2 Parents and Multiple Extended Relatives
2 Parents and None known
Noticeably, I only got this result when I used .dropna() and I haven’t figured out why yet. I’m going to do some googling and review the lesson.
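My working guess (not verified yet) is that missing values are the culprit: the group means that feed the Tukey table would pick up NaNs if rows with missing DRINKMO or AAFAM2 are left in. A quick diagnostic sketch, reusing subaa1 from the earlier posts:

# How much does dropna() actually remove, and from which column?
print(subaa1[["DRINKMO", "AAFAM2"]].isna().sum())      # NaN count per column

before = len(subaa1)
after = len(subaa1[["DRINKMO", "AAFAM2"]].dropna())
print(before, "rows before dropna,", after, "rows after")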
0 notes
kpenlearnspython-blog · 6 years ago
Text
Week 4: Graph Time
Primary Research Question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Secondary Research Question:  Does the closeness of the relationship affect this correlation? (Parent vs more distant relation)
I tried to blockquote all Python. Written code is italicized. Printed code is not.
For my univariate graphs, I first wanted to look at the number of people with no reported alcohol problem that have family history of alcoholism.
import seaborn as sb
import matplotlib.pyplot as plt #seaborn is dependent upon this to create graphs

#Create univariate graphs to show center and spread.
#This is just general counts.
#Will do number of people with no reported alcohol problem that have family history of alcoholism.
subaa = nesarc[(nesarc['ALCABDEP12DX']==0) & (nesarc['ALCABDEPP12DX']==0)] #No history of alcohol abuse subgroup
subaa1=subaa.copy()
#Now need to make a single variable for those with family history.
subaa1['S2DQ1'] = pd.to_numeric(subaa1['S2DQ1']) #Father
subaa1['S2DQ2'] = pd.to_numeric(subaa1['S2DQ2']) #Mother
subaa1['S2DQ7C2'] = pd.to_numeric(subaa1['S2DQ7C2']) #Dad's bro
subaa1['S2DQ8C2'] = pd.to_numeric(subaa1['S2DQ8C2']) #Dad's sis
subaa1['S2DQ9C2'] = pd.to_numeric(subaa1['S2DQ9C2']) #Mom's bro
subaa1['S2DQ10C2'] = pd.to_numeric(subaa1['S2DQ10C2']) #Mom's sis
subaa1['S2DQ11'] = pd.to_numeric(subaa1['S2DQ11']) #Dad's pa
subaa1['S2DQ12'] = pd.to_numeric(subaa1['S2DQ12']) #Dad's ma
subaa1['S2DQ13A'] = pd.to_numeric(subaa1['S2DQ13A']) #Mom's pa
subaa1['S2DQ13B'] = pd.to_numeric(subaa1['S2DQ13B']) #Mom's ma

#Replace all "nos" to "0"s
subaa1['S2DQ1'] = subaa1['S2DQ1'].replace([2], 0) #Dad
subaa1['S2DQ2'] = subaa1['S2DQ2'].replace([2], 0) #Mom
subaa1['S2DQ7C2'] = subaa1['S2DQ7C2'].replace([2], 0) #D-bro
subaa1['S2DQ8C2'] = subaa1['S2DQ8C2'].replace([2], 0) #D-sis
subaa1['S2DQ9C2'] = subaa1['S2DQ9C2'].replace([2], 0) #M-bro
subaa1['S2DQ10C2'] = subaa1['S2DQ10C2'].replace([2], 0) #M-sis
subaa1['S2DQ11'] = subaa1['S2DQ11'].replace([2], 0) #D-pa
subaa1['S2DQ12'] = subaa1['S2DQ12'].replace([2], 0) #D-ma
subaa1['S2DQ13A'] = subaa1['S2DQ13A'].replace([2], 0) #M-pa
subaa1['S2DQ13B'] = subaa1['S2DQ13B'].replace([2], 0) #M-ma

subaa1['S2DQ1'] = subaa1['S2DQ1'].replace([9], np.nan) #Dad
subaa1['S2DQ2'] = subaa1['S2DQ2'].replace([9], np.nan) #Mom
subaa1['S2DQ7C2'] = subaa1['S2DQ7C2'].replace([9], np.nan) #D-bro
subaa1['S2DQ8C2'] = subaa1['S2DQ8C2'].replace([9], np.nan) #D-sis
subaa1['S2DQ9C2'] = subaa1['S2DQ9C2'].replace([9], np.nan) #M-bro
subaa1['S2DQ10C2'] = subaa1['S2DQ10C2'].replace([9], np.nan) #M-sis
subaa1['S2DQ11'] = subaa1['S2DQ11'].replace([9], np.nan) #D-pa
subaa1['S2DQ12'] = subaa1['S2DQ12'].replace([9], np.nan) #D-ma
subaa1['S2DQ13A'] = subaa1['S2DQ13A'].replace([9], np.nan) #M-pa
subaa1['S2DQ13B'] = subaa1['S2DQ13B'].replace([9], np.nan) #M-ma
def FAM(row):
    if row["S2DQ1"]>0 or row["S2DQ2"]>0 or row["S2DQ7C2"]>0 or row["S2DQ8C2"]>0 or row["S2DQ9C2"]>0 or row["S2DQ10C2"]>0 or row["S2DQ11"]>0 or row["S2DQ12"]>0 or row["S2DQ13A"]>0 or row["S2DQ13B"]>0:
        return 1
    else:
        return 0

subaa1["FAM"] = subaa1.apply(lambda row: FAM(row), axis=1)
subaa12 = subaa1[["FAM", "S2DQ1", "S2DQ2", "S2DQ7C2", "S2DQ8C2", "S2DQ9C2", "S2DQ10C2", "S2DQ11", "S2DQ12", "S2DQ13A", "S2DQ13B"]]
subaa12.head(n=25)
c11=subaa1["FAM"].value_counts(sort=False, dropna=False) p11=subaa1["FAM"].value_counts(sort=False, dropna=False, normalize=True) print("Family history of alcohol abuse or dependence--1 is yes, 0 is no") print(c11)
subaa1["FAM"] = subaa1["FAM"].astype('category') sb.countplot(x="FAM", data=subaa1) plt.xlabel("Presence of alcohol abuse or dependence in previous generation") plt.title("Alcoholism in family history for individuals with no personal history of alcohol abuse or dependence")
Family history of alcohol abuse or dependence--1 is yes, 0 is no
0    19376
1    11874
Name: FAM, dtype: int64
[Image: count plot of FAM (family history of alcoholism: 1 = yes, 0 = no)]
So, the majority of people who do not presently exhibit alcohol abuse or dependence have no family history of alcoholism.
I then wanted to look at histograms of drinks consumed by those not reported to exhibit alcohol abuse/dependence for those with and without family history. I used previously created sub-datasets to visualize this.
sb.distplot(subaafam2["DRINKMO"].dropna(), kde=False);
plt.xlabel("Number of Drinks Per Month")
plt.title("Estimated # of Drinks/Month for those with Family History of Alcoholism")
sb.distplot(subaafam4["DRINKMO"].dropna(), kde=False);
plt.xlabel("Number of Drinks Per Month")
plt.title("Estimated # of Drinks/Month for those with NO Family History of Alcoholism")
[Images: histograms of estimated drinks/month for those with and without a family history of alcoholism]
For both of these data sets, as I have seen in previous counts, the majority of people consume very few drinks, though there are some high outliers. This makes these histograms, on the whole, fairly uninformative.
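One thing I might try later to make the tails visible (just a sketch, not something I ran for this assignment) is putting the counts on a log scale:

# Same histogram as above, but with log-scaled counts so the sparse
# heavy-drinking tail shows up (reuses subaafam2 from the earlier posts).
sb.distplot(subaafam2["DRINKMO"].dropna(), kde=False)
plt.yscale("log")
plt.xlabel("Number of Drinks Per Month")
plt.title("Drinks/Month (log-scaled counts), Family History of Alcoholism")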
Next, I wanted to examine a bivariate relationship between the average number of drinks per month and family history of alcoholism.
subaa1['S2AQ8A'] = pd.to_numeric(subaa1['S2AQ8A'],errors='coerce').fillna(0).astype(int)
subaa1.loc[(subaa1["S2AQ3"]!=9) & (subaa1["S2AQ8A"]==0), "S2AQ8A"]=11
subaa1["USFREQ"]=subaa1["S2AQ8A"].map(recode1)
subaa1['S2AQ8B'] = pd.to_numeric(subaa1['S2AQ8B'],errors='coerce').fillna(0).astype(int)
subaa1['S2AQ8B']=subaa1['S2AQ8B'].replace(99, np.nan)
#print(subaafam4["S2AQ8B"].value_counts(sort=False, dropna=False)) #Just to make sure this worked
subaa1["DRINKMO"]=subaa1["S2AQ8B"]*subaa1["USFREQ"]/12
print(max(subaa1["DRINKMO"]))
subaa1['DRINKMO10']=pd.qcut(subaa1.DRINKMO, 10, duplicates="drop", labels=["1=50%tile", "2=60%tile", "3=70%tile", "4=80%tile", "5=90%tile", "6=100%tile"])
subaa1["DRINKMO10"].head(n=25)
subaa1["FAM2"]=subaa1["FAM"]
subaa1["FAM2"]=subaa1["FAM2"].cat.rename_categories(["No Family History", "Family History"])
sb.factorplot(x='FAM2', y='DRINKMO', data=subaa1, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Average of Drinks/Month')
[Image: bar chart of average drinks/month by family history of alcoholism]
Interestingly, the average number of drinks/month is slightly higher for those with family history of alcoholism (though I’m not sure if this is statistically significant).
Finally, I wanted to explore whether there was any relationship between the number of drinks/month and alcoholic family history; specifically, whether the degree of relationship correlates with drink consumption.
parents = np.dstack((subaa1["S2DQ1"],subaa1["S2DQ2"])) #making variables of interest into a single array print(parents.shape) #double checking (1, 31250, 2)
parents2 = np.nansum(parents,2) #adding the two elements together on the correct dimension print(parents2.T.shape) (31250, 1)
subaa1["AAFAMPAR2"]=parents2.T #adding the column into the dataset as a new variable extfam = np.dstack((subaa1["S2DQ7C2"],subaa1["S2DQ8C2"],subaa1["S2DQ9C2"],subaa1["S2DQ10C2"],subaa1["S2DQ11"],subaa1["S2DQ12"],subaa1["S2DQ13A"],subaa1["S2DQ13B"])) print(extfam.shape) (1, 31250, 8)
extfam2=np.nansum(extfam, 2) print(extfam2.shape) (1, 31250)
subaa1["AAFAMEXT2"]=extfam2.Tdef AAFAM2(row):    if row["AAFAMPAR2"]==1 and row["AAFAMEXT2"]==0:        return 1 #1 alcoholic parent    if row["AAFAMPAR2"]==2 and row["AAFAMEXT2"]==0:        return 2 #both alcoholic parents    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]==1:        return 3 #one alcoholic relative    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]>1:        return 4 #multiple alcoholic relatives    if row["AAFAMPAR2"]==1 and row["AAFAMEXT2"]>0:        return 5 #One alcoholic parent and at least 1 alcoholic relative    if row["AAFAMPAR2"]>1 and row["AAFAMEXT2"]>0:        return 6 #Both parents and at least 1 relative    if row["AAFAMPAR2"]==0 and row["AAFAMEXT2"]==0:        return 7 #No known alcoholic family historyprint("For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)" + "\n"      "Number code:" + "\n"      "1. A single alcoholic parent" + "\n"      "2. Two alcoholic parents" + "\n"      "3. One alcoholic extended relative" + "\n"      "4. Multiple alcoholic extended relatives" + "\n"      "5. One alcoholic parents and at least one alcoholic relative" + "\n"      "6. Both parents and at least one extended relative" + "\n"      "7. No known alcoholic family history") subaa1["AAFAM2"] = subaa1.apply(lambda row: AAFAM2(row), axis=1) c13=subaa1["AAFAM2"].value_counts(sort=False, dropna=False) p13=subaa1["AAFAM2"].value_counts(sort=False, dropna=False, normalize=True) print("Percentages") print(p13.sort_index()) print("Counts") print(c13.sort_index()) For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family) Number code: 1. A single alcoholic parent 2. Two alcoholic parents 3. One alcoholic extended relative 4. Multiple alcoholic extended relatives 5. One alcoholic parents and at least one alcoholic relative 6. Both parents and at least one extended relative 7. No known alcoholic family history Percentages 1   0.059872 2   0.005376 3   0.133792 4   0.077440 5   0.090368 6   0.013120 7   0.620032 Name: AAFAM2, dtype: float64 Counts 1     1871 2      168 3     4181 4     2420 5     2824 6      410 7    19376 Name: AAFAM2, dtype: int64
subaa1["AAFAM2"]=subaa1["AAFAM2"].astype("category") subaa1["AAFAM2"]=subaa1["AAFAM2"].cat.rename_categories(["1 Par", "2 Par", "1 ExtRel", ">1 ExtRel", "1 Par, ExtRel", "2 Par, ExtRel", "None known"])
sb.factorplot(x='AAFAM2', y='DRINKMO', data=subaa1, kind="bar")
plt.xlabel('Family Alcoholism')
plt.ylabel('Avg of Drinks/Month')
plt.xticks(rotation = 45)
[Image: bar chart of average drinks/month by family-history category (AAFAM2)]
Interestingly, those with two alcoholic parents seem to have a slightly higher average number of drinks/month, though the variance on this is large, so it's hard to know if it's statistically significant. Worth noting, we don't see the same bump for those who have 2 parents and extended relatives with alcoholism.
Summary:
Among people with no personal history of alcohol abuse or dependence, more have no family history of alcohol abuse/dependence than have a family history.
When looking at counts of average drinks/month for those with and without family history, the vast majority of people in both data sets consume fairly few drinks.
Despite this, if you look at the average number of drinks/month for those with and without family history, those with family history have a slightly higher average. I'm not sure whether this is statistically significant or not.
Furthermore, if you look at the relationship of specific alcoholic family members to alcohol consumption for those with no history of alcohol abuse/dependence, you see a higher average consumption for those with 2 alcoholic parents. That said, the variance is high there, so it is unclear whether the difference is statistically significant.
0 notes
kpenlearnspython-blog · 6 years ago
Text
Week 3: Python Hell
Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
I changed my statistic to drinks/month in this post, and I derive that distribution below.
Secondary research question: Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
Full disclosure--I haven’t touched the present vs. absent thing yet.
This week I try to sort my data better, and, quite frankly, I'm not sure I did everything right. But I tried. I'm beginning to think I made things too complicated for myself. Too late now. Code is blockquoted, with raw code in italics and plain text being the output.
First, I need to re-do the grouping for the abstainers vs. did not answer. I realized in watching the lesson videos that those were not exactly the same.
#subaafam2 is for those that have an alcoholic family history. #subaafam4 is for those with no alcoholic family history print("This data is for individuals who, as reported by this study, have not experienced alcohol abuse/dependence.")
print("Key:" + "\n"    "0. Did not answer." +"\n"    "1. Every day" +"\n"    "2. Nearly every day" +"\n"    "3. 3 to 4 times a week" +"\n"    "4. 2 times a week" +"\n"    "5. Once a week" +"\n"    "6. 2 to 3 times a month" +"\n"    "7. Once a month" +"\n"    "8. 7 to 11 times in the last year" +"\n"    "9. 3 to 6 times in the last year" +"\n"    "10. 1 or 2 times in the last year" +"\n"    "11. Have no drinken in the past 12 months." + "\n"    "99. Unknown")
subaafam4['S2AQ8A'] = pd.to_numeric(subaafam4['S2AQ8A'],errors='coerce').fillna(0).astype(int) subaafam4.loc[(subaafam4["S2AQ3"]!=9) & (subaafam4["S2AQ8A"]==0), "S2AQ8A"]=11 print("Distribution of alcohohol consumption frequency for those with NO family history of alcoholism:") subaafam4['S2AQ8A'] = pd.to_numeric(subaafam4['S2AQ8A']) print("Percentages") p4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p4.sort_index()) print("Counts") c4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False) print(c4.sort_index())
subaafam2['S2AQ8A'] = pd.to_numeric(subaafam2['S2AQ8A'],errors='coerce').fillna(0).astype(int) subaafam2.loc[(subaafam2["S2AQ3"]!=9) & (subaafam2["S2AQ8A"]==0), "S2AQ8A"]=11 print("Percetage of alcohohol consumption frequency for those with family history of alcoholism:") subaafam2['S2AQ8A'] = pd.to_numeric(subaafam2['S2AQ8A']) print("Percentages") p5 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p5.sort_index()) print("Counts") c5 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False) print(c5.sort_index())
This data is for individuals who, as reported by this study, have not experienced alcohol abuse/dependence. Key: 0. Did not answer. 1. Every day 2. Nearly every day 3. 3 to 4 times a week 4. 2 times a week 5. Once a week 6. 2 to 3 times a month 7. Once a month 8. 7 to 11 times in the last year 9. 3 to 6 times in the last year 10. 1 or 2 times in the last year 11. Have no drinken in the past 12 months. 99. Unknown Distribution of alcohohol consumption frequency for those with NO family history of alcoholism: Percentages 0    0.000086 1    0.030639 2    0.017079 3    0.035530 4    0.049176 5    0.064796 6    0.067456 7    0.061535 8    0.037590 9    0.069945 10   0.089341 11   0.471765 99   0.005064 Name: S2AQ8A, dtype: float64 Counts 0        1 1      357 2      199 3      414 4      573 5      755 6      786 7      717 8      438 9      815 10    1041 11    5497 99      59 Name: S2AQ8A, dtype: int64 Percetage of alcohohol consumption frequency for those with family history of alcoholism: Percentages 0    0.000253 1    0.029308 2    0.016928 3    0.043119 4    0.045393 5    0.064511 6    0.081270 7    0.064426 8    0.050109 9    0.104514 10   0.115968 11   0.382095 99   0.002105 Name: S2AQ8A, dtype: float64 Counts 0        3 1      348 2      201 3      512 4      539 5      766 6      965 7      765 8      595 9     1241 10    1377 11    4537 99      25 Name: S2AQ8A, dtype: int64
Second, I want an actual number of days on which alcohol was consumed, so I need to recode the S2AQ8A frequency categories into an approximate number of drinking days per year for each individual.
recode1={1:365, 2:300, 3:180, 4:104, 5:52, 6:30, 7:12, 8:9, 9:4.5, 10:1.5, 11:0} subaafam4["USFREQ"]=subaafam4["S2AQ8A"].map(recode1) print("Approximate number of days alcohol was consumed in the last 12 months for those with NO family history of alcoholism:") print("Percentages") p6 = subaafam4['USFREQ'].value_counts(sort=False, dropna=False, normalize=True) print(p6.sort_index()) print("Counts") c6 = subaafam4['USFREQ'].value_counts(sort=False, dropna=False) print(c6.sort_index())
print("Approximate number of days alcohol was consumed in the last 12 months for those with family history of alcoholism:") subaafam2["USFREQ"]=subaafam2["S2AQ8A"].map(recode1) print("Percentages") p7 = subaafam2['USFREQ'].value_counts(sort=False, dropna=False, normalize=True) print(p7.sort_index()) print("Counts") c7 = subaafam2['USFREQ'].value_counts(sort=False, dropna=False) print(c7.sort_index()) Approximate number of days alcohol was consumed in the last 12 months for those with NO family history of alcoholism: Percentages 0.000000     0.471765 1.500000     0.089341 4.500000     0.069945 9.000000     0.037590 12.000000    0.061535 30.000000    0.067456 52.000000    0.064796 104.000000   0.049176 180.000000   0.035530 300.000000   0.017079 365.000000   0.030639 nan          0.005149 Name: USFREQ, dtype: float64 Counts 0.000000      5497 1.500000      1041 4.500000       815 9.000000       438 12.000000      717 30.000000      786 52.000000      755 104.000000     573 180.000000     414 300.000000     199 365.000000     357 nan             60 Name: USFREQ, dtype: int64 Approximate number of days alcohol was consumed in the last 12 months for those with family history of alcoholism: Percentages 0.000000     0.382095 1.500000     0.115968 4.500000     0.104514 9.000000     0.050109 12.000000    0.064426 30.000000    0.081270 52.000000    0.064511 104.000000   0.045393 180.000000   0.043119 300.000000   0.016928 365.000000   0.029308 nan          0.002358 Name: USFREQ, dtype: float64 Counts 0.000000      4537 1.500000      1377 4.500000      1241 9.000000       595 12.000000      765 30.000000      965 52.000000      766 104.000000     539 180.000000     512 300.000000     201 365.000000     348 nan             28 Name: USFREQ, dtype: int64
Third, I need to multiply the result from USFREQ (number of days drinking) by S2AQ8B (number of drinks usually consumed on days when drinking) to get total number of drinks consumed in a year. Then, I’ll divide this by 12 to get # drinks consumed/month (on average).
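As a quick sanity check of that arithmetic (made-up values, not taken from the dataset):

# Hypothetical respondent: frequency code 6 ("2 to 3 times a month") maps to
# 30 drinking days/year in recode1, and they usually have 3 drinks per occasion.
days_per_year = 30
drinks_per_occasion = 3

drinks_per_month = days_per_year * drinks_per_occasion / 12
print(drinks_per_month)  # 7.5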
subaafam4['S2AQ8B'] = pd.to_numeric(subaafam4['S2AQ8B'],errors='coerce').fillna(0).astype(int) subaafam4['S2AQ8B']=subaafam4['S2AQ8B'].replace(99, np.nan) #print(subaafam4["S2AQ8B"].value_counts(sort=False, dropna=False))#Just to make sure this worked subaafam4["DRINKMO"]=subaafam4["S2AQ8B"]*subaafam4["USFREQ"]/12 print("Max drinks/month") print(max(subaafam4["DRINKMO"])) #Get the idea of what the highest number needs to be. print("Average drinks per month for those with NO family history of alcoholism:") print("NaN indicates the amout of drinks consumed was unknown") #Split into categories (0-0.5), (0.5+-1), (1+-3), (3+-7.5), (7.5+-15), (15+-30), (30+-60), (60+) subaafam4["DRINKMO"]=pd.cut(subaafam4.DRINKMO, [-1,0.5,1,3,7.5,15,30,60,520]) print("Percentages") p8=subaafam4["DRINKMO"].value_counts(sort=False, dropna=False, normalize=True) print(p8.sort_index()) print("Counts") c8=subaafam4["DRINKMO"].value_counts(sort=False, dropna=False) print(c8.sort_index())
subaafam2['S2AQ8B'] = pd.to_numeric(subaafam2['S2AQ8B'],errors='coerce').fillna(0).astype(int) subaafam2['S2AQ8B']=subaafam2['S2AQ8B'].replace(99, np.nan) #print(subaafam2["S2AQ8B"].value_counts(sort=False, dropna=False))#Just to make sure this worked subaafam2["DRINKMO"]=subaafam2["S2AQ8B"]*subaafam2["USFREQ"]/12 print("Max drinks/month") print(max(subaafam2["DRINKMO"])) #Get the idea of what the highest number needs to be. print("Average drinks per month for those with family history of alcoholism") #Split into categories (0-0.5), (0.5+-1), (1+-3), (3+-7.5), (7.5+-15), (15+-30), (30+-60), (60+) subaafam2["DRINKMO"]=pd.cut(subaafam2.DRINKMO, [-1,0.5,1,3,7.5,15,30,60,520]) print("Percentages") p9=subaafam2["DRINKMO"].value_counts(sort=False, dropna=False, normalize=True) print(p9.sort_index()) print("Counts") c9=subaafam2["DRINKMO"].value_counts(sort=False, dropna=False) print(c9.sort_index()) Max drinks/month 517.0833333333334 Average drinks per month for those with NO family history of alcoholism: NaN indicates the amout of drinks consumed was unknown Percentages NaN             0.005149 (-1.0, 0.5]     0.607878 (0.5, 1.0]      0.067199 (1.0, 3.0]      0.073721 (3.0, 7.5]      0.064624 (7.5, 15.0]     0.064281 (15.0, 30.0]    0.054926 (30.0, 60.0]    0.038363 (60.0, 520.0]   0.023859 Name: DRINKMO, dtype: float64 Counts NaN                60 (-1.0, 0.5]      7083 (0.5, 1.0]        783 (1.0, 3.0]        859 (3.0, 7.5]        753 (7.5, 15.0]       749 (15.0, 30.0]      640 (30.0, 60.0]      447 (60.0, 520.0]     278 Name: DRINKMO, dtype: int64 Max drinks/month 395.4166666666667 Average drinks per month for those with family history of alcoholism Percentages NaN             0.002358 (-1.0, 0.5]     0.562911 (0.5, 1.0]      0.084049 (1.0, 3.0]      0.088092 (3.0, 7.5]      0.072343 (7.5, 15.0]     0.067458 (15.0, 30.0]    0.057268 (30.0, 60.0]    0.037056 (60.0, 520.0]   0.028466 Name: DRINKMO, dtype: float64 Counts NaN                28 (-1.0, 0.5]      6684 (0.5, 1.0]        998 (1.0, 3.0]       1046 (3.0, 7.5]        859 (7.5, 15.0]       801 (15.0, 30.0]      680 (30.0, 60.0]      440 (60.0, 520.0]     338 Name: DRINKMO, dtype: int64
Fourth, I will try to re-divide the data by family relationship, with categories for: one parent only, both parents only, one extended relative only, more than one extended relative, one parent plus extended relative(s), and both parents plus extended relative(s).
subaafam2['S2DQ1'] = pd.to_numeric(subaafam2['S2DQ1']) #Father subaafam2['S2DQ2'] = pd.to_numeric(subaafam2['S2DQ2']) #Mother subaafam2['S2DQ7C2'] = pd.to_numeric(subaafam2['S2DQ7C2']) #Dad's bro subaafam2['S2DQ8C2'] = pd.to_numeric(subaafam2['S2DQ8C2']) #Dad's sis subaafam2['S2DQ9C2'] = pd.to_numeric(subaafam2['S2DQ9C2']) #Mom's bro subaafam2['S2DQ10C2'] = pd.to_numeric(subaafam2['S2DQ10C2']) #Mom's sis subaafam2['S2DQ11'] = pd.to_numeric(subaafam2['S2DQ11']) #Dad's pa subaafam2['S2DQ12'] = pd.to_numeric(subaafam2['S2DQ12']) #Dad's ma subaafam2['S2DQ13A'] = pd.to_numeric(subaafam2['S2DQ13A']) #Mom's pa subaafam2['S2DQ13B'] = pd.to_numeric(subaafam2['S2DQ13B']) #Mom's ma
#Replace all "nos" to "0"s subaafam2['S2DQ1'] = subaafam2['S2DQ1'].replace([2], 0) #Dad subaafam2['S2DQ2'] = subaafam2['S2DQ2'].replace([2], 0) #Mom subaafam2['S2DQ7C2'] = subaafam2['S2DQ7C2'].replace([2], 0) #D-bro subaafam2['S2DQ8C2'] = subaafam2['S2DQ8C2'].replace([2], 0) #D-sis subaafam2['S2DQ9C2'] = subaafam2['S2DQ9C2'].replace([2], 0) #M-bro subaafam2['S2DQ10C2'] = subaafam2['S2DQ10C2'].replace([2], 0) #M-sis subaafam2['S2DQ11'] = subaafam2['S2DQ11'].replace([2], 0) #D-pa subaafam2['S2DQ12'] = subaafam2['S2DQ12'].replace([2], 0) #D-ma subaafam2['S2DQ13A'] = subaafam2['S2DQ13A'].replace([2], 0) #M-pa subaafam2['S2DQ13B'] = subaafam2['S2DQ13B'].replace([2], 0) #M-ma
subaafam2['S2DQ1'] = subaafam2['S2DQ1'].replace([9], np.nan) #Dad subaafam2['S2DQ2'] = subaafam2['S2DQ2'].replace([9], np.nan) #Mom subaafam2['S2DQ7C2'] = subaafam2['S2DQ7C2'].replace([9], np.nan) #D-bro subaafam2['S2DQ8C2'] = subaafam2['S2DQ8C2'].replace([9], np.nan) #D-sis subaafam2['S2DQ9C2'] = subaafam2['S2DQ9C2'].replace([9], np.nan) #M-bro subaafam2['S2DQ10C2'] = subaafam2['S2DQ10C2'].replace([9], np.nan) #M-sis subaafam2['S2DQ11'] = subaafam2['S2DQ11'].replace([9], np.nan) #D-pa subaafam2['S2DQ12'] = subaafam2['S2DQ12'].replace([9], np.nan) #D-ma subaafam2['S2DQ13A'] = subaafam2['S2DQ13A'].replace([9], np.nan) #M-pa subaafam2['S2DQ13B'] = subaafam2['S2DQ13B'].replace([9], np.nan) #M-ma
subaafam2["AAFAMPAR"]=subaafam2["S2DQ1"]+subaafam2["S2DQ2"] subaafam2["AAFAMEXT"]=subaafam2["S2DQ7C2"]+subaafam2["S2DQ8C2"]+subaafam2["S2DQ9C2"]+subaafam2["S2DQ10C2"]+subaafam2["S2DQ11"]+subaafam2["S2DQ12"]+subaafam2["S2DQ13A"]+subaafam2["S2DQ13B"]
def AAFAM(row):    if row["AAFAMPAR"]==1 and row["AAFAMEXT"]==0:        return 1 #1 alcoholic parent    if row["AAFAMPAR"]==2 and row["AAFAMEXT"]==0:        return 2 #both alcoholic parents    if row["AAFAMPAR"]==0 and row["AAFAMEXT"]==1:        return 3 #one alcoholic relative    if row["AAFAMPAR"]==0 and row["AAFAMEXT"]>1:        return 4 #multiple alcoholic relatives    if row["AAFAMPAR"]==1 and row["AAFAMEXT"]>0:        return 5 #One alcoholic parent and at least 1 alcoholic relative    if row["AAFAMPAR"]>1 and row["AAFAMEXT"]>0:        return 6 #Both parents and at least 1 relative
print("For those with family history of alcohol abuse/dependece, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)" + "\n"      "Number code:" + "\n"      "1. A single alcoholic parent" + "\n"      "2. Two alcoholic parents" + "\n"      "3. One alcoholic extended relative" + "\n"      "4. Multiple alcoholic extended relatives" + "\n"      "5. One alcoholic parents and at least one alcoholic relative" + "\n"      "6. Both parents and at least one extended relative" + "\n"      "nan A parent/relative alcohol abuse/dependence was unknown") subaafam2["AAFAM"] = subaafam2.apply(lambda row: AAFAM(row), axis=1) c10=subaafam2["AAFAM"].value_counts(sort=False, dropna=False) p10=subaafam2["AAFAM"].value_counts(sort=False, dropna=False, normalize=True) print("Percentages") print(p10.sort_index()) print("Counts") print(c10.sort_index())
subaafam21 = subaafam2[["AAFAM", "S2DQ1", "S2DQ2", "S2DQ7C2", "S2DQ8C2", "S2DQ9C2", "S2DQ10C2", "S2DQ11", "S2DQ12", "S2DQ13A", "S2DQ13B"]]
a = subaafam21.head(n=25)
print("Categorization confirmation:")
print(a)

For those with family history of alcohol abuse/dependence, but no personal history, I wanted to look at the distribution of family with alcoholism (parents vs extended family)
Number code:
1. A single alcoholic parent
2. Two alcoholic parents
3. One alcoholic extended relative
4. Multiple alcoholic extended relatives
5. One alcoholic parent and at least one alcoholic relative
6. Both parents and at least one extended relative
nan A parent/relative alcohol abuse/dependence was unknown
Percentages
1.000000   0.069227
2.000000   0.005727
3.000000   0.222335
4.000000   0.138538
5.000000   0.130537
6.000000   0.018107
nan        0.415530
Name: AAFAM, dtype: float64
Counts
1.000000     822
2.000000      68
3.000000    2640
4.000000    1645
5.000000    1550
6.000000     215
nan         4934
Name: AAFAM, dtype: int64
Categorization confirmation:
      AAFAM    S2DQ1    S2DQ2   ...      S2DQ12  S2DQ13A  S2DQ13B
0  1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
1  5.000000 1.000000 0.000000   ...    0.000000 1.000000 0.000000
6  5.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
14      nan 1.000000 0.000000   ...         nan      nan 0.000000
19 4.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
20 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
25 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
27      nan 0.000000 0.000000   ...         nan      nan      nan
28 4.000000 0.000000 0.000000   ...    0.000000 1.000000 0.000000
33      nan 0.000000 0.000000   ...    0.000000      nan 0.000000
37 5.000000 1.000000 0.000000   ...    0.000000 1.000000 0.000000
50 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
58      nan 1.000000 0.000000   ...         nan      nan      nan
60 5.000000 0.000000 1.000000   ...    0.000000 0.000000 0.000000
62      nan 1.000000 0.000000   ...         nan      nan      nan
63 3.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
64      nan 1.000000 0.000000   ...         nan      nan      nan
70 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
74      nan 0.000000 0.000000   ...    0.000000 1.000000 1.000000
75      nan 1.000000 0.000000   ...    0.000000      nan 0.000000
76 4.000000 0.000000 0.000000   ...    0.000000 0.000000 0.000000
85      nan 0.000000 0.000000   ...    0.000000 1.000000 1.000000
87      nan 1.000000 0.000000   ...         nan 0.000000 0.000000
90 1.000000 1.000000 0.000000   ...    0.000000 0.000000 0.000000
91      nan 1.000000 0.000000   ...         nan      nan      nan
[25 rows x 11 columns]
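Looking ahead to the secondary research question, these AAFAM categories could eventually be crossed against the binned drinks-per-month variable. A minimal sketch of what that comparison might look like (not run as part of this step; it assumes AAFAM and the binned DRINKMO from above both exist on subaafam2):

#Row-normalized crosstab: within each family-relationship category, the share of
#respondents falling into each drinks-per-month bin
ct = pd.crosstab(subaafam2["AAFAM"], subaafam2["DRINKMO"], normalize='index')
print(ct.round(3))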
Week 2
Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
I’m doing more than just three frequency tables, and I’ll explain the logic behind each one.
To answer my research question, I first need to verify that a substantial percentage of the total population has never had alcohol abuse/dependence, either in the last 12 months or before that. I check this here:
import pandas
import numpy
nesarc = pandas.read_csv('NESARC Data.csv', low_memory=False)
#To avoid runtime errors with the display format (prints floats in plain notation)
pandas.set_option('display.float_format', lambda x: '%f' % x)
print("0. No alcohol diagnosis" + "\n"      "1. Alcohol abuse only" + "\n"      "2. Alcohol dependence only" + "\n"      "3. Alcohol abuse and dependence") print("Percentage of Alcohol abuse/dependence in last 12 months") aa12_p = nesarc['ALCABDEP12DX'].value_counts(sort=False, normalize=True) print(aa12_p)
# Alcohol abuse/dependence prior to last 12 months / 3649-3649  ALCABDEPP12DX
print("Percentage of Alcohol abuse/dependence prior to last 12 months")
aa0_p = nesarc['ALCABDEPP12DX'].value_counts(sort=False, normalize=True)
print(aa0_p)
0. No alcohol diagnosis
1. Alcohol abuse only
2. Alcohol dependence only
3. Alcohol abuse and dependence
Percentage of Alcohol abuse/dependence in last 12 months
0   0.922795
1   0.042768
2   0.012833
3   0.021604
Name: ALCABDEP12DX, dtype: float64
Percentage of Alcohol abuse/dependence prior to last 12 months
0   0.735085
1   0.162300
2   0.013065
3   0.089551
Name: ALCABDEPP12DX, dtype: float64
These statistics confirm that a large percentage of the data set has no alcohol diagnosis (0), so the sample will still be plenty large after I exclude everyone with any history of alcohol abuse/dependence. To do this, I took a subset of the data.
subaa = nesarc[(nesarc['ALCABDEP12DX']==0) & (nesarc['ALCABDEPP12DX']==0)]
subaa1=subaa.copy()
Next, I want to look at people with a reported family history of alcohol abuse. I just need them to have answered yes (1) to at least one of the questions asking about an elder relative, so I strung everything together with “or” (|) statements. I then wanted to look at the percentage of people with no previous alcohol abuse/dependence who have an elder family member with alcohol abuse/dependence. For relative reference:
Alcoholic dad / 653-653   S2DQ1
Alcoholic mom / 654-654  S2DQ2
Alcoholic uncle / 679-679, 689-689  S2DQ7C2 S2DQ9C2
Alcoholic aunt / 684-684, 694-694  S2DQ8C2   S2DQ10C2
Alcoholic grandpa / 695-695, 697-697  S2DQ11 S2DQ13A
Alcoholic grandma / 696-696, 698-698  S2DQ12 S2DQ13B
subaafam20 = subaa1[(subaa1['S2DQ1']==1) | (subaa1['S2DQ2']==1) |
        (subaa1['S2DQ7C2']==1) | (subaa1['S2DQ9C2']==1) | (subaa1['S2DQ8C2']==1) |
        (subaa1['S2DQ10C2']==1) | (subaa1['S2DQ11']==1) | (subaa1['S2DQ13A']==1) |
        (subaa1['S2DQ12']==1) | (subaa1['S2DQ13B']==1)]
subaafam2=subaafam20.copy()
#Then, I will compare the length of this data set to the length of the
#non-diagnosed data set to get a percentage.
print("The percentage of those with no history of alcohol dependence/abuse with a known, elder relative with alcoholism:")
print(len(subaafam2)/len(subaa1)*100)

The percentage of those with no history of alcohol dependence/abuse with a known, elder relative with alcoholism:
37.9968
I will also need a separate subset for those with no known alcoholic family member (answered no, or 2, to every question about an alcoholic family member). I am excluding anyone for whom any relative’s status is unknown. This is an & subset, since the condition needs to apply to every family member.
subaafam40 = subaa1[(subaa1['S2DQ1']==2) & (subaa1['S2DQ2']==2) &
        (subaa1['S2DQ7C2']==2) & (subaa1['S2DQ9C2']==2) & (subaa1['S2DQ8C2']==2) &
        (subaa1['S2DQ10C2']==2) & (subaa1['S2DQ11']==2) & (subaa1['S2DQ13A']==2) &
        (subaa1['S2DQ12']==2) & (subaa1['S2DQ13B']==2)]
subaafam4=subaafam40.copy()
print("The percentage of those with no history of alcohol dependence/abuse with no family history of alcoholism:") print(len(subaafam4)/len(subaa1)*100) The percentage of those with no history of alcohol dependence/abuse with no family history of alcoholism: 37.2864
So, a similar percentage of respondents have a family history of alcohol abuse/dependence as don’t. Respondents for whom any family member’s alcohol abuse/dependence is unknown have been dropped from both subsets, which is why the two percentages don’t sum to 100%.
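To make that concrete, the share of the no-diagnosis group that was dropped because at least one relative’s status was unknown can be computed directly from the subsets above (a small sketch):

dropped = 1 - (len(subaafam2) + len(subaafam4)) / len(subaa1)
print("Percentage dropped due to unknown family history:")
print(dropped * 100) #roughly 100 - 38.0 - 37.3, i.e. about 24.7%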
Now I’ll use these subsets to look at the data.
First, I want to look at the percentage distribution of responses for the number of alcoholic drinks on days when drinking, for those with no personal history of alcohol abuse/dependence and no family history of alcohol abuse/dependence. The number in the first column is the number of drinks.
subaafam4['S2AQ8B'] = pandas.to_numeric(subaafam4['S2AQ8B'], errors='coerce').fillna(0).astype(int)
print("Percentage of # drinks on days when drinking for those with NO family history:")
p2 = subaafam4['S2AQ8B'].value_counts(sort=False, dropna=False, normalize=True)
print(p2.sort_index())

Percentage of # drinks on days when drinking for those with NO family history:
0    0.471850
1    0.263560
2    0.157140
3    0.057587
4    0.020512
5    0.006351
6    0.011929
7    0.001459
8    0.001459
9    0.000429
10   0.000944
12   0.002060
15   0.000172
17   0.000086
20   0.000086
24   0.000086
99   0.004291
Name: S2AQ8B, dtype: float64
The largest percentage of people in this data set abstain from drinking (0). 99 accounts for people where drinking is unknown. 
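As a side note, the 99 “unknown” responses currently sit in the table as if they were a drink count. A minimal sketch of how they could be treated as missing before computing the percentages (just an option, not what was run above; it relies on the numpy import at the top of this post):

#Recode "unknown" (99) to NaN so the percentages only cover known responses
known_drinks = subaafam4['S2AQ8B'].replace(99, numpy.nan)
p2_known = known_drinks.value_counts(normalize=True)
print(p2_known.sort_index())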
For those with family history:
subaafam2['S2AQ8B'] = pandas.to_numeric(subaafam2['S2AQ8B'],errors='coerce').fillna(0).astype(int)
print("Percentage of # drinks on days when drinking for those with family history:") p1 = subaafam2['S2AQ8B'].value_counts(sort=False, dropna=False, normalize=True) print(p1.sort_index()) Percentage of # drinks on days when drinking for those with family history: 0    0.382348 1    0.296109 2    0.182500 3    0.073859 4    0.026697 5    0.010190 6    0.015580 7    0.002358 8    0.002358 9    0.000421 10   0.001937 12   0.001600 13   0.000253 14   0.000084 15   0.000337 20   0.000084 24   0.000168 48   0.000084 99   0.003032 Name: S2AQ8B, dtype: float64
Perhaps interestingly, the group with family history of alcoholism has a somewhat lower percentage of people who abstain from drinking. There seems to be a slight increase in people who have 1-2 drinks. Overall, there’s no striking difference with the previous data set, and it’s hard to gauge without further analysis.
Finally, I wanted to look at the percent distribution of responses for how often these individuals reported drinking in the last 12 months.
print("Key:" + "\n"    "0. NA, former drinker or lifetime abstainer" +"\n"    "1. Every day" +"\n"    "2. Nearly every day" +"\n"    "3. 3 to 4 times a week" +"\n"    "4. 2 times a week" +"\n"    "5. Once a week" +"\n"    "6. 2 to 3 times a month" +"\n"    "7. Once a month" +"\n"    "8. 7 to 11 times in the last year" +"\n"    "9. 3 to 6 times in the last year" +"\n"    "10. 1 or 2 times in the last year" +"\n"    "99. Unknown")
subaafam2['S2AQ8A'] = pandas.to_numeric(subaafam2['S2AQ8A'],errors='coerce').fillna(0).astype(int)
print("Percentage of alcohol consumption frequency for those with family history of alcoholism:") p3 = subaafam2['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p3.sort_index())
subaafam4['S2AQ8A'] = pandas.to_numeric(subaafam4['S2AQ8A'],errors='coerce').fillna(0).astype(int)
print("Percentage of alcohol consumption frequency for those with NO family history of alcoholism:") p4 = subaafam4['S2AQ8A'].value_counts(sort=False, dropna=False, normalize=True) print(p4.sort_index()) Key: 0. NA, former drinker or lifetime abstainer 1. Every day 2. Nearly every day 3. 3 to 4 times a week 4. 2 times a week 5. Once a week 6. 2 to 3 times a month 7. Once a month 8. 7 to 11 times in the last year 9. 3 to 6 times in the last year 10. 1 or 2 times in the last year 99. Unknown Percentage of alcohol consumption frequency for those with family history of alcoholism: 0    0.382348 1    0.029308 2    0.016928 3    0.043119 4    0.045393 5    0.064511 6    0.081270 7    0.064426 8    0.050109 9    0.104514 10   0.115968 99   0.002105 Name: S2AQ8A, dtype: float64 Percentage of alcohol consumption frequency for those with NO family history of alcoholism: 0    0.471850 1    0.030639 2    0.017079 3    0.035530 4    0.049176 5    0.064796 6    0.067456 7    0.061535 8    0.037590 9    0.069945 10   0.089341 99   0.005064 Name: S2AQ8A, dtype: float64
Once again, as we would expect given the previous data set, we see the same trend that the individuals with no family history of alcoholism have a higher percentage of those abstaining from alcohol than those with family history. On their face, the other percentages seem fairly similar.
In future data analysis, I might need to combine the number of drinks consumed with the alcohol consumption frequency to form a rate for each individual. That rate would give a better picture of each person’s typical alcohol consumption.
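A rough sketch of what that rate calculation might look like. The drinking-days-per-year values below are my own approximations for each S2AQ8A code, not an official recode, so the exact numbers would need to be revisited:

#Map each S2AQ8A frequency code to an approximate number of drinking days per year
days_per_year = {1: 365, 2: 286, 3: 182, 4: 104, 5: 52,
                 6: 30, 7: 12, 8: 9, 9: 4.5, 10: 1.5}

drinks = subaafam2['S2AQ8B'].replace(99, numpy.nan)   #treat "unknown" as missing
freq = subaafam2['S2AQ8A'].map(days_per_year)         #codes 0 and 99 drop out as NaN
drinks_per_month = drinks * freq / 12                 #approximate drinks per month
print(drinks_per_month.describe())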
Data Management and Visualization Week 1: Topic Selection
1) Data set: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)
2) Primary research question: Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
Primary hypothesis: Despite having no history of alcohol abuse or dependence, having a familial history of alcoholism increases the rate of alcohol consumption.
Secondary research question: Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
Secondary hypothesis:  A close relationship with someone with alcohol dependence is more likely to result in an increased rate of alcohol consumption.
3) Literature Review:
Google Scholar Search: “family history alcoholism affecting alcohol consumption”
O’Malley, S. S., et al. Effects of Family Drinking History and Expectancies on Responses to Alcohol in Men. Journal of Studies on Alcohol. 1985. 46: 289-297.
24 men with history of parental alcoholism compared with 24 matched controls without parental alcoholism
Nonproblem drinkers
Those with parental alcoholism reported feeling less drunk than those without after drinking the same amount, though tolerance levels and blood alcohol levels (BAL) were similar.
Cotton, N. S. The familial incidence of alcoholism: a review. Journal of Studies on Alcohol. 1979. 1: 89-116.
Reviewed lots of other studies.
Across the reviewed studies, people from families with a history of alcoholism were more likely to become alcoholics themselves.
 Google Scholar Search: “alcoholic parent alcohol consumption”
 Dube, S. R., et al. Adverse childhood experiences and personal alcohol abuse as an adult. Addictive Behaviors. 2002. 27: 713-725.
Alcohol abuse linked to childhood abuse and family dysfunction
Not much information about adverse childhood experiences linked to later alcohol abuse of individual
Tested 8 different adverse childhood experiences. Experiencing these were all associated with higher incidence of alcohol abuse.
Compared to people with none of these experiences, the association with alcoholism was doubled for those who had one. For those with multiple, it increased four-fold.
Looked at verbal abuse, physical abuse, sexual abuse, battered mother, household substance abuse, mental illness in household, parental separation or divorce, incarcerated household members
Found that specifically for alcoholism, if you had one parent who was an alcoholic, you were much more likely to have an alcohol problem
Less prominent in women and men, but the trend held
Goodwin, D. W., et al. Alcohol problems in adoptees raised apart from alcoholic biological parents. Archives of General Psychiatry. 1973. 28: 238-243.
Only looked at 55 men
Children separated from alcoholic parent.
Still much more likely than the control to develop into alcoholics.
4) Literature Review Summary: 
All the literature indicates that people who had alcoholic parents are more likely to become alcoholics and tend to feel less drunk on the same amount of alcohol compared to a control group. Both men and women were considered in these studies. 
While I am going to try to center my study on those that do not have a reported alcohol dependence or abuse problem, this literature survey suggests that they will likely still be genetically predisposed to drink more.
Needed data outline:
Is there an association of family history of alcoholism to the rate (# drinks/week) of alcohol consumption for people who have never exhibited alcohol abuse or dependence?
·       Hypothesis: Despite having no history or alcohol abuse or dependence, having a familial history with alcoholism increases rate of alcohol consumption.
·       Needed data:
o   Alcohol consumption rate
§  Have had at least 1 drink / 313-313
§  Drink at least 12 drinks in 12 months / 314-314
§  Drink at least 1 drink in 12 months / 315-315
§  Drinking status / 316-316
§  How often drank alcohol in last 12 months / 397-398
§  Number of drinks of alcohol on days when drinking / 399-400
§  Largest number of drinks 12 months / 401-402
§  How often drank large number / 403-404
§  How often drank 5+ / 405-406
§  How often drank 4+ / 407-408
o   Family member was an alcoholic
§  Only considering relatives of previous generation or before. Not siblings or children.
§  Alcoholic dad / 653-653
§  Alcoholic mom / 654-654
§  Alcoholic uncle / 679-679, 689-689
§  Alcoholic aunt / 684-684, 694-694
§  Alcoholic grandpa / 695-695, 697-697
§  Alcoholic grandma / 696-696, 698-698
o   Does this hold true for people who are not currently alcoholics?
§  Alcohol abuse/dependence in last 12 months / 3648-3648
§  Alcohol abuse/dependence prior to last 12 months / 3649-3649
·       Correcting factors:
o   Income (correct primary data set if possible)
§  179-185
§  186-187
§  188-188
§  189-195
§  196-197
§  198-198
§  199-205
§  206-207
§  208-208
o   Age (correct primary data set if possible)
§  71-72
§  73-74
§  75-78
o   Ethnicity (might need to account for ethnicities that drink less)
§  81-81 until 89-90
o   Sex (not sure if this matters)
§  79-79
·       Secondary question:
o   Does the closeness of the relationship affect this correlation? (Parent vs more distant relation, present vs. absent alcoholic parent)
§  Hypothesis: A close relationship with someone with alcohol dependence is more likely to result in an increased rate of alcohol consumption.
§  Focus on those with biological parents since only 3% didn’t live with a  biological parent
§  Needed data:
·       Parental association growing up (Whether one or both parents was present during upbringing):
o   Lived with at least 1 biological parent before age 18. / 94-94
o   Biological father ever lived in household before respondent was 18. / 95-95
o   Did biological or adoptive parents get divorced before 18? / 101-101
o   Age when biological parents stopped living together / 102-103
o   Ever lived with a step-parent before age 18 / 105-105
o   Age when started living with step parent / 106-107
The family-history variables listed earlier cover the rest of the needed data; a rough loading sketch follows below.
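To turn this outline into code, a rough starting point might be to load only the planned columns. The variable names below correspond to the codebook positions listed above (they are the names used in the analysis posts); the age, sex, and income variables would be added once their exact names are confirmed, so treat this as a hypothetical sketch:

import pandas

cols = ['ALCABDEP12DX', 'ALCABDEPP12DX',             #personal abuse/dependence history
        'S2AQ8A', 'S2AQ8B',                           #drinking frequency and usual quantity
        'S2DQ1', 'S2DQ2',                             #alcoholic father / mother
        'S2DQ7C2', 'S2DQ8C2', 'S2DQ9C2', 'S2DQ10C2',  #alcoholic uncles / aunts
        'S2DQ11', 'S2DQ12', 'S2DQ13A', 'S2DQ13B']     #alcoholic grandparents

nesarc = pandas.read_csv('NESARC Data.csv', usecols=cols)
print(nesarc.shape)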