cjdathw1
Cigarettes Smoked vs. Health Conditions
cjdathw1 · 4 years ago
Assignment 1
This assignment tests whether the mean number of cigarettes smoked differs among young adults across five levels of self-perceived health. The data come from the U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC). The response variable “NUMCIGMO_EST” is the estimated number of cigarettes smoked per month by young adults aged 18 to 25 who have smoked in the past 12 months. NUMCIGMO_EST is created in two steps: 1. the smoking-frequency code “S3AQ3B1” is recoded to the number of days smoked in the past month, giving “USFREQMO”; 2. “USFREQMO” is multiplied by “S3AQ3C1”, the number of cigarettes smoked per day. The explanatory variable is the column “S1Q16”, described as “SELF-PERCEIVED CURRENT HEALTH”, with values from 1 (excellent) to 5 (poor). Since the explanatory variable is categorical and the response is quantitative (a C->Q test), I ran an ANOVA using an OLS model to calculate the F-statistic and p-value.
Here is the output from the OLS model:
===================================================================================
Dep. Variable:           NUMCIGMO_EST   R-squared:                       0.031
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     5.001
Date:                Mon, 22 Mar 2021   Prob (F-statistic):           0.000567
Time:                        19:54:15   Log-Likelihood:                -4479.1
No. Observations:                 629   AIC:                             8968.
Df Residuals:                     624   BIC:                             8991.
Df Model:                           4
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         315.4803     22.538     13.997      0.000     271.220     359.741
C(S1Q16)[T.2.0]    15.4845     30.537      0.507      0.612     -44.483      75.452
C(S1Q16)[T.3.0]    89.5654     31.530      2.841      0.005      27.648     151.482
C(S1Q16)[T.4.0]   119.1863     50.173      2.376      0.018      20.658     217.715
C(S1Q16)[T.5.0]   348.8054    115.867      3.010      0.003     121.268     576.342
===================================================================================
Omnibus:                      346.462   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4387.366
Skew:                           2.167   Prob(JB):                         0.00
Kurtosis:                      15.191   Cond. No.                         10.7
===================================================================================
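As a rough sanity check on the summary above, the F-statistic and adjusted R-squared can be recomputed from the reported R-squared and degrees of freedom. This is only a sketch: the inputs are the rounded values printed in the table, and the variable names are my own, so the result will match the reported figures only approximately.

```python
# Values copied from the OLS summary (R-squared is rounded to 3 decimals there).
r_squared = 0.031
n_obs = 629
df_model = 4                      # 5 health levels - 1
df_resid = n_obs - df_model - 1   # 624, matches "Df Residuals"

# F = (explained variance per model df) / (unexplained variance per residual df)
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalises the number of estimated parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"F from rounded R-squared: {f_stat:.2f}")
print(f"Adjusted R-squared:       {adj_r_squared:.3f}")
```

The recomputed F comes out close to the reported 5.001; the small gap is due to R-squared being rounded in the printed summary.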
With an F-statistic of 5.001 and a p-value of 0.000567, I can reject the null hypothesis that all five health groups smoke the same mean number of cigarettes, and conclude that there is an association between the number of cigarettes smoked and self-perceived health. To determine which health levels differ from the others, I performed a post hoc test using Tukey's HSD (Honestly Significant Difference) test, implemented with the “MultiComparison” class from statsmodels and its tukeyhsd method. The following shows the result of the post hoc pairwise comparisons:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05
=========================================================
group1 group2 meandiff  p-adj     lower     upper  reject
---------------------------------------------------------
   1.0    2.0  15.4845    0.9  -68.0574   99.0263   False
   1.0    3.0  89.5654 0.0374    3.3073  175.8234    True
   1.0    4.0 119.1863 0.1236  -18.0761  256.4488   False
   1.0    5.0 348.8054 0.0227   31.8177  665.7931    True
   2.0    3.0  74.0809 0.1025   -8.4764  156.6383   False
   2.0    4.0 103.7019 0.2205  -31.2657  238.6694   False
   2.0    5.0 333.3209 0.0328   17.3202  649.3216    True
   3.0    4.0   29.621    0.9 -107.0445  166.2864   False
   3.0    5.0   259.24 0.1668  -57.4896  575.9697   False
   4.0    5.0  229.619 0.3295 -104.6237  563.8618   False
---------------------------------------------------------
The last column shows which pairs of health levels differ significantly in the mean number of cigarettes smoked: these are the comparisons in which we can reject the null hypothesis, that is, in which reject equals True. Here that is the case for three pairs: level 1 vs. level 3, level 1 vs. level 5, and level 2 vs. level 5.
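The OLS coefficients and the Tukey table describe the same group means in two different ways. With the default treatment (dummy) coding used by C(S1Q16), the intercept is the mean of the reference group (level 1) and each coefficient is that level's difference from level 1. A minimal sketch of that arithmetic, using the rounded coefficients from the summary (the variable names are my own):

```python
# Coefficients copied from the OLS summary; level 1 (excellent) is the reference group.
intercept = 315.4803
offsets = {2: 15.4845, 3: 89.5654, 4: 119.1863, 5: 348.8054}

# With treatment coding, each group mean is the intercept plus that level's coefficient.
group_means = {1: intercept}
group_means.update({level: intercept + coef for level, coef in offsets.items()})

# Pairwise differences (higher level minus lower level), comparable to the
# "meandiff" column of the Tukey table.
pairs = [(a, b) for a in group_means for b in group_means if a < b]
mean_diffs = {(a, b): round(group_means[b] - group_means[a], 4) for a, b in pairs}

for level, mean in group_means.items():
    print(f"health level {level}: mean of about {mean:.1f} cigarettes/month")
for pair, diff in mean_diffs.items():
    print(pair, diff)
```

The differences computed this way agree with the meandiff column up to rounding in the last printed digit, since the coefficients in the summary are themselves rounded.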
Python script for the homework:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('nesarc.csv', low_memory=False)

# convert the smoking variables to numeric
data['S3AQ3B1'] = pandas.to_numeric(data['S3AQ3B1'], errors="coerce")
data['S3AQ3C1'] = pandas.to_numeric(data['S3AQ3C1'], errors="coerce")
data['CHECK321'] = pandas.to_numeric(data['CHECK321'], errors="coerce")

# subset data to young adults age 18 to 25 who have smoked in the past 12 months
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
sub1 = data[(data['AGE'] >= 18) & (data['AGE'] <= 25) & (data['CHECK321'] == 1)].copy()

# setting missing data
sub1['S3AQ3B1'] = sub1['S3AQ3B1'].replace(9, numpy.nan)
sub1['S3AQ3C1'] = sub1['S3AQ3C1'].replace(99, numpy.nan)

# recoding number of days smoked in the past month
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub1['USFREQMO'] = sub1['S3AQ3B1'].map(recode1)

# converting new variable USFREQMO to numeric
sub1['USFREQMO'] = pandas.to_numeric(sub1['USFREQMO'], errors="coerce")

# estimated cigarettes per month = days smoked per month * cigarettes per day
sub1['NUMCIGMO_EST'] = sub1['USFREQMO'] * sub1['S3AQ3C1']
sub1['NUMCIGMO_EST'] = pandas.to_numeric(sub1['NUMCIGMO_EST'], errors="coerce")

# self-perceived health: 9 means unknown, so treat it as missing
sub1['S1Q16'] = sub1['S1Q16'].replace(9, numpy.nan)
sub3 = sub1[['NUMCIGMO_EST', 'S1Q16']].dropna()

# By running an analysis of variance, we're asking whether the number of
# cigarettes smoked differs for different health conditions.
model2 = smf.ols(formula='NUMCIGMO_EST ~ C(S1Q16)', data=sub3).fit()
print(model2.summary())

print('means for numcigmo_est by self-perceived health status')
m2 = sub3.groupby('S1Q16').mean()
print(m2)

print('standard deviations for numcigmo_est by self-perceived health status')
sd2 = sub3.groupby('S1Q16').std()
print(sd2)

# post hoc pairwise comparisons with Tukey's HSD
mc1 = multi.MultiComparison(sub3['NUMCIGMO_EST'], sub3['S1Q16'])
res1 = mc1.tukeyhsd()
print(res1.summary())
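The recoding step in the script can be tried out on a tiny made-up table to see how missing codes propagate. This is only an illustrative sketch: the four rows below are invented and are not NESARC data; only the column names and the recode dictionary come from the script.

```python
import numpy
import pandas

# Hypothetical toy rows (NOT real NESARC data): frequency code and cigarettes/day.
toy = pandas.DataFrame({
    'S3AQ3B1': [1.0, 4.0, 9.0, 6.0],     # 9 = unknown frequency
    'S3AQ3C1': [20.0, 3.0, 10.0, 99.0],  # 99 = unknown quantity
})

# Same missing-data handling and recode as in the script.
toy['S3AQ3B1'] = toy['S3AQ3B1'].replace(9, numpy.nan)
toy['S3AQ3C1'] = toy['S3AQ3C1'].replace(99, numpy.nan)
recode1 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
toy['USFREQMO'] = toy['S3AQ3B1'].map(recode1)

# Days smoked per month times cigarettes per day; NaN in either factor gives NaN.
toy['NUMCIGMO_EST'] = toy['USFREQMO'] * toy['S3AQ3C1']
print(toy)
```

Rows with an unknown frequency or an unknown quantity end up with a missing NUMCIGMO_EST, which is why the script calls dropna() before fitting the model.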