cattaria-blog
Testing a potential moderator
In this task, I’ll try to figure out whether the association between daily levels of cigarette and ethanol consumption (for people who both smoke and drink) depends on a potential moderator: depression. In other words, does the association between daily cigarette and ethanol consumption differ between depressed and non-depressed people?
Here is some information about those variables.
S3AQ3B1 USUAL FREQUENCY WHEN SMOKED CIGARETTES
--------------------------------------
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
--------------------------------------
S3AQ3C1 USUAL QUANTITY WHEN SMOKED CIGARETTES
-------------------------------------
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
-------------------------------------
I am going to multiply these variables to find out monthly cigarette consumption, and then divide by 30 to calculate the average daily level.
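For example (a hypothetical respondent, just to illustrate the arithmetic): a frequency code of 2 (“5 to 6 days a week”) will be recoded as 22 smoking days per month, so smoking 10 cigarettes on each of those days works out to 22 × 10 / 30 ≈ 7.3 cigarettes per day.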
ETOTLCA2
AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
(NOTE: Users may wish to exclude outliers)
---------------------------------------------------------------
0.0003 - 219.9555 Ounces of ethanol
Possible moderator:
MAJORDEP12 MAJOR DEPRESSION IN LAST 12 MONTHS (NON-HIERARCHICAL)
-----------------------------------------------------
39608 0. No
3485 1. Yes
-----------------------------------------------------
Here are my steps:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the variables had missing values coded as 9 and 99. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S3AQ3B1']=data['S3AQ3B1'].replace(9, numpy.nan)
data['S3AQ3C1']=data['S3AQ3C1'].replace(99, numpy.nan)
4) Recoding variables for monthly smoking frequency and writing the new values into a new column
recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
data['USFREQMO']= data['S3AQ3B1'].map(recode1)
5) Deleting NaN values (we can’t calculate correlation with empty values)
# MAJORDEP12 has to be kept here too, since step 8 splits the data on it
data2 = data[['ETOTLCA2','USFREQMO','S3AQ3C1','MAJORDEP12']].dropna().copy()
6) Calculating average daily cigarette consumption
data2['cigsperday'] = data2['USFREQMO']*data2['S3AQ3C1']/30
7) Deleting outliers that lie farther than 3 standard deviations from the mean
data2 = data2[numpy.abs(data2.ETOTLCA2-data2.ETOTLCA2.mean())<=(3*data2.ETOTLCA2.std())]
data2 = data2[numpy.abs(data2.cigsperday-data2.cigsperday.mean())<=(3*data2.cigsperday.std())]
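As a side note, the same 3-standard-deviation rule can be written as a single z-score filter (a sketch using scipy.stats.zscore; unlike the two sequential filters above, it computes both z-scores on the unfiltered data, so the result can differ slightly):
# One-shot alternative: keep rows within 3 SDs of the mean on both columns
from scipy import stats
mask = (numpy.abs(stats.zscore(data2[['ETOTLCA2', 'cigsperday']])) <= 3).all(axis=1)
data2 = data2[mask]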
8) Splitting the data into two parts: for depressed and not depressed people
nodep=data2[(data2['MAJORDEP12']== 0)]
dep=data2[(data2['MAJORDEP12']== 1)]
9) Plotting two subsets on a scatter plot
scat1 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=nodep, label='without depression')
scat2 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=dep, label='with depression')
plt.legend()
plt.xlabel('Cigarettes per day')
plt.ylabel('Ethanol (ounces)')
plt.title('Association between daily cigarette and ethanol consumption')
Looking at the plot, the distributions of answers for the two groups look quite similar, and the trend lines are almost identical as well. But to back this up, we need to calculate Pearson’s correlation coefficient for each group.
10) Calculating and printing Pearson’s r.
print ('association between daily cigarette and ethanol consumption')
print ('without depression',scipy.stats.pearsonr(nodep['cigsperday'], nodep['ETOTLCA2']))
print ('with depression',scipy.stats.pearsonr(dep['cigsperday'], dep['ETOTLCA2']))
association between daily cigarette and ethanol consumption
without depression (0.08127268338354372, 1.1026806253816316e-17)
with depression (0.060290856780329985, 0.031612603167820254)
We can see that both associations are statistically significant (p<0.05) but weak: neither Pearson’s coefficient exceeds 0.09, which is quite close to zero. Moreover, the two coefficients are similar to each other, so depression does not appear to moderate the relationship: for both depressed and non-depressed people, there is almost no connection between daily cigarette and ethanol consumption.
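An alternative, more formal way to test moderation (a sketch, not part of the assignment) is to fit a single regression with an interaction term; the interaction coefficient directly tests whether the slope differs between the two groups:
import statsmodels.formula.api as smf
# Sketch: the cigsperday:C(MAJORDEP12)[T.1] row of the summary tests the moderation
model = smf.ols("ETOTLCA2 ~ cigsperday * C(MAJORDEP12)", data=data2).fit()
print(model.summary())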
Pearson’s correlation coefficient
In this task, I’ll try to figure out whether there is any connection between daily levels of cigarette and ethanol consumption (for those people who both smoke and drink).
Here is some information about those variables.
S3AQ3B1 USUAL FREQUENCY WHEN SMOKED CIGARETTES
--------------------------------------
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
--------------------------------------
S3AQ3C1 USUAL QUANTITY WHEN SMOKED CIGARETTES
-------------------------------------
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
-------------------------------------
I am going to multiply these variables to find out monthly cigarette consumption, and then divide by 30 to calculate the average daily level.
ETOTLCA2
AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
(NOTE: Users may wish to exclude outliers)
---------------------------------------------------------------
0.0003 - 219.9555 Ounces of ethanol
To find out if there’s a linear connection between these variables, I will need to calculate Pearson’s r coefficient. It will also be useful to plot the data on a scatter plot. Here are the steps:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the variables had missing values coded as 9 and 99. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S3AQ3B1']=data['S3AQ3B1'].replace(9, numpy.nan)
data['S3AQ3C1']=data['S3AQ3C1'].replace(99, numpy.nan)
4) Recoding variables for monthly smoking frequency and writing the new values into a new column
recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
data['USFREQMO']= data['S3AQ3B1'].map(recode1)
5) Deleting NaN values (we can’t calculate correlation with empty values)
data2 = data[['ETOTLCA2','USFREQMO','S3AQ3C1']].dropna().copy()  # .copy() avoids a SettingWithCopyWarning in step 6
6) Calculating average daily cigarette consumption
data2['cigsperday'] = data2['USFREQMO']*data2['S3AQ3C1']/30
7) Plotting points on a scatter plot
scat1 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=data2)
plt.xlabel('Cigarettes per day')
plt.ylabel('Ethanol (ounces)')
plt.title('Association between daily cigarette and ethanol consumption')
It is hard to tell whether there is any kind of correlation between smoking and drinking levels or not, so let’s calculate Pearson’s correlation coefficient.
8) Calculating and printing Pearson’s r.
print ('association between daily cigarette and ethanol consumption')
print (scipy.stats.pearsonr(data2['cigsperday'], data2['ETOTLCA2']))
association between daily cigarette and ethanol consumption
(0.05521844230892321, 5.232596522904821e-10)
Even though the association is statistically significant (p<0.0001), the connection is very weak at 0.055, which is quite close to zero. So we can say there is almost no connection between daily cigarette and ethanol consumption.
Additionally, we can calculate R-squared. It tells us the proportion of variability in one variable that can be predicted from the other.
(scipy.stats.pearsonr(data2['cigsperday'], data2['ETOTLCA2'])[0])**2
0.0030490763710238813
R-squared here is about 0.3%, which is very small: almost none of the variability in the response variable can be explained by the explanatory variable, which further supports the conclusion that there is little to no association between the variables.
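As a side note, the first element of the pearsonr tuple is easy to reproduce by hand. Here is a minimal numpy sketch of the formula, assuming the data2 frame from the steps above:
def pearson_r(x, y):
    # r = sum of products of centered values, normalized by the product of their norms
    xm = x - numpy.mean(x)
    ym = y - numpy.mean(y)
    return (xm * ym).sum() / numpy.sqrt((xm ** 2).sum() * (ym ** 2).sum())

print(pearson_r(data2['cigsperday'].values, data2['ETOTLCA2'].values))  # ≈ 0.0552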
Chi-square test & post-hoc
In this task, I’ll try to find out if a connection exists between a person’s nicotine dependence and their mood (how often they felt downhearted/depressed during the 4 weeks prior to the survey).
Here is some information about those variables.
S1Q213 DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED
-------------------------------------------------------------
907 1. All of the time
2051 2. Most of the time
6286 3. Some of the time
11305 4. A little of the time
22127 5. None of the time
417 9. Unknown
--------------------------------------------------------------
TAB12MDX NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
-----------------------------------------
38131 0. No nicotine dependence
4962 1. Nicotine dependence
-----------------------------------------
The null hypothesis is that there is no association between these two variables, which means that depression of a person and nicotine dependence are not connected in any way.
An alternative hypothesis is that there is some kind of a connection between these variables.
To test it, I’m going to do a chi-square test and later a post-hoc test.
Here’s what I did:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the explanatory variable (S1Q213, DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED) had missing values coded as 9. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S1Q213'] = data['S1Q213'].replace(9, numpy.nan)
4) Deleting rows with NaNs present in columns related to variables that I’ll use.
data2 = data[['TAB12MDX','S1Q213']].dropna().copy()  # .copy() so the PAIRWISECOMP column can be added later
5) Creating a contingency table of observed counts and printing it
ct1=pandas.crosstab(data2['TAB12MDX'], data2['S1Q213'])
print (ct1)
S1Q213 1.0 2.0 3.0 4.0 5.0
TAB12MDX
0 695 1613 5275 9834 20306
1 212 438 1011 1471 1821
6) Calculating column percentages and printing them
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
S1Q213 1.0 2.0 3.0 4.0 5.0
TAB12MDX
0 0.766262 0.786446 0.839166 0.869881 0.917702
1 0.233738 0.213554 0.160834 0.130119 0.082298
7) Calculating chi-square value, p-value and expected counts
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
chi-square value, p value, expected counts
(702.9278944655605, 8.09583837665881e-151, 4, array([[ 801.73308183, 1812.95981348, 5556.44338738, 9992.93549067, 19558.92822664], [ 105.26691817, 238.04018652, 729.55661262, 1312.06450933, 2568.07177336]]))
We can see that the p-value is less than 0.05, so we conclude that there is an association between severity of depressive mood and nicotine dependence. To figure out which categories differ, I’ll run pairwise chi-square tests for all pairs of categories. To keep the familywise error rate under control, I’m going to use the Bonferroni adjustment: I will compare p-values not with the original 0.05 threshold, but with 0.05 divided by the number of comparisons, which is 10.
So p must be less than 0.005 for a pairwise result to be significant.
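For reference, the number of pairwise comparisons and the adjusted threshold can be computed directly (a small sketch):
n_categories = 5
n_comparisons = n_categories * (n_categories - 1) // 2  # C(5, 2) = 10 pairs
alpha_adjusted = 0.05 / n_comparisons                    # 0.005
print(n_comparisons, alpha_adjusted)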
8) Finding number of categories for depressive state (here it would be 5).
n = data2.S1Q213.nunique()
9) Instead of writing each test by hand, I use a for-loop that goes over every combination of categories. The condition i < j ensures that no pair is repeated.
for i in range(1, n+1):
    for j in range(1, n+1):
        if i < j:
            # keep only the two categories currently being compared
            recode = {i: i, j: j}
            data2['PAIRWISECOMP'] = data2['S1Q213'].map(recode)
            ct2 = pandas.crosstab(data2['TAB12MDX'], data2['PAIRWISECOMP'])
            colsum = ct2.sum(axis=0)
            colpct = ct2/colsum
            print(colpct)
            # note: the test must use the pairwise table ct2, not the full table ct1
            cs2 = scipy.stats.chi2_contingency(ct2)
            print('chi-square value:', cs2[0])
            print('p-value', cs2[1], "\n")
PAIRWISECOMP 1.0 2.0
TAB12MDX
0 0.766262 0.786446
1 0.233738 0.213554
chi-square value: 1.3787835738477932
p-value 0.24030845764454745
PAIRWISECOMP 1.0 3.0
TAB12MDX
0 0.766262 0.839166
1 0.233738 0.160834
chi-square value: 29.339003768193678
p-value 6.076032277226904e-08
PAIRWISECOMP 1.0 4.0
TAB12MDX
0 0.766262 0.869881
1 0.233738 0.130119
chi-square value: 74.99963145388048
p-value 4.7080193470524695e-18
PAIRWISECOMP 1.0 5.0
TAB12MDX
0 0.766262 0.917702
1 0.233738 0.082298
chi-square value: 246.4365273816259
p-value 1.5535697740039353e-55
PAIRWISECOMP 2.0 3.0
TAB12MDX
0 0.786446 0.839166
1 0.213554 0.160834
chi-square value: 29.567080989009575
p-value 5.401454388734273e-08
PAIRWISECOMP 2.0 4.0
TAB12MDX
0 0.786446 0.869881
1 0.213554 0.130119
chi-square value: 97.97325205524267
p-value 4.2407233316281255e-23
PAIRWISECOMP 2.0 5.0
TAB12MDX
0 0.786446 0.917702
1 0.213554 0.082298
chi-square value: 380.2332811381517
p-value 1.1070629481782638e-84
PAIRWISECOMP 3.0 4.0
TAB12MDX
0 0.839166 0.869881
1 0.160834 0.130119
chi-square value: 31.19382508236924
p-value 2.3350766378586633e-08
PAIRWISECOMP 3.0 5.0
TAB12MDX
0 0.839166 0.917702
1 0.160834 0.082298
chi-square value: 335.5906941492885
p-value 5.823139114082824e-75
PAIRWISECOMP 4.0 5.0
TAB12MDX
0 0.869881 0.917702
1 0.130119 0.082298
chi-square value: 192.21580637739942
p-value 1.043957307193602e-43
Almost all of the pairs are significantly different. The only pair that is not is 1 and 2: people who felt depressed all of the time (1) and most of the time (2). All of the other groups are statistically different from each other. To see exactly how they differ, let’s plot the values.
seaborn.catplot(x="S1Q213", y="TAB12MDX", data=data2, kind='bar', ci=None)
plt.xlabel('Severity of depressive thoughts (from most to least)')
plt.ylabel('Nicotine dependence')
We can see a clear downward trend in these data, which means that people who felt depressed more often are more likely to have a nicotine dependence. The two most depressed categories, however, are statistically similar.
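The same downward trend can be read off numerically: since TAB12MDX is coded 0/1, the group mean equals the share of nicotine-dependent respondents in each mood category (a quick sketch using the same data2 frame):
print(data2.groupby('S1Q213')['TAB12MDX'].mean())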
ANOVA & post hoc
In this task, I’ll try to find out if a connection exists between a person’s average daily ethanol consumption (in ounces) and their mood (how often they felt downhearted/depressed during the 4 weeks prior to the survey).
Here is some information about those variables.
Explanatory variable: categorical, 5 categories (unknown values will be deleted since they do not give any useful information).
S1Q213 DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED
1. All of the time
2. Most of the time
3. Some of the time
4. A little of the time
5. None of the time
9. Unknown
Response variable: quantitative (blanks will be deleted later).
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
0.0003 - 219.9555 Ounces of ethanol
Blank Unknown
Let’s formulate our null hypothesis first: there is no relation between the presence (and severity) of depression and drinking habits, i.e. all group means are equal.
An alternative hypothesis would be that there is some kind of a relation between daily ethanol consumption and depression.
To test the hypothesis, I conducted an Analysis of Variance with Python.
In order to do that, I did the following:
1) Importing necessary libraries
import numpy
import pandas
import statsmodels.formula.api as smf
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the explanatory variable (S1Q213, DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED) had missing values coded as 9. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S1Q213'] = data['S1Q213'].replace(9, numpy.nan)
4) Deleting rows with NaNs present in columns related to variables that I’ll use.
data2 = data[['ETOTLCA2','S1Q213']].dropna()
5) Creating an ordinary least squares regression model to obtain the F-statistic and p-value, then fitting it and printing the model summary.
model2 = smf.ols(formula = "ETOTLCA2 ~ C(S1Q213)", data = data2)
results2=model2.fit()
print(results2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ETOTLCA2 R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 20.23
Date: Sun, 21 Oct 2018 Prob (F-statistic): 1.18e-16
Time: 02:11:35 Log-Likelihood: -58890.
No. Observations: 26598 AIC: 1.178e+05
Df Residuals: 26593 BIC: 1.178e+05
Df Model: 4
Covariance Type: nonrobust
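As an optional sanity check (a sketch, assuming data2 as built above), the same one-way ANOVA can be run with scipy.stats.f_oneway, which should reproduce the regression’s F-statistic and p-value:
import scipy.stats
# One array of ETOTLCA2 values per S1Q213 category
groups = [g['ETOTLCA2'].values for _, g in data2.groupby('S1Q213')]
print(scipy.stats.f_oneway(*groups))  # should match F=20.23, p≈1.18e-16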
Here’s what we need: F=20.23 and p=1.18e-16. The p-value is much less than 0.05, so we reject our null hypothesis that there is no relation between the severity of a person’s depression (or lack of it) and the amount of ethanol consumed daily. It appears that there is some association between these two variables. To find out which categories differ, I’ll do some post hoc comparisons.
1) Importing necessary tools:
import statsmodels.stats.multicomp as multi
2) Running Tukey's HSD (Honestly Significant Difference) test
mc = multi.MultiComparison(data2['ETOTLCA2'],data2['S1Q213'])
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff lower upper reject
---------------------------------------------
1.0 2.0 0.0092 -0.3321 0.3506 False
1.0 3.0 -0.3533 -0.6595 -0.0471 True
1.0 4.0 -0.4278 -0.7254 -0.1301 True
1.0 5.0 -0.5062 -0.8006 -0.2119 True
2.0 3.0 -0.3625 -0.5685 -0.1565 True
2.0 4.0 -0.437 -0.6301 -0.2439 True
2.0 5.0 -0.5155 -0.7033 -0.3276 True
3.0 4.0 -0.0745 -0.195 0.046 False
3.0 5.0 -0.153 -0.2649 -0.041 True
4.0 5.0 -0.0785 -0.1644 0.0074 False
---------------------------------------------
This means that significant differences are present between categories 1 and 3, 4, 5; 2 and 3, 4, 5; and 3 and 5.
What the categories mean:
1. All of the time
2. Most of the time
3. Some of the time
4. A little of the time
5. None of the time
It appears that there is no significant difference between the alcohol consumption of people who were depressed constantly and most of the time. Likewise, those who were sad a little of the time and those not depressed at all are likely to drink about the same amount, and the comparison between “some of the time” and “a little of the time” (3 vs 4) was not significant either.
All the other pairs of means were significantly different. This means, for example, that the drinking habits of a person who was depressed all the time are statistically different from the habits of people who were sad some of the time or less. To find out how large these differences are, let’s compare the group means.
m1 = data2.groupby('S1Q213').mean()
print(m1)
sd1 = data2.groupby('S1Q213').std()
print(sd1)
Means and standard deviations of ETOTLCA2 by S1Q213:
S1Q213    mean       std
1.0       0.998476   2.595513
2.0       1.007703   4.989636
3.0       0.645213   1.683092
4.0       0.570721   2.868919
5.0       0.492227   1.359950
As we can see, the more often people felt depressed during the 4 weeks prior to the survey, the more they tend to drink daily on average (categories 1 and 2 are nearly equal).
Summary:
ANOVA revealed that among people who consume ethanol, severity of their depression (or lack of it) (split into 5 ordered categories, which is the categorical explanatory variable) and amount of ethanol consumed daily on average (quantitative response variable) were significantly associated, F=20.23, p<0.0001.
Post hoc comparisons of mean ethanol consumption across pairs of depression-severity categories revealed that people who are more depressed tend to drink more. However, there was no significant difference in drinking levels between people depressed most of the time and all of the time, and the same holds for the two least depressed categories. So the number of categories could probably be reduced to three: very depressed, moderately depressed, and slightly or not depressed. Significant differences were found between these broader groups (very depressed people drink more than moderately depressed, who in turn drink more than the slightly or not depressed), while no statistically significant differences in means were found within them. One caveat: the “some of the time” vs “a little of the time” comparison was also non-significant, so the boundary between the moderate and low groups is somewhat fuzzy.
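As a sketch of the three-group idea above (the group labels are my own, not NESARC codes; smf is the statsmodels alias imported earlier), the categories could be collapsed and the ANOVA re-run like this:
# Hypothetical recode: 1-2 -> 'high', 3 -> 'moderate', 4-5 -> 'low/none'
collapse = {1: 'high', 2: 'high', 3: 'moderate', 4: 'low/none', 5: 'low/none'}
data2 = data2.copy()
data2['DEP3'] = data2['S1Q213'].map(collapse)
model3 = smf.ols("ETOTLCA2 ~ C(DEP3)", data=data2).fit()
print(model3.summary())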