cattaria-blog
Testing a potential moderator
In this task, I’ll try to figure out whether the association between daily levels of cigarette and ethanol consumption (for people who both smoke and drink) depends on a potential moderator: depression. In other words, does the association between daily cigarette and ethanol consumption differ between depressed and non-depressed people?
Here is some information about those variables.
S3AQ3B1 USUAL FREQUENCY WHEN SMOKED CIGARETTES
--------------------------------------
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
--------------------------------------
S3AQ3C1 USUAL QUANTITY WHEN SMOKED CIGARETTES
-------------------------------------
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
-------------------------------------
I am going to multiply these variables to find out monthly cigarette consumption, and then divide by 30 to calculate the average daily level.
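For example (a hypothetical respondent, just to illustrate the arithmetic): a frequency code of 2 (“5 to 6 days a week”) will be recoded as 22 smoking days per month, so smoking 10 cigarettes on each of those days works out to 22 × 10 / 30 ≈ 7.3 cigarettes per day.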
ETOTLCA2
AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
(NOTE: Users may wish to exclude outliers)
---------------------------------------------------------------
0.0003 - 219.9555 Ounces of ethanol
Possible moderator:
MAJORDEP12 MAJOR DEPRESSION IN LAST 12 MONTHS (NON-HIERARCHICAL)
-----------------------------------------------------
39608 0. No
3485 1. Yes
-----------------------------------------------------
Here are my steps:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the variables had missing values coded as 9 and 99. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S3AQ3B1']=data['S3AQ3B1'].replace(9, numpy.nan)
data['S3AQ3C1']=data['S3AQ3C1'].replace(99, numpy.nan)
4) Recoding variables for monthly smoking frequency and writing the new values into a new column
recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
data['USFREQMO']= data['S3AQ3B1'].map(recode1)
5) Deleting NaN values (we can’t calculate correlation with empty values)
# MAJORDEP12 has to be kept here too, since step 8 splits the data on it
data2 = data[['ETOTLCA2','USFREQMO','S3AQ3C1','MAJORDEP12']].dropna().copy()
6) Calculating average daily cigarette consumption
data2['cigsperday'] = data2['USFREQMO']*data2['S3AQ3C1']/30
7) Deleting outliers that lie farther than 3 standard deviations from the mean
data2 = data2[numpy.abs(data2.ETOTLCA2-data2.ETOTLCA2.mean())<=(3*data2.ETOTLCA2.std())]
data2 = data2[numpy.abs(data2.cigsperday-data2.cigsperday.mean())<=(3*data2.cigsperday.std())]
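As a side note, the same 3-standard-deviation rule can be written as a single z-score filter (a sketch using scipy.stats.zscore; unlike the two sequential filters above, it computes both z-scores on the unfiltered data, so the result can differ slightly):
# One-shot alternative: keep rows within 3 SDs of the mean on both columns
from scipy import stats
mask = (numpy.abs(stats.zscore(data2[['ETOTLCA2', 'cigsperday']])) <= 3).all(axis=1)
data2 = data2[mask]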
8) Splitting the data into two parts: for depressed and not depressed people
nodep=data2[(data2['MAJORDEP12']== 0)]
dep=data2[(data2['MAJORDEP12']== 1)]
9) Plotting two subsets on a scatter plot
scat1 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=nodep, label='without depression')
scat2 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=dep, label='with depression')
plt.legend()
plt.xlabel('Cigarettes per day')
plt.ylabel('Ethanol (ounces)')
plt.title('Association between daily cigarette and ethanol consumption')
Looking at the plot, the distributions of answers for the two groups look quite similar, and the trend lines are almost identical as well. But to back this up, we need to calculate Pearson’s correlation coefficient for each group.
10) Calculating and printing Pearson’s r.
print ('association between daily cigarette and ethanol consumption')
print ('without depression',scipy.stats.pearsonr(nodep['cigsperday'], nodep['ETOTLCA2']))
print ('with depression',scipy.stats.pearsonr(dep['cigsperday'], dep['ETOTLCA2']))
association between daily cigarette and ethanol consumption
without depression (0.08127268338354372, 1.1026806253816316e-17)
with depression (0.060290856780329985, 0.031612603167820254)
We can see that both associations are statistically significant (p<0.05) but weak: neither Pearson’s coefficient exceeds 0.09, which is quite close to zero. Moreover, the two coefficients are similar to each other, so depression does not appear to moderate the relationship: for both depressed and non-depressed people, there is almost no connection between daily cigarette and ethanol consumption.
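An alternative, more formal way to test moderation (a sketch, not part of the assignment) is to fit a single regression with an interaction term; the interaction coefficient directly tests whether the slope differs between the two groups:
import statsmodels.formula.api as smf
# Sketch: the cigsperday:C(MAJORDEP12)[T.1] row of the summary tests the moderation
model = smf.ols("ETOTLCA2 ~ cigsperday * C(MAJORDEP12)", data=data2).fit()
print(model.summary())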
Pearson’s correlation coefficient
In this task, I’ll try to figure out whether there is any connection between daily levels of cigarette and ethanol consumption (for those people who both smoke and drink).
Here is some information about those variables.
S3AQ3B1 USUAL FREQUENCY WHEN SMOKED CIGARETTES
--------------------------------------
14836 1. Every day
460 2. 5 to 6 Day(s) a week
687 3. 3 to 4 Day(s) a week
747 4. 1 to 2 Day(s) a week
409 5. 2 to 3 Day(s) a month
772 6. Once a month or less
102 9. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
--------------------------------------
S3AQ3C1 USUAL QUANTITY WHEN SMOKED CIGARETTES
-------------------------------------
17751 1-98. Cigarette(s)
262 99. Unknown
25080 BL. NA, never or unknown if ever smoked 100+ cigarettes
-------------------------------------
I am going to multiply these variables to find out monthly cigarette consumption, and then divide by 30 to calculate the average daily level.
ETOTLCA2
AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
(NOTE: Users may wish to exclude outliers)
---------------------------------------------------------------
0.0003 - 219.9555 Ounces of ethanol
To find out if there’s a linear connection between these variables, I will need to calculate Pearson’s r coefficient. It will also be useful to plot the data on a scatter plot. Here are the steps:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the variables had missing values coded as 9 and 99. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S3AQ3B1']=data['S3AQ3B1'].replace(9, numpy.nan)
data['S3AQ3C1']=data['S3AQ3C1'].replace(99, numpy.nan)
4) Recoding variables for monthly smoking frequency and writing the new values into a new column
recode1 = {1: 30, 2: 22, 3: 14, 4: 6, 5: 2.5, 6: 1}
data['USFREQMO']= data['S3AQ3B1'].map(recode1)
5) Deleting NaN values (we can’t calculate correlation with empty values)
data2 = data[['ETOTLCA2','USFREQMO','S3AQ3C1']].dropna().copy()  # .copy() avoids a SettingWithCopyWarning in step 6
6) Calculating average daily cigarette consumption
data2['cigsperday'] = data2['USFREQMO']*data2['S3AQ3C1']/30
7) Plotting points on a scatter plot
scat1 = seaborn.regplot(x="cigsperday", y="ETOTLCA2", fit_reg=True, data=data2)
plt.xlabel('Cigarettes per day')
plt.ylabel('Ethanol (ounces)')
plt.title('Association between daily cigarette and ethanol consumption')
It is hard to tell whether there is any kind of correlation between smoking and drinking levels or not, so let’s calculate Pearson’s correlation coefficient.
8) Calculating and printing Pearson’s r.
print ('association between daily cigarette and ethanol consumption')
print (scipy.stats.pearsonr(data2['cigsperday'], data2['ETOTLCA2']))
association between daily cigarette and ethanol consumption
(0.05521844230892321, 5.232596522904821e-10)
Even though the association is statistically significant (p<0.0001), the connection is very weak at 0.055, which is quite close to zero. So we can say there is almost no connection between daily cigarette and ethanol consumption.
Additionally, we can calculate R-squared. It tells us the proportion of variability in one variable that can be predicted from the other.
(scipy.stats.pearsonr(data2['cigsperday'], data2['ETOTLCA2'])[0])**2
0.0030490763710238813
R-squared here is about 0.3%, which is very small: almost none of the variability in the response variable can be explained by the explanatory variable, which further supports the conclusion that there is little to no association between the variables.
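As a side note, the first element of the pearsonr tuple is easy to reproduce by hand. Here is a minimal numpy sketch of the formula, assuming the data2 frame from the steps above:
def pearson_r(x, y):
    # r = sum of products of centered values, normalized by the product of their norms
    xm = x - numpy.mean(x)
    ym = y - numpy.mean(y)
    return (xm * ym).sum() / numpy.sqrt((xm ** 2).sum() * (ym ** 2).sum())

print(pearson_r(data2['cigsperday'].values, data2['ETOTLCA2'].values))  # ≈ 0.0552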
Chi-square test & post-hoc
In this task, I’ll try to find out if a connection exists between a person’s nicotine dependence and their mood (how often they felt downhearted/depressed during the 4 weeks prior to the survey).
Here is some information about those variables.
S1Q213 DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED
-------------------------------------------------------------
907 1. All of the time
2051 2. Most of the time
6286 3. Some of the time
11305 4. A little of the time
22127 5. None of the time
417 9. Unknown
--------------------------------------------------------------
TAB12MDX NICOTINE DEPENDENCE IN THE LAST 12 MONTHS
-----------------------------------------
38131 0. No nicotine dependence
4962 1. Nicotine dependence
-----------------------------------------
The null hypothesis is that there is no association between these two variables, which means that depression of a person and nicotine dependence are not connected in any way.
An alternative hypothesis is that there is some kind of a connection between these variables.
To test it, I’m going to do a chi-square test and later a post-hoc test.
Here’s what I did:
1) Importing necessary libraries
import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the explanatory variable (S1Q213, DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED) had missing values coded as 9. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S1Q213'] = data['S1Q213'].replace(9, numpy.nan)
4) Deleting rows with NaNs present in columns related to variables that I’ll use.
data2 = data[['TAB12MDX','S1Q213']].dropna().copy()  # .copy() so the PAIRWISECOMP column can be added later
5) Creating a contingency table of observed counts and printing it
ct1=pandas.crosstab(data2['TAB12MDX'], data2['S1Q213'])
print (ct1)
S1Q213 1.0 2.0 3.0 4.0 5.0
TAB12MDX
0 695 1613 5275 9834 20306
1 212 438 1011 1471 1821
6) Calculating column percentages and printing them
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
S1Q213 1.0 2.0 3.0 4.0 5.0
TAB12MDX
0 0.766262 0.786446 0.839166 0.869881 0.917702
1 0.233738 0.213554 0.160834 0.130119 0.082298
7) Calculating chi-square value, p-value and expected counts
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
chi-square value, p value, expected counts
(702.9278944655605, 8.09583837665881e-151, 4, array([[ 801.73308183, 1812.95981348, 5556.44338738, 9992.93549067, 19558.92822664], [ 105.26691817, 238.04018652, 729.55661262, 1312.06450933, 2568.07177336]]))
We can see that the p-value is less than 0.05, so we conclude that there is an association between severity of depressive mood and nicotine dependence. To figure out which categories differ, I’ll run pairwise chi-square tests for all pairs of categories. To keep the familywise error rate under control, I’m going to use the Bonferroni adjustment: I will compare p-values not with the original 0.05 threshold, but with 0.05 divided by the number of comparisons, which is 10.
So p must be less than 0.005 for a pairwise result to be significant.
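For reference, the number of pairwise comparisons and the adjusted threshold can be computed directly (a small sketch):
n_categories = 5
n_comparisons = n_categories * (n_categories - 1) // 2  # C(5, 2) = 10 pairs
alpha_adjusted = 0.05 / n_comparisons                    # 0.005
print(n_comparisons, alpha_adjusted)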
8) Finding number of categories for depressive state (here it would be 5).
n = data2.S1Q213.nunique()
9) Instead of writing each test by hand, I use a for-loop that goes over every combination of categories. The condition i < j ensures that no pair is repeated.
for i in range(1, n+1):
    for j in range(1, n+1):
        if i < j:
            # keep only the two categories currently being compared
            recode = {i: i, j: j}
            data2['PAIRWISECOMP'] = data2['S1Q213'].map(recode)
            ct2 = pandas.crosstab(data2['TAB12MDX'], data2['PAIRWISECOMP'])
            colsum = ct2.sum(axis=0)
            colpct = ct2/colsum
            print(colpct)
            # note: the test must use the pairwise table ct2, not the full table ct1
            cs2 = scipy.stats.chi2_contingency(ct2)
            print('chi-square value:', cs2[0])
            print('p-value', cs2[1], "\n")
PAIRWISECOMP 1.0 2.0
TAB12MDX
0 0.766262 0.786446
1 0.233738 0.213554
chi-square value: 1.3787835738477932
p-value 0.24030845764454745
PAIRWISECOMP 1.0 3.0
TAB12MDX
0 0.766262 0.839166
1 0.233738 0.160834
chi-square value: 29.339003768193678
p-value 6.076032277226904e-08
PAIRWISECOMP 1.0 4.0
TAB12MDX
0 0.766262 0.869881
1 0.233738 0.130119
chi-square value: 74.99963145388048
p-value 4.7080193470524695e-18
PAIRWISECOMP 1.0 5.0
TAB12MDX
0 0.766262 0.917702
1 0.233738 0.082298
chi-square value: 246.4365273816259
p-value 1.5535697740039353e-55
PAIRWISECOMP 2.0 3.0
TAB12MDX
0 0.786446 0.839166
1 0.213554 0.160834
chi-square value: 29.567080989009575
p-value 5.401454388734273e-08
PAIRWISECOMP 2.0 4.0
TAB12MDX
0 0.786446 0.869881
1 0.213554 0.130119
chi-square value: 97.97325205524267
p-value 4.2407233316281255e-23
PAIRWISECOMP 2.0 5.0
TAB12MDX
0 0.786446 0.917702
1 0.213554 0.082298
chi-square value: 380.2332811381517
p-value 1.1070629481782638e-84
PAIRWISECOMP 3.0 4.0
TAB12MDX
0 0.839166 0.869881
1 0.160834 0.130119
chi-square value: 31.19382508236924
p-value 2.3350766378586633e-08
PAIRWISECOMP 3.0 5.0
TAB12MDX
0 0.839166 0.917702
1 0.160834 0.082298
chi-square value: 335.5906941492885
p-value 5.823139114082824e-75
PAIRWISECOMP 4.0 5.0
TAB12MDX
0 0.869881 0.917702
1 0.130119 0.082298
chi-square value: 192.21580637739942
p-value 1.043957307193602e-43
Almost all of the pairs are significantly different. The only pair that is not is 1 and 2: people who felt depressed all of the time (1) and most of the time (2). All of the other groups are statistically different from each other. To see exactly how they differ, let’s plot the values.
seaborn.catplot(x="S1Q213", y="TAB12MDX", data=data2, kind='bar', ci=None)
plt.xlabel('Severity of depressive thoughts (from most to least)')
plt.ylabel('Nicotine dependence')
We can see a clear downward trend in these data, which means that people who felt depressed more often are more likely to have a nicotine dependence. The two most depressed categories, however, are statistically similar.
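The same downward trend can be read off numerically: since TAB12MDX is coded 0/1, the group mean equals the share of nicotine-dependent respondents in each mood category (a quick sketch using the same data2 frame):
print(data2.groupby('S1Q213')['TAB12MDX'].mean())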
ANOVA & post hoc
In this task, I’ll try to find out if a connection exists between a person’s average daily ethanol consumption (in ounces) and their mood (how often they felt downhearted/depressed during the 4 weeks prior to the survey).
Here is some information about those variables.
Explanatory variable: categorical, 5 categories (unknown values will be deleted since they do not give any useful information).
S1Q213 DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED
1. All of the time
2. Most of the time
3. Some of the time
4. A little of the time
5. None of the time
9. Unknown
Response variable: quantitative (blanks will be deleted later).
ETOTLCA2 AVERAGE DAILY VOLUME OF ETHANOL CONSUMED IN PAST YEAR, FROM ALL
TYPES OF ALCOHOLIC BEVERAGES COMBINED
0.0003 - 219.9555 Ounces of ethanol
Blank Unknown
Let’s formulate our null hypothesis first: there is no relation between the presence (and severity) of depression and drinking habits, i.e. all group means are equal.
An alternative hypothesis would be that there is some kind of a relation between daily ethanol consumption and depression.
To test the hypothesis, I conducted an Analysis of Variance with Python.
In order to do that, I did the following:
1) Importing necessary libraries
import numpy
import pandas
import statsmodels.formula.api as smf
2) Read the csv file of the dataset and fix the numeric datatypes
data = pandas.read_csv("nesarc.csv", low_memory=False)
data = data.apply(pandas.to_numeric, errors='coerce')
3) After examining the data, I found out that the explanatory variable (S1Q213, DURING PAST 4 WEEKS, HOW OFTEN FELT DOWNHEARTED AND DEPRESSED) had missing values coded as 9. These codes do not bear any meaning and thus are useless for the analysis, so I replaced them with NaNs to delete them later.
data['S1Q213'] = data['S1Q213'].replace(9, numpy.nan)
4) Deleting rows with NaNs present in columns related to variables that I’ll use.
data2 = data[['ETOTLCA2','S1Q213']].dropna()
5) Creating an ordinary least squares regression model to obtain the F-statistic and p-value, then fitting it and printing the model summary.
model2 = smf.ols(formula = "ETOTLCA2 ~ C(S1Q213)", data = data2)
results2=model2.fit()
print(results2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ETOTLCA2 R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 20.23
Date: Sun, 21 Oct 2018 Prob (F-statistic): 1.18e-16
Time: 02:11:35 Log-Likelihood: -58890.
No. Observations: 26598 AIC: 1.178e+05
Df Residuals: 26593 BIC: 1.178e+05
Df Model: 4
Covariance Type: nonrobust
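As an optional sanity check (a sketch, assuming data2 as built above), the same one-way ANOVA can be run with scipy.stats.f_oneway, which should reproduce the regression’s F-statistic and p-value:
import scipy.stats
# One array of ETOTLCA2 values per S1Q213 category
groups = [g['ETOTLCA2'].values for _, g in data2.groupby('S1Q213')]
print(scipy.stats.f_oneway(*groups))  # should match F=20.23, p≈1.18e-16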
Here’s what we need: F=20.23 and p=1.18e-16. The p-value is much less than 0.05, so we reject our null hypothesis that there is no relation between the severity of a person’s depression (or lack of it) and the amount of ethanol consumed daily. It appears that there is some association between these two variables. To find out which categories differ, I’ll do some post hoc comparisons.
1) Importing necessary tools:
import statsmodels.stats.multicomp as multi
2) Running Tukey's HSD (Honestly Significant Difference) test
mc = multi.MultiComparison(data2['ETOTLCA2'],data2['S1Q213'])
res = mc.tukeyhsd()
print(res.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff lower upper reject
---------------------------------------------
1.0 2.0 0.0092 -0.3321 0.3506 False
1.0 3.0 -0.3533 -0.6595 -0.0471 True
1.0 4.0 -0.4278 -0.7254 -0.1301 True
1.0 5.0 -0.5062 -0.8006 -0.2119 True
2.0 3.0 -0.3625 -0.5685 -0.1565 True
2.0 4.0 -0.437 -0.6301 -0.2439 True
2.0 5.0 -0.5155 -0.7033 -0.3276 True
3.0 4.0 -0.0745 -0.195 0.046 False
3.0 5.0 -0.153 -0.2649 -0.041 True
4.0 5.0 -0.0785 -0.1644 0.0074 False
---------------------------------------------
This means that significant differences are present between categories 1 and 3, 4, 5; 2 and 3, 4, 5; and 3 and 5.
What the categories mean:
1. All of the time
2. Most of the time
3. Some of the time
4. A little of the time
5. None of the time
It appears that there is no significant difference between the alcohol consumption of people who were depressed constantly and most of the time. Likewise, those who were sad a little of the time and those not depressed at all are likely to drink about the same amount, and the comparison between “some of the time” and “a little of the time” (3 vs 4) was not significant either.
All the other pairs of means were significantly different. This means, for example, that the drinking habits of a person who was depressed all the time are statistically different from the habits of people who were sad some of the time or less. To find out how large these differences are, let’s compare the group means.
m1 = data2.groupby('S1Q213').mean()
print(m1)
sd1 = data2.groupby('S1Q213').std()
print(sd1)
Means and standard deviations of ETOTLCA2 by S1Q213:
S1Q213    mean       std
1.0       0.998476   2.595513
2.0       1.007703   4.989636
3.0       0.645213   1.683092
4.0       0.570721   2.868919
5.0       0.492227   1.359950
As we can see, the more often people felt depressed during the 4 weeks prior to the survey, the more they tend to drink daily on average (categories 1 and 2 are nearly equal).
Summary:
ANOVA revealed that among people who consume ethanol, severity of their depression (or lack of it) (split into 5 ordered categories, which is the categorical explanatory variable) and amount of ethanol consumed daily on average (quantitative response variable) were significantly associated, F=20.23, p<0.0001.
Post hoc comparisons of mean ethanol consumption across pairs of depression-severity categories revealed that people who are more depressed tend to drink more. However, there was no significant difference in drinking levels between people depressed most of the time and all of the time, and the same holds for the two least depressed categories. So the number of categories could probably be reduced to three: very depressed, moderately depressed, and slightly or not depressed. Significant differences were found between these broader groups (very depressed people drink more than moderately depressed, who in turn drink more than the slightly or not depressed), while no statistically significant differences in means were found within them. One caveat: the “some of the time” vs “a little of the time” comparison was also non-significant, so the boundary between the moderate and low groups is somewhat fuzzy.
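As a sketch of the three-group idea above (the group labels are my own, not NESARC codes; smf is the statsmodels alias imported earlier), the categories could be collapsed and the ANOVA re-run like this:
# Hypothetical recode: 1-2 -> 'high', 3 -> 'moderate', 4-5 -> 'low/none'
collapse = {1: 'high', 2: 'high', 3: 'moderate', 4: 'low/none', 5: 'low/none'}
data2 = data2.copy()
data2['DEP3'] = data2['S1Q213'].map(collapse)
model3 = smf.ols("ETOTLCA2 ~ C(DEP3)", data=data2).fit()
print(model3.summary())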