ANOVA and Hypothesis Testing.
The Syntax:
-- coding: utf-8 --
"""
Created on Wed Sep 14 16:20:39 2022
@author: Aujasvi Sulekh
"""
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
//Load the NESARC dataset
nesarc = pd.read_csv ('NESARC Pds.csv' , low_memory=False)
//Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
//Set PANDAS to show all rows in DataFrame
pd.set_option('display.max_rows', None)
nesarc.columns = map(str.upper , nesarc.columns)
pd.set_option('display.float_format' , lambda x:'%f'%x)
//Change variables to numeric
nesarc['AGE'] = pd.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pd.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S3BQ1A5'] = pd.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2B'] = pd.to_numeric(nesarc['S3BD5Q2B'], errors='coerce')
nesarc['S3BD5Q2E'] = pd.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
nesarc['MAJORDEP12'] = pd.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['GENAXDX12'] = pd.to_numeric(nesarc['GENAXDX12'], errors='coerce')
//Subset the sample
sampleset = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)]
//Cannabis users, ages 18-30
samplecopy = sampleset.copy()
//Setting missing data for quantity of cannabis (measured in joints), variable S3BQ4
samplecopy['S3BQ4'] = samplecopy['S3BQ4'].replace(99, np.nan)
samplecopy['S3BQ4'] = samplecopy['S3BQ4'].replace('BL', np.nan)
//Creating a new subset while removing the null values
newsub_a = samplecopy[['S3BQ4', 'MAJORDEP12']].dropna()
//Using ols function for calculating the F-statistic and the associated p value
//Depression (categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_a = smf.ols(formula='S3BQ4 ~ C(MAJORDEP12)', data = newsub_a)
result_a = model_a.fit()
print(result_a.summary())
//Measure mean and spread for categorical variable MAJORDEP12, major depression
print ('Means for joints quantity by major depression status')
mean_a = newsub_a.groupby('MAJORDEP12').mean()
print(mean_a)
print ('Standard deviations for joints quantity by major depression status')
sdev_a = newsub_a.groupby('MAJORDEP12').std()
print(sdev_a)
//Creating a new subset while removing the null values
newsub_b = samplecopy[['S3BQ4', 'GENAXDX12']].dropna()
//Using ols function for calculating the F-statistic and the associated p value
//Anxiety (categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_b = smf.ols(formula='S3BQ4 ~ C(GENAXDX12)', data = newsub_b)
result_b = model_b.fit()
print(result_b.summary())
//Measure mean and spread for categorical variable GENAXDX12, general anxiety
print ('Means for joints quantity by major general anxiety status')
mean_b = newsub_b.groupby('GENAXDX12').mean()
print(mean_b)
print ('Standard deviations for joints quantity by general anxiety status')
sdev_b = newsub_b.groupby('GENAXDX12').std()
print(sdev_b)
//Setting missing data for frequency of cannabis use, variable S3BD5Q2E
samplecopy['S3BD5Q2E'] = samplecopy['S3BD5Q2E'].replace(99, np.nan)
samplecopy['S3BD5Q2E'] = samplecopy['S3BD5Q2E'].replace('BL', np.nan)
newsub_c = samplecopy[['S3BQ4', 'S3BD5Q2E']].dropna()
//Using ols function for calculating the F-statistic and associated p value
//Frequency of cannabis use (10 level categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_c = smf.ols(formula='S3BQ4 ~ C(S3BD5Q2E)', data = newsub_c)
result_c = model_c.fit()
print(result_c.summary())
//Measure mean and spread for categorical variable S3BD5Q2E, frequency of cannabis use
print('Means for joints quantity by frequency of cannabis use status')
//cannabis_use_mean= newsub_b.groupby('S3BD5Q2E').mean()
cannabis_use_mean= samplecopy['S3BD5Q2E'].mean()
print(cannabis_use_mean)
print('Standard deviations for joints quantity by frequency of cannabis use status')
cannabis_use_sdev = newsub_c.groupby('S3BD5Q2E').std()
print(cannabis_use_sdev)
//Run a post hoc test (paired comparisons), using Tukey HSDT
multicmp = multi.MultiComparison(newsub_c['S3BQ4'], newsub_c['S3BD5Q2E'])
result = multicmp.tukeyhsd()
print(result.summary())
Theory and Introduction:
Analysis of variance, or ANOVA, is a statistical method that separates observed variance data into different components to use for additional tests. A one-way ANOVA is used for three or more groups of data, to gain information about the relationship between the dependent and independent variables.
The data was provided by the National Epidemiological Survey on Alcohol and Related Conditions (NESARC), which was conducted in a random sample of 43,093 U.S. adults and designed to determine the magnitude of alcohol use and psychiatric disorders. Sample size is important because the larger the sample size, the more accurate the findings. NESARC’s unusually large sample size also made it possible to achieve stable estimates of even rare conditions. NESARC participants came from all walks of life and a variety of ages, and the level of analysis studied was individual. They represented all regions of the United States and included residents of the District of Columbia, Alaska, and Hawaii. In addition to sampling individuals living in traditional households, NESARC investigators questioned military personnel living off base and people living in a variety of group accommodations such as boarding or rooming houses and college quarters. More specifically, the sample consists of 24,575 (57.1%) males and 18,518 (42.9%) females, among of whom 9,535 (22.13%) were aged between 18 and 30 years old. The data analytic subset, examined in this study, includes individuals aged between 18 and 30 years old who reported using cannabis at least once in their life (N=2,412).
For the comments '#' has been replaced by '//' in the syntax.
The Syntax:
-- coding: utf-8 --
"""
Created on Wed Sep 14 16:20:39 2022
@author: Aujasvi Sulekh
"""
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
//Load the NESARC dataset
nesarc = pd.read_csv ('NESARC Pds.csv' , low_memory=False)
//Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
//Set PANDAS to show all rows in DataFrame
pd.set_option('display.max_rows', None)
nesarc.columns = map(str.upper , nesarc.columns)
pd.set_option('display.float_format' , lambda x:'%f'%x)
//Change variables to numeric
nesarc['AGE'] = pd.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pd.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S3BQ1A5'] = pd.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2B'] = pd.to_numeric(nesarc['S3BD5Q2B'], errors='coerce')
nesarc['S3BD5Q2E'] = pd.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
nesarc['MAJORDEP12'] = pd.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['GENAXDX12'] = pd.to_numeric(nesarc['GENAXDX12'], errors='coerce')
//Subset the sample
sampleset = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)]
//Cannabis users, ages 18-30
samplecopy = sampleset.copy()
//Setting missing data for quantity of cannabis (measured in joints), variable S3BQ4
samplecopy['S3BQ4'] = samplecopy['S3BQ4'].replace(99, np.nan)
samplecopy['S3BQ4'] = samplecopy['S3BQ4'].replace('BL', np.nan)
//Creating a new subset while removing the null values
newsub_a = samplecopy[['S3BQ4', 'MAJORDEP12']].dropna()
//Using ols function for calculating the F-statistic and the associated p value
//Depression (categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_a = smf.ols(formula='S3BQ4 ~ C(MAJORDEP12)', data = newsub_a)
result_a = model_a.fit()
print(result_a.summary())
//Measure mean and spread for categorical variable MAJORDEP12, major depression
print ('Means for joints quantity by major depression status')
mean_a = newsub_a.groupby('MAJORDEP12').mean()
print(mean_a)
print ('Standard deviations for joints quantity by major depression status')
sdev_a = newsub_a.groupby('MAJORDEP12').std()
print(sdev_a)
//Creating a new subset while removing the null values
newsub_b = samplecopy[['S3BQ4', 'GENAXDX12']].dropna()
//Using ols function for calculating the F-statistic and the associated p value
//Anxiety (categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_b = smf.ols(formula='S3BQ4 ~ C(GENAXDX12)', data = newsub_b)
result_b = model_b.fit()
print(result_b.summary())
//Measure mean and spread for categorical variable GENAXDX12, general anxiety
print ('Means for joints quantity by major general anxiety status')
mean_b = newsub_b.groupby('GENAXDX12').mean()
print(mean_b)
print ('Standard deviations for joints quantity by general anxiety status')
sdev_b = newsub_b.groupby('GENAXDX12').std()
print(sdev_b)
//Setting missing data for frequency of cannabis use, variable S3BD5Q2E
samplecopy['S3BD5Q2E'] = samplecopy['S3BD5Q2E'].replace(99, np.nan)
samplecopy['S3BD5Q2E'] = samplecopy['S3BD5Q2E'].replace('BL', np.nan)
newsub_c = samplecopy[['S3BQ4', 'S3BD5Q2E']].dropna()
//Using ols function for calculating the F-statistic and associated p value
//Frequency of cannabis use (10 level categorical, explanatory variable) and joints quantity (quantitative, response variable) correlation
model_c = smf.ols(formula='S3BQ4 ~ C(S3BD5Q2E)', data = newsub_c)
result_c = model_c.fit()
print(result_c.summary())
//Measure mean and spread for categorical variable S3BD5Q2E, frequency of cannabis use
print('Means for joints quantity by frequency of cannabis use status')
//cannabis_use_mean= newsub_b.groupby('S3BD5Q2E').mean()
cannabis_use_mean= samplecopy['S3BD5Q2E'].mean()
print(cannabis_use_mean)
print('Standard deviations for joints quantity by frequency of cannabis use status')
cannabis_use_sdev = newsub_c.groupby('S3BD5Q2E').std()
print(cannabis_use_sdev)
//Run a post hoc test (paired comparisons), using Tukey HSDT
multicmp = multi.MultiComparison(newsub_c['S3BQ4'], newsub_c['S3BD5Q2E'])
result = multicmp.tukeyhsd()
print(result.summary())
The Output:
1 note
·
View note