timothy-mokoka - Tumblr blog

timothy-mokoka · 2 years ago

Testing a Potential Moderator

Introduction:

This assignment examines a sample of 2,412 marijuana/cannabis users between the ages of 18 and 30 from the NESARC dataset. My research question is as follows:

Is the number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?

My hypothesis test statements are as follows:

H0: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.

Ha: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.

The third variable, introduced as a moderator, is the categorical variable “S1Q231”, which records whether a respondent in the NESARC dataset lost a close friend or family member in the last 12 months. A moderator is a variable that affects the strength or direction of the statistical relationship between the explanatory and response variables.

Explanation of Code:

Since I have a categorical explanatory variable (frequency of cannabis use) and a categorical response variable (major depression), I ran a Chi-square Test of Independence on a contingency table (crosstab function) to examine the association between them (C->C), obtaining the chi-square value and the p-value directly. To visualise this association graphically, I used the factorplot function (seaborn library) to produce a bivariate graph. Because my explanatory variable has more than two levels, I then performed a post hoc test with the Bonferroni adjustment to determine which frequency groups differ from the others. With ten groups, 45 pairwise comparisons are needed; I examined two of them as an illustration and compared their p-values with the Bonferroni-adjusted threshold, which is 0.05 divided by 45. This makes it possible to identify where the null hypothesis can be rejected without inflating the Type I error rate.
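The arithmetic behind the Bonferroni adjustment can be checked with a few lines of Python. This is only a sketch; the function name `bonferroni_threshold` is made up for illustration and is not part of the assignment code.

```python
from math import comb

def bonferroni_threshold(n_groups, alpha=0.05):
    """Return (number of pairwise comparisons, adjusted per-comparison alpha)."""
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs of groups
    return n_comparisons, alpha / n_comparisons

# Ten frequency groups, as described in the text
n_comparisons, adjusted_alpha = bonferroni_threshold(10)
print(n_comparisons, round(adjusted_alpha, 4))
```

Each pairwise chi-square p-value is then compared against `adjusted_alpha` instead of 0.05.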

Regarding the third variable, I examined whether the death of a family member or close friend in the last 12 months moderates the significant association between cannabis use frequency and major depression diagnosis. Put another way, is frequency of cannabis use related to major depression at each level of the moderating variable (1=Yes and 2=No), that is, both for those who lost a family member or close friend in the last 12 months and for those who did not? I therefore created two new data frames (sub1 and sub2), one for each category (Yes or No), and ran a Chi-square Test of Independence for each subgroup separately, obtaining both chi-square values and p-values. Finally, with the factorplot function (seaborn library) I created two bivariate line graphs, one for each level of the moderating variable, to visualise the effect of the moderator on the statistical relationship between frequency of cannabis use and major depression diagnosis.
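The idea of testing the association separately at each level of the moderator can be sketched with a small hand-written chi-square function. Everything here is an assumption for illustration: the counts are invented, the tables are reduced to 2x2 so that the p-value (1 degree of freedom) can be computed with the standard library, and no continuity correction is applied, so the numbers will not match scipy's default output for 2x2 tables.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = depression (no, yes), columns = (infrequent, frequent) use
tables_by_moderator = {
    "lost someone (1)": [[40, 38], [10, 12]],
    "did not lose someone (2)": [[180, 140], [20, 60]],
}

results = {}
for level, table in tables_by_moderator.items():
    results[level] = chi_square_2x2(table)
    statistic, p_value = results[level]
    print(f"{level}: chi2={statistic:.2f}, p={p_value:.3g}")
```

If the association is significant at one level of the third variable and not at the other, as in these made-up tables, the third variable is acting as a moderator.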

Code / Syntax:

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 03 18:11:22 2023

@author: Oteng
"""

import pandas
import numpy
import seaborn
import scipy.stats  # import the stats submodule explicitly
import matplotlib.pyplot as plt

nesarc = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)

# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

nesarc.columns = map(str.upper, nesarc.columns)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# Change my variables to numeric
# (convert_objects has been removed from pandas; to_numeric replaces it)
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['MAJORDEP12'] = pandas.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['S1Q231'] = pandas.to_numeric(nesarc['S1Q231'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2E'] = pandas.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')

# Subset my sample
# Each comparison needs its own parentheses because & binds more tightly than ==
subset1 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)] # Ages 18-30, cannabis users
subsetc1 = subset1.copy()

# Setting missing data
subsetc1['S1Q231'] = subsetc1['S1Q231'].replace(9, numpy.nan)
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace(99, numpy.nan)
subsetc1['S3BD5Q2E'] = subsetc1['S3BD5Q2E'].replace('BL', numpy.nan)

recode1 = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1} # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ'] = subsetc1['S3BD5Q2E'].map(recode1) # Change the variable name from S3BD5Q2E to CUFREQ

subsetc1['CUFREQ'] = subsetc1['CUFREQ'].astype('category')

# Renames graph labels for better interpretation
subsetc1['CUFREQ'] = subsetc1['CUFREQ'].cat.rename_categories(["2 times/year","3-6 times/year","7-11 times/year","Once a month","2-3 times/month","1-2 times/week","3-4 times/week","Nearly every day","Every day"])

# Contingency table of observed counts of major depression diagnosis (response variable) within frequency of cannabis use groups (explanatory variable), in ages 18-30
contab1 = pandas.crosstab(subsetc1['MAJORDEP12'], subsetc1['CUFREQ'])
print(contab1)

# Column of percentages
colsum = contab1.sum(axis=0)
colpcontab = contab1/colsum
print(colpcontab)

# Chi-square calculations for major depression within frequency of cannabis use groups
print('Chi-square value, p value, expected counts, for major depression within cannabis use status')
chsq1 = scipy.stats.chi2_contingency(contab1)
print(chsq1)

# Bivariate bar graph for major depression percentages with each cannabis smoking frequency group
# (factorplot was renamed catplot in seaborn 0.9; use seaborn.catplot on newer versions)
plt.figure(figsize=(12,4)) # Change plot size
ax1 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=subsetc1, kind="bar", ci=None)
ax1.set_xticklabels(rotation=40, ha="right") # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.show()

recode2 = {1: 10, 2: 9, 3: 8, 4: 7, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1} # Frequency of cannabis use variable reverse-recode
subsetc1['CUFREQ2'] = subsetc1['S3BD5Q2E'].map(recode2) # Change the variable name from S3BD5Q2E to CUFREQ2

sub1 = subsetc1[(subsetc1['S1Q231']==1)]
sub2 = subsetc1[(subsetc1['S1Q231']==2)]

print('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
contab2 = pandas.crosstab(sub1['MAJORDEP12'], sub1['CUFREQ2'])
print(contab2)

# Column of percentages
colsum2 = contab2.sum(axis=0)
colpcontab2 = contab2/colsum2
print(colpcontab2)

# Chi-square
print('Chi-square value, p value, expected counts')
chsq2 = scipy.stats.chi2_contingency(contab2)
print(chsq2)

# Line graph for major depression percentages within each frequency group, for those who lost a family member or a close friend
plt.figure(figsize=(12,4)) # Change plot size
ax2 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub1, kind="point", ci=None)
ax2.set_xticklabels(rotation=40, ha="right") # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who lost a family member or a close friend in the last 12 months')
plt.show()

print('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
contab3 = pandas.crosstab(sub2['MAJORDEP12'], sub2['CUFREQ2'])
print(contab3)

# Column of percentages
colsum3 = contab3.sum(axis=0)
colpcontab3 = contab3/colsum3
print(colpcontab3)

# Chi-square
print('Chi-square value, p value, expected counts')
chsq3 = scipy.stats.chi2_contingency(contab3)
print(chsq3)

# Line graph for major depression percentages within each frequency group, for those who did NOT lose a family member or a close friend
plt.figure(figsize=(12,4)) # Change plot size
ax3 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=sub2, kind="point", ci=None)
ax3.set_xticklabels(rotation=40, ha="right") # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.title('Association between cannabis use status and major depression for those who did NOT lose a family member or a close friend in the last 12 months')
plt.show()

Output: Moderator

This is the moderating variable that I used for the statistical interaction:

Output1: 1st Chi-Square

A Chi-Square test of independence revealed that among cannabis users aged 18 to 30 (subsetc1), the frequency of cannabis use (explanatory variable collapsed into 9 ordered categories) and past-year depression diagnosis (binary categorical response variable) were significantly associated, X2 = 29.83, 8 df, p = 0.00022.

Output2: Bar Graph

In the bar graph presented above, we can see the association between frequency of cannabis use (explanatory variable) and major depression diagnosis in the past year (response variable). The proportions are highest in the most frequent use categories, which indicates that the more often an individual aged 18-30 smoked cannabis, the greater the chance of having experienced depression in the last 12 months.

Output3: 2nd Chi-Square

First, for the moderating variable equal to 1, that is, those who lost a family member or a close friend in the last 12 months (sub1), a Chi-Square test of independence revealed that among cannabis users aged 18 to 30, the frequency of cannabis use (explanatory variable) and past-year depression diagnosis (response variable) were not significantly associated, X2 = 4.61, 9 df, p = 0.86. Since the chi-square value is small and the p-value is large, we fail to reject the null hypothesis: there is no evidence of a statistical relationship between these two variables in the subgroup of individuals who lost a family member or a close friend in the last 12 months.

Output4: Line Graph

In the bivariate line graph (C->C) presented above, we can see the association between frequency of cannabis use (explanatory variable) and major depression diagnosis in the past year (response variable) in the subgroup of individuals who lost a family member or a close friend in the last 12 months (sub1). The line fluctuates without a clear trend, so it does not indicate a positive relationship between these two variables for those who experienced the death of a family member or close friend in the past year.

Output5: 3rd Chi-Square

Subsequently, for the moderating variable equal to 2, that is, those who did not lose a family member or a close friend in the last 12 months (sub2), a Chi-Square test of independence revealed that among cannabis users aged 18 to 30, the frequency of cannabis use (explanatory variable) and past-year depression diagnosis (response variable) were significantly associated, X2 = 37.02, 9 df, p = 2.6e-05 (the p-value is written in scientific notation). Since the chi-square value is large and the p-value is very small, we reject the null hypothesis and conclude that these two variables are associated in the subgroup of individuals who did not lose a family member or a close friend in the last 12 months.

Output6: Line Graph

In the bivariate line graph (C->C) presented above, we can see the association between frequency of cannabis use (explanatory variable) and major depression diagnosis in the past year (response variable) in the subgroup of individuals who did not lose a family member or a close friend in the last 12 months (sub2). The upward direction of the line indicates a positive relationship between these two variables: the proportion of major depression rises with the frequency of cannabis use among individuals who did not experience the death of a family member or close friend in the last 12 months.

Conclusion:

It seems that both the direction and the size of the relationship between frequency of cannabis use and major depression diagnosis in the last 12 months are heavily affected by the death of a family member or a close friend in the same period. In other words, when such a death has occurred, the association is weak and not significant, whereas when it has not, the association is strong, positive and significant. Thus, the third variable moderates the association between cannabis use frequency and major depression diagnosis.

timothy-mokoka · 2 years ago

Hypothesis Testing with Pearson Correlation

Introduction:

This assignment examines a sample of 2,412 marijuana/cannabis users between the ages of 18 and 30 from the NESARC dataset. My research question is as follows:

Is the number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?

My hypothesis test statements are as follows:

H0: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.

Ha: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.

Explanation of the Code:

My research question involves only categorical variables, but for this Pearson correlation test I selected three quantitative variables from the NESARC dataset. I therefore refined the hypothesis and examined the correlation between the age at which people began using cannabis the most, which is the quantitative explanatory variable (‘S3BD5Q2F’), and the ages at which they experienced their first episode of general anxiety and of major depression, which are the quantitative response variables (‘S9Q6A’ and ‘S4AQ6A’ respectively).

To visualize the relationship between cannabis use and general anxiety and major depression episodes, I used the seaborn library to produce a scatterplot for each mental health disorder separately, and interpreted each one by describing the direction, strength and form of the relationship. I also ran a Pearson correlation test twice, once for each mental health disorder, and measured the strength of the relationship between each pair of quantitative variables by generating the correlation coefficient “r” and its associated p-value.
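To make the coefficient “r” concrete, here is a minimal sketch that computes it by hand from the covariance and the two variances. The function name and the ages in the two lists are hypothetical values for illustration only, not NESARC data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)

# Hypothetical ages (not taken from the dataset)
ages_heaviest_use = [16, 18, 19, 21, 24, 27]
ages_first_episode = [17, 22, 18, 25, 23, 30]

r = pearson_r(ages_heaviest_use, ages_first_episode)
r_squared = r ** 2  # fraction of variability in one variable predictable from the other
print(round(r, 3), round(r_squared, 3))
```

Squaring r gives the share of variability explained, which is why a small r can be statistically significant in a large sample and still describe a weak relationship.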

Code / Syntax:

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 2 15:00:39 2023

@author: Oteng
"""

import pandas
import numpy
import seaborn
import scipy.stats  # import the stats submodule explicitly
import matplotlib.pyplot as plt

nesarc = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)

# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

nesarc.columns = map(str.upper, nesarc.columns)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# Change my variables of interest to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pandas.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S4AQ6A'] = pandas.to_numeric(nesarc['S4AQ6A'], errors='coerce')
nesarc['S3BD5Q2F'] = pandas.to_numeric(nesarc['S3BD5Q2F'], errors='coerce')
nesarc['S9Q6A'] = pandas.to_numeric(nesarc['S9Q6A'], errors='coerce')
nesarc['S4AQ7'] = pandas.to_numeric(nesarc['S4AQ7'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')

# Subset my sample
subset1 = nesarc[(nesarc['S3BQ1A5']==1)] # Cannabis users
subsetc1 = subset1.copy()

# Setting missing data
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc1['S3BD5Q2F'] = subsetc1['S3BD5Q2F'].replace('BL', numpy.nan)
subsetc1['S3BD5Q2F'] = subsetc1['S3BD5Q2F'].replace(99, numpy.nan)
subsetc1['S4AQ6A'] = subsetc1['S4AQ6A'].replace('BL', numpy.nan)
subsetc1['S4AQ6A'] = subsetc1['S4AQ6A'].replace(99, numpy.nan)
subsetc1['S9Q6A'] = subsetc1['S9Q6A'].replace('BL', numpy.nan)
subsetc1['S9Q6A'] = subsetc1['S9Q6A'].replace(99, numpy.nan)

# Scatterplot for the age when began using cannabis the most and the age of the first episode of major depression
# (subsetc1 is used so that the 99 "unknown" codes set to missing above are excluded)
plt.figure(figsize=(12,4)) # Change plot size
scat1 = seaborn.regplot(x="S3BD5Q2F", y="S4AQ6A", fit_reg=True, data=subsetc1)
plt.xlabel('Age when began using cannabis the most')
plt.ylabel('Age when experienced the first episode of major depression')
plt.title('Scatterplot for the age when began using cannabis the most and the age of the first episode of major depression')
plt.show()

data_clean = subsetc1.dropna()

# Pearson correlation coefficient for the age when began using cannabis the most and the age of the first episode of major depression
print('Association between the age when began using cannabis the most and the age of the first episode of major depression')
print(scipy.stats.pearsonr(data_clean['S3BD5Q2F'], data_clean['S4AQ6A']))

# Scatterplot for the age when began using cannabis the most and the age of the first episode of general anxiety
plt.figure(figsize=(12,4)) # Change plot size
scat2 = seaborn.regplot(x="S3BD5Q2F", y="S9Q6A", fit_reg=True, data=subsetc1)
plt.xlabel('Age when began using cannabis the most')
plt.ylabel('Age when experienced the first episode of general anxiety')
plt.title('Scatterplot for the age when began using cannabis the most and the age of the first episode of general anxiety')
plt.show()

# Pearson correlation coefficient for the age when began using cannabis the most and the age of the first episode of general anxiety
print('Association between the age when began using cannabis the most and the age of the first episode of general anxiety')
print(scipy.stats.pearsonr(data_clean['S3BD5Q2F'], data_clean['S9Q6A']))

Output 1

Pearson Correlation test results are as follows:

Output 2:

The scatterplot illustrates the relationship between the age at which individuals began using cannabis the most (quantitative explanatory variable) and the age at which they experienced their first major depression episode (quantitative response variable). The direction is positive: as the age when an individual began using cannabis the most increases, the age of the first major depression episode also increases. The Pearson correlation test gave a correlation coefficient of 0.23, which indicates a weak positive linear relationship between the two quantitative variables of interest. The associated p-value is 2.27e-09, which is very small. The relationship is therefore statistically significant, but the association between the two variables is weak.

Output 3:

The scatterplot above shows the association between the age at which individuals began using cannabis the most (quantitative explanatory variable) and the age at which they experienced their first general anxiety episode (quantitative response variable). The direction is a positive linear relationship. The Pearson correlation test gave a correlation coefficient of 0.1494, which indicates a weak positive linear relationship between the two quantitative variables. The associated p-value is 0.00012, which indicates a statistically significant relationship. Thus the relationship between the age when individuals began using cannabis the most and the age of their first general anxiety episode is significant but weak. The r^2, which is about 0.02 (0.1494^2 ≈ 0.022), is very low: only about 2% of the variability in one variable can be predicted from the other.

timothy-mokoka · 2 years ago

Assignment 2: Hypothesis Testing with Chi-Square Test of Independence

Introduction:

This assignment examines a sample of 2,412 marijuana/cannabis users between the ages of 18 and 30 from the NESARC dataset. My research question is as follows:

Is the number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?

My hypothesis test statements are as follows:

H0: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.

Ha: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.

Explanation of the Code:

I used the crosstabulation function to produce a contingency table of observed counts and percentages for each mental health disorder, i.e. depression and anxiety. I did this to examine whether cannabis use status (1 = Yes and 2 = No), the categorical explanatory variable ‘S3BQ1A5’, is associated with the categorical response variables depression (‘MAJORDEP12’) and anxiety (‘GENAXDX12’). I therefore ran a Chi-Square Test of Independence twice, once for each response variable, calculating the chi-square values and corresponding p-values so that the null hypothesis can be rejected or retained in light of the findings.
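What the column percentages of a contingency table represent can be sketched without pandas. The helper `column_proportions` and the eight respondents below are hypothetical, chosen only to show the calculation.

```python
from collections import Counter

def column_proportions(records):
    """records: list of (explanatory_level, response_level) pairs.

    Returns {explanatory_level: {response_level: proportion within that column}}.
    """
    cell_counts = Counter(records)
    column_totals = Counter(explanatory for explanatory, _ in records)
    proportions = {}
    for (explanatory, response), count in cell_counts.items():
        proportions.setdefault(explanatory, {})[response] = count / column_totals[explanatory]
    return proportions

# Hypothetical respondents: (cannabis use status 1=Yes/2=No, major depression 1=Yes/0=No)
records = [(1, 1), (1, 0), (1, 0), (1, 1), (2, 0), (2, 0), (2, 0), (2, 1)]
proportions = column_proportions(records)
print(proportions)
```

Each column (level of the explanatory variable) sums to 1, so the depression rates of users and non-users can be compared directly even when the groups differ in size.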

To visualize the relationship between the frequency of cannabis use and depression diagnosis, I used the factorplot function to produce a bivariate graph. I also used the crosstabulation function to test the association between cannabis use status (‘S3BQ1A5’) and general anxiety (‘GENAXDX12’). After the third Chi-Square Test of Independence (frequency of cannabis use, ‘S3BD5Q2E’, against major depression), I performed a post hoc test using the Bonferroni adjustment, since the explanatory variable has more than two levels. Doing this makes it possible to identify instances where the null hypothesis can be rejected without inflating the Type I error rate.

Code / Syntax:

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 31 12:20:15 2023

@author: Oteng
"""

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

nesarc = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Sets pandas to show all columns in a dataframe
pandas.set_option('display.max_columns', None)

# Sets pandas to show all rows in a dataframe
pandas.set_option('display.max_rows', None)

nesarc.columns = map(str.upper, nesarc.columns)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# Changes the variables of interest to numeric
nesarc['AGE'] = pandas.to_numeric(nesarc['AGE'], errors='coerce')
nesarc['S3BQ4'] = pandas.to_numeric(nesarc['S3BQ4'], errors='coerce')
nesarc['S3BQ1A5'] = pandas.to_numeric(nesarc['S3BQ1A5'], errors='coerce')
nesarc['S3BD5Q2B'] = pandas.to_numeric(nesarc['S3BD5Q2B'], errors='coerce')
nesarc['S3BD5Q2E'] = pandas.to_numeric(nesarc['S3BD5Q2E'], errors='coerce')
nesarc['MAJORDEP12'] = pandas.to_numeric(nesarc['MAJORDEP12'], errors='coerce')
nesarc['GENAXDX12'] = pandas.to_numeric(nesarc['GENAXDX12'], errors='coerce')

# Subset of my sample of interest
subset1 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30)] # Ages between 18-30
subsetc1 = subset1.copy()

subset2 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)] # Cannabis users, between age 18-30
subsetc2 = subset2.copy()

# Setting missing data for frequency and cannabis use, variables S3BD5Q2E, S3BQ1A5
subsetc1['S3BQ1A5'] = subsetc1['S3BQ1A5'].replace(9, numpy.nan)
subsetc2['S3BD5Q2E'] = subsetc2['S3BD5Q2E'].replace('BL', numpy.nan)
subsetc2['S3BD5Q2E'] = subsetc2['S3BD5Q2E'].replace(99, numpy.nan)

# Contingency table of observed counts of major depression diagnosis (response variable) within cannabis use (explanatory variable), in ages 18-30
contab1 = pandas.crosstab(subsetc1['MAJORDEP12'], subsetc1['S3BQ1A5'])
print(contab1)

# Column percentages
colsum = contab1.sum(axis=0)
colpcontab = contab1/colsum
print(colpcontab)

# Chi-square calculations for major depression within cannabis use status
print('Chi-square value, p value, expected counts, for major depression within cannabis use status')
chsq1 = scipy.stats.chi2_contingency(contab1)
print(chsq1)

# Contingency table of observed counts of general anxiety diagnosis (response variable) within cannabis use (explanatory variable), in ages 18-30
contab2 = pandas.crosstab(subsetc1['GENAXDX12'], subsetc1['S3BQ1A5'])
print(contab2)

# Column percentages
colsum2 = contab2.sum(axis=0)
colpcontab2 = contab2/colsum2
print(colpcontab2)

# Chi-square calculations for general anxiety within cannabis use status
print('Chi-square value, p value, expected counts, for general anxiety within cannabis use status')
chsq2 = scipy.stats.chi2_contingency(contab2)
print(chsq2)

# Contingency table of observed counts of major depression diagnosis (response variable) within frequency of cannabis use (10 level explanatory variable), in ages 18-30
# (subsetc2 is used so that the 99 "unknown" code set to missing above does not appear as an extra level)
contab3 = pandas.crosstab(subsetc2['MAJORDEP12'], subsetc2['S3BD5Q2E'])
print(contab3)

# Column percentages
colsum3 = contab3.sum(axis=0)
colpcontab3 = contab3/colsum3
print(colpcontab3)

# Chi-square calculations for major depression within frequency of cannabis use groups
print('Chi-square value, p value, expected counts for major depression associated frequency of cannabis use')
chsq3 = scipy.stats.chi2_contingency(contab3)
print(chsq3)

recode1 = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1} # Dictionary with details of frequency variable reverse-recode
subsetc2['CUFREQ'] = subsetc2['S3BD5Q2E'].map(recode1) # Change variable name from S3BD5Q2E to CUFREQ

subsetc2["CUFREQ"] = subsetc2["CUFREQ"].astype('category')

# Rename graph labels for better interpretation
subsetc2['CUFREQ'] = subsetc2['CUFREQ'].cat.rename_categories(["2 times/year","3-6 times/year","7-11 times/year","Once a month","2-3 times/month","1-2 times/week","3-4 times/week","Nearly every day","Every day"])

# Graph percentages of major depression within each cannabis smoking frequency group
# (factorplot was renamed catplot in seaborn 0.9; use seaborn.catplot on newer versions)
plt.figure(figsize=(12,4)) # Change plot size
ax1 = seaborn.factorplot(x="CUFREQ", y="MAJORDEP12", data=subsetc2, kind="bar", ci=None)
ax1.set_xticklabels(rotation=40, ha="right") # X-axis labels rotation
plt.xlabel('Frequency of cannabis use')
plt.ylabel('Proportion of Major Depression')
plt.show()

# Post hoc test, pair comparison of frequency groups 1 and 9, 'Every day' and '2 times a year'
recode2 = {1: 1, 9: 9}
subsetc2['COMP1v9'] = subsetc2['S3BD5Q2E'].map(recode2)

# Contingency table of observed counts
ct4 = pandas.crosstab(subsetc2['MAJORDEP12'], subsetc2['COMP1v9'])
print(ct4)

# Column percentages
colsum4 = ct4.sum(axis=0)
colpcontab4 = ct4/colsum4
print(colpcontab4)

# Chi-square calculations for pair comparison of frequency groups 1 and 9, 'Every day' and '2 times a year'
print('Chi-square value, p value, expected counts, for pair comparison of frequency groups -Every day- and -2 times a year-')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)

# Post hoc test, pair comparison of frequency groups 2 and 6, 'Nearly every day' and 'Once a month'
recode3 = {2: 2, 6: 6}
subsetc2['COMP2v6'] = subsetc2['S3BD5Q2E'].map(recode3)

# Contingency table of observed counts
ct5 = pandas.crosstab(subsetc2['MAJORDEP12'], subsetc2['COMP2v6'])
print(ct5)

# Column percentages
colsum5 = ct5.sum(axis=0)
colpcontab5 = ct5/colsum5
print(colpcontab5)

# Chi-square calculations for pair comparison of frequency groups 2 and 6, 'Nearly every day' and 'Once a month'
print('Chi-square value, p value, expected counts for pair comparison of frequency groups -Nearly every day- and -Once a month-')
cs5 = scipy.stats.chi2_contingency(ct5)
print(cs5)

Output:

Explanation: When examining the association between cannabis use and major depression, the Chi-Square Test of Independence amongst young adults aged between 18 and 30 years shows that cannabis users were more likely to have been diagnosed with major depression in the last 12 months (about 18%) than non-users of cannabis (8.4%), X2 = 171.6, 1 df, p-value = 3.16e-39. Since the p-value is extremely small, the results provide enough evidence against the null hypothesis. Thus, we reject the null hypothesis in favour of the alternative hypothesis: there is a positive association between cannabis use and major depression.

Explanation: When testing the association between cannabis use and general anxiety, the Chi-Square Test of Independence reveals that, amongst young adults aged between 18 and 30 years, cannabis users were more likely to have been diagnosed with general anxiety in the last 12 months (3.8%) than non-users of cannabis (1.6%), X2 = 40.22, 1 df, p-value = 2.26e-10. These results provide enough evidence against the null hypothesis to reject it in favour of the alternative hypothesis, which indicates a positive association between cannabis use and general anxiety.

Explanation: This third Chi-Square Test of Independence shows that, for cannabis users aged between 18 and 30 years, the frequency of cannabis usage and major depression for the past 12 months were significantly associated, X2 = 35.18, 10 df, p-value = 0.00011.

Explanation: The bivariate graph above, presenting my sample of interest, shows a positive association between the frequency of cannabis use and major depression in the past 12 months. The proportions are highest in the most frequent use categories, which indicates that the more often individuals aged 18-30 smoked cannabis, the greater their chance of having experienced major depression in the past 12 months.

Explanation: The post hoc comparison (with the Bonferroni adjustment) of the rate of major depression between the “Every day” and “2 times a year” frequency categories gives a p-value of 0.00019, and the percentages of major depression diagnosis in the two groups are 23.7% and 11.6% respectively. Since the p-value is smaller than the Bonferroni-adjusted threshold (0.00019 < 0.0011), we can conclude that these rates differ from one another. Therefore, for this pair, we reject the null hypothesis in favour of the alternative hypothesis.

Explanation: For the post hoc comparison (with the Bonferroni adjustment) of major depression between the “Nearly every day” and “Once a month” frequency categories, the p-value is 0.046 and the percentages of major depression in the two groups are 23.3% and 13.7% respectively. Since the p-value is larger than the Bonferroni-adjusted threshold (0.046 > 0.0011), these rates are not significantly different from one another. Thus, for this pair, we fail to reject the null hypothesis.
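The two post hoc decisions can be reproduced with a short sketch that compares the reported p-values against the adjusted threshold. The p-values are the ones quoted in the explanations; the variable names are my own, and the ten-group count is the assumption behind the 0.0011 threshold.

```python
from math import comb

alpha = 0.05
n_groups = 10  # assumed number of frequency groups behind the 0.0011 threshold
adjusted_alpha = alpha / comb(n_groups, 2)  # 0.05 / 45

# p-values reported in the explanations for the two pairs that were tested
reported_p_values = {
    ("Every day", "2 times a year"): 0.00019,
    ("Nearly every day", "Once a month"): 0.046,
}

# True means the pair differs significantly after the Bonferroni adjustment
decisions = {pair: p < adjusted_alpha for pair, p in reported_p_values.items()}

for pair, reject in decisions.items():
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"{pair[0]} vs {pair[1]}: {verdict}")
```

Note that 0.046 would count as significant at the unadjusted 0.05 level, which is exactly the kind of result the adjustment is meant to screen out.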

timothy-mokoka · 2 years ago

Analysis of Variance and Hypothesis Testing For NESARC dataset

This assessment examines a sample of 2,412 marijuana/cannabis users from the NESARC dataset.

The research question is:

Is the number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 the leading cause of mental health disorders such as depression and anxiety?

This results in the following hypotheses:

H0: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is not the leading cause of mental health disorders such as depression and anxiety.

Ha: The number of cannabis joints smoked per day amongst young adults in the USA between the ages of 18 and 30 is the leading cause of mental health disorders such as depression and anxiety.

This question involves two categorical explanatory variables, depression (‘MAJORDEP12’) and anxiety (‘GENAXDX12’), and a quantitative response variable, the quantity of cannabis joints smoked per day (‘S3BQ4’). I therefore ran the ANOVA twice, testing each categorical explanatory variable against the quantity of joints smoked per day, to calculate the F-statistic and its corresponding p-value. After that I used the ols function to test the association between the frequency of cannabis use (‘S3BD5Q2E’) and the quantity of joints smoked per day. This is followed by a post hoc test using Tukey's HSD, which is needed in the context of ANOVA because the frequency variable has more than two levels. The post hoc test makes it possible to identify which groups differ without inflating the Type I error rate.
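To show what the F-statistic measures, here is a minimal hand calculation of a one-way ANOVA as the ratio of between-group to within-group variability. The function name and the joints-per-day values are hypothetical and are not drawn from the NESARC data.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a list of groups of values."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability of the group means around the grand mean
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variability of the observations around their own group mean
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical joints smoked per day, split by major depression diagnosis
joints_no_depression = [1, 2, 2, 3, 2]
joints_depression = [3, 4, 2, 5, 4]

f_stat, df_between, df_within = one_way_anova_f([joints_no_depression, joints_depression])
print(round(f_stat, 3), df_between, df_within)
```

A large F means the group means differ by more than the spread within the groups would suggest; the ols summary reports the same ratio together with its p-value.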

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu Mar 30 18:29:25 2023

@author: Oteng
"""

import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Load the NESARC dataset
nesarc = pandas.read_csv('nesarc_pds.csv', low_memory=False)

# Show all columns in a dataframe
pandas.set_option('display.max_columns', None)

# Show all rows in a dataframe
pandas.set_option('display.max_rows', None)

# Upper-case all column names
nesarc.columns = nesarc.columns.str.upper()

# Display floats in fixed-point notation
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables of interest to numeric (non-numeric entries become NaN)
numeric_columns = ['AGE', 'S3BQ4', 'S3BQ1A5', 'S3BD5Q2B', 'S3BD5Q2E', 'MAJORDEP12', 'GENAXDX12']
for column in numeric_columns:
    nesarc[column] = pandas.to_numeric(nesarc[column], errors='coerce')

# Subset the sample to cannabis users between the ages of 18 and 30
subset5 = nesarc[(nesarc['AGE']>=18) & (nesarc['AGE']<=30) & (nesarc['S3BQ1A5']==1)]
subsetc5 = subset5.copy()

# Set missing data for quantity of cannabis (measured in joints), variable S3BQ4
subsetc5['S3BQ4'] = subsetc5['S3BQ4'].replace(99, numpy.nan)
subsetc5['S3BQ4'] = subsetc5['S3BQ4'].replace('BL', numpy.nan)

sub1 = subsetc5[['S3BQ4', 'MAJORDEP12']].dropna()

# Use the ols function to calculate the F-statistic and the corresponding p-value
# Major depression (categorical, explanatory variable) and joints quantity (quantitative, response variable)
model1 = smf.ols(formula='S3BQ4 ~ C(MAJORDEP12)', data=sub1)
results1 = model1.fit()
print(results1.summary())

# Mean and spread of joints quantity by MAJORDEP12, major depression
print('Means for joints quantity by major depression status')
m1 = sub1.groupby('MAJORDEP12').mean()
print(m1)

print('Standard deviations for joints quantity by major depression status')
sd1 = sub1.groupby('MAJORDEP12').std()
print(sd1)

sub2 = subsetc5[['S3BQ4', 'GENAXDX12']].dropna()

# Use the ols function to calculate the F-statistic and the associated p-value
# Anxiety (categorical, explanatory variable) and joints quantity (quantitative, response variable)
model2 = smf.ols(formula='S3BQ4 ~ C(GENAXDX12)', data=sub2)
results2 = model2.fit()
print(results2.summary())

# Mean and spread of joints quantity by GENAXDX12, general anxiety
print('Means for joints quantity by general anxiety status')
m2 = sub2.groupby('GENAXDX12').mean()
print(m2)

print('Standard deviations for joints quantity by general anxiety status')
sd2 = sub2.groupby('GENAXDX12').std()
print(sd2)

# Set missing data for frequency of cannabis use, variable S3BD5Q2E
subsetc5['S3BD5Q2E'] = subsetc5['S3BD5Q2E'].replace(99, numpy.nan)
subsetc5['S3BD5Q2E'] = subsetc5['S3BD5Q2E'].replace('BL', numpy.nan)

sub3 = subsetc5[['S3BQ4', 'S3BD5Q2E']].dropna()

# Use the ols function to calculate the F-statistic and the associated p-value
# Frequency of cannabis use (10-level categorical, explanatory variable) and joints quantity (quantitative, response variable)
model3 = smf.ols(formula='S3BQ4 ~ C(S3BD5Q2E)', data=sub3).fit()
print(model3.summary())

# Mean and spread of joints quantity by S3BD5Q2E, frequency of cannabis use
print('Means for joints quantity by frequency of cannabis use status')
mc2 = sub3.groupby('S3BD5Q2E').mean()
print(mc2)

print('Standard deviations for joints quantity by frequency of cannabis use status')
sdc2 = sub3.groupby('S3BD5Q2E').std()
print(sdc2)

# Run a post hoc test (pairwise comparisons) using Tukey HSD
mc1 = multi.MultiComparison(sub3['S3BQ4'], sub3['S3BD5Q2E'])
res1 = mc1.tukeyhsd()
print(res1.summary())
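The missing-data step in the listing above can be tried in isolation on a few made-up values. This is only a sketch: the four responses are hypothetical, and it assumes, as the listing does, that 'BL' marks a blank and 99 is a missing-data code for S3BQ4.

```python
import numpy
import pandas

# Hypothetical raw responses (not NESARC data): 'BL' is a blank, 99 a missing-data code
raw = pandas.Series(['1', 'BL', '99', '4'], name='S3BQ4')

# Non-numeric entries such as 'BL' are coerced to NaN
numeric = pandas.to_numeric(raw, errors='coerce')

# The numeric missing-data code is then set to NaN as well
cleaned = numeric.replace(99, numpy.nan)

# Only genuine quantities survive dropna()
print(cleaned.dropna().tolist())
```

Once the column has been coerced to numeric, the 'BL' entries are already NaN, so it is the replacement of 99 that still matters before calling dropna().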

Output: ANOVA 1

Interpretation: The ANOVA reveals that, among young adults between the ages of 18 and 30, those with major depression reportedly smoked more joints per day (Mean = 3.04, std = +/- 5.22) than those without major depression (Mean = 2.29, std = +/- 4.16), with an F-statistic of 7.682 and an associated p-value of 0.00562. Since the p-value is smaller than the 0.05 significance level, we reject the null hypothesis: there is a significant association between the quantity of joints smoked per day and a diagnosis of major depression in this age group.

Output: ANOVA 2

Interpretation: The ANOVA for anxiety diagnosed in the past 12 months and the quantity of joints smoked per day amongst young adults between the ages of 18 and 30 reveals that those with general anxiety (Mean = 2.68, std = +/- 3.15) smoke almost the same number of joints per day as those without it (Mean = 2.5, std = +/- 4.42). The F-statistic is 0.1411 and the p-value is 0.707, which is greater than 0.05. Thus we fail to reject the null hypothesis: there is no significant association between the quantity of joints smoked per day and general anxiety.

Output: ANOVA3 & Post Hoc Test

Interpretation: The ANOVA shows that, among cannabis users aged between 18 and 30, the frequency of use and the number of joints smoked per day are significantly associated. The F-statistic is 52.65, with a very small p-value of 1.76e-87, well below the 0.05 threshold, so we reject the null hypothesis.

The post hoc comparison of the mean quantity of joints smoked per day across pairs of cannabis usage frequency categories shows that young adults using cannabis every day (Mean = 5.66, std = +/- 7.8) or nearly every day (Mean = 3.73, std = +/- 4.46) reportedly smoked more on average than those who smoked once or twice a week (Mean = 1.85, std = +/- 1.81) or less. This shows a positive association between the frequency of use and the quantity of cannabis smoked per day.
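The decision rule used in the three interpretations above can be written down in a few lines. This is a sketch only: the p-values are the ones reported in this post, while the function `decide` and the dictionary labels are hypothetical names introduced for illustration.

```python
ALPHA = 0.05  # significance level used throughout this post

def decide(p_value, alpha=ALPHA):
    """Return the hypothesis-test decision for a single p-value."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-values reported for the three ANOVAs
reported = {
    "joints ~ major depression": 0.00562,
    "joints ~ general anxiety": 0.707,
    "joints ~ frequency of use": 1.76e-87,
}

decisions = {name: decide(p) for name, p in reported.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

A p-value above the threshold means we fail to reject the null hypothesis, which is weaker than showing the null hypothesis is true.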

Output: Frequency Table & Tukey HSD Test Results

Conclusion: The table above shows the frequency of cannabis usage and the quantity of joints smoked per day, with the Tukey HSD post hoc test applied in the context of the ANOVA. It shows the differences in smoking quantity between pairs of frequency groups, helping us to see for which pairs we can reject the null hypothesis and for which we cannot. Without such a post hoc adjustment, running many pairwise comparisons would inflate the chance of a Type I error, that is, of falsely rejecting the null hypothesis.
