The significance of alcohol consumption and average income on suicide rate
I performed a data analysis on the data gathered by the Gapminder Foundation in Stockholm to support UN development efforts. I used this data to analyze the effect of alcohol consumption and average income on the suicide rate per 100,000 people. The analysis was conducted using the OLS regression technique from the Python statsmodels package.
Average income was divided into two categories: greater than or equal to $2000 USD per month, and less than $2000 USD per month.
Alcohol consumption was divided into categories according to the number of liters consumed:
category 0: below 3 L per month
category 1: from 3 L to 9 L per month
category 2: above 9 L per month (inferred from the three groups in the results)
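The category columns described above could be built along the following lines. This is a minimal sketch, not the author's code: it assumes the raw Gapminder columns are named `incomeperperson` and `alcconsumption` (names not given in this post), that income category 1 is the higher band, and that the third alcohol category seen in the results starts above 9 L. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the real data (hypothetical, illustration only).
toy = pd.DataFrame({
    "incomeperperson": [500.0, 2000.0, 15000.0, np.nan],
    "alcconsumption": [1.2, 3.0, 10.5, 8.9],
})

# Income: 0 = below 2000, 1 = 2000 or more (assumed coding).
# right=False makes each bin closed on the left: [low, high).
toy["Income"] = pd.cut(
    toy["incomeperperson"], bins=[0, 2000, np.inf], labels=[0, 1], right=False
)

# Alcohol use: 0 = below 3 L, 1 = 3 L up to 9 L, 2 = 9 L and above (assumed).
# The post does not say which side exactly 9 L belongs on.
toy["Alcoholuse"] = pd.cut(
    toy["alcconsumption"], bins=[0, 3, 9, np.inf], labels=[0, 1, 2], right=False
)

print(toy)
```

Missing inputs stay missing after `pd.cut`, so a later `dropna()` would exclude them from the group comparison.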
The code was as follows:
"""
Created on Sat Aug 8 10:27:52 2020
@author: omar.elfarouk
"""
import numpy
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)
df = pd.DataFrame(data)
# setting variables you will be working with to numeric
df = df.replace(r'\s+', 0, regex=True)  # Replace empty strings with zero

# subset data to income per person, alcohol consumption, suicide rate, and employment
sub1 = data
sub1 = sub1.replace(r'\s+', 0, regex=True)  # Replace empty strings with zero
# SETTING MISSING DATA
# Creating a secondary variable multiplying income by alcohol consumption by employment rate
# sub1['suicideper100th'] = sub1['suicideper100th'].replace(0, numpy.nan)
sub1['suicideper100th'] = pd.to_numeric(sub1['suicideper100th'])
# sub1['Income'] = pd.to_numeric(sub1['Income'])
ct1 = sub1.groupby('suicideper100th').size()
print(ct1)

# NOTE: 'Income' and 'Alcoholuse' are the categorical columns described above;
# the step that creates them is not shown in this listing.

# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='suicideper100th ~ C(Income)', data=sub1).fit()
results1 = model1
print(results1.summary())

sub2 = sub1[['suicideper100th', 'Income']].dropna()

print('means for income by suicide status')
m1 = sub2.groupby('Income').mean()
print(m1)

print('standard deviations for income suicide status')
sd1 = sub2.groupby('Income').std()
print(sd1)

# i will call it sub3
sub3 = sub1[['suicideper100th', 'Alcoholuse']].dropna()

model2 = smf.ols(formula='suicideper100th ~ C(Alcoholuse)', data=sub3).fit()
print(model2.summary())

print('means for alcohol use by suicide status')
m2 = sub3.groupby('Alcoholuse').mean()
print(m2)

print('standard deviations for alcohol use by suicide')
sd2 = sub3.groupby('Alcoholuse').std()
print(sd2)

# Tukey HSD comparison for post hoc test
mc1 = multi.MultiComparison(sub3['suicideper100th'], sub3['Alcoholuse'])
res1 = mc1.tukeyhsd()
print(res1.summary())
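The script above replaces blank entries with zero before converting to numeric, so a country with no recorded value is treated as having a value of 0. One possible alternative, which is not what was run for this post, is to coerce blanks to NaN so that `dropna()` and the model fit skip them. The sketch below uses a made-up four-row column to show how the two choices differ. Applying it to the real data would change the number of observations, and therefore the figures reported here.

```python
import pandas as pd

# Hypothetical raw column with two blank entries, as read from a CSV as text.
raw = pd.DataFrame({"suicideper100th": ["7.7", " ", "12.3", ""]})

# Blanks become NaN and are ignored by mean(), dropna(), and model fitting.
as_nan = pd.to_numeric(raw["suicideper100th"], errors="coerce")

# Same column with blanks counted as zero, which mimics the effect of the script.
as_zero = as_nan.fillna(0)

print("mean with blanks as NaN: ", as_nan.mean())
print("mean with blanks as zero:", as_zero.mean())
```

Counting blanks as zero pulls the mean down and keeps those rows in the sample, which is worth keeping in mind when reading the group means.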
The null hypothesis is that the mean suicide rate does not differ between the alcohol consumption levels, and that it does not differ between the income levels.
The alternative hypothesis is that the mean suicide rate differs significantly between alcohol consumption levels, and between income levels.
The results are displayed as follows:
                            OLS Regression Results
==============================================================================
Dep. Variable:        suicideper100th   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     2.875
Date:                Sun, 09 Aug 2020   Prob (F-statistic):             0.0914
Time:                        02:48:14   Log-Likelihood:                -703.84
No. Observations:                 213   AIC:                             1412.
Df Residuals:                     211   BIC:                             1418.
Df Model:                           1
Covariance Type:            nonrobust
The low F-statistic (2.875) and a p-value (0.0914) greater than the 0.05 significance level mean that we fail to reject the null hypothesis: the data show no significant difference in suicide rate between the income categories.
                            OLS Regression Results
==============================================================================
Dep. Variable:        suicideper100th   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.5930
Date:                Sun, 09 Aug 2020   Prob (F-statistic):              0.554
Time:                        02:48:14   Log-Likelihood:                -704.69
No. Observations:                 213   AIC:                             1415.
Df Residuals:                     210   BIC:                             1425.
Df Model:                           2
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              7.7779      0.942      8.254      0.000       5.920       9.635
C(Alcoholuse)[T.1]     0.9818      1.204      0.815      0.416      -1.392       3.355
C(Alcoholuse)[T.2]     1.2756      1.190      1.072      0.285      -1.071       3.622
The low F-statistic (0.5930) and a p-value (0.554) greater than the 0.05 significance level mean that we fail to reject the null hypothesis: the data show no significant difference in suicide rate between the alcohol consumption levels.
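The p-values in the two summaries above can be recovered from the F-statistic and the degrees of freedom, which makes the decision rule explicit. A minimal sketch, assuming SciPy is installed and using the rounded figures printed in the summaries. The helper name `f_test_p_value` is hypothetical, and the 0.05 level is the conventional choice rather than something fixed by the data.

```python
from scipy import stats


def f_test_p_value(f_stat, df_model, df_resid):
    """Upper-tail probability of the F distribution at f_stat."""
    return stats.f.sf(f_stat, df_model, df_resid)


# F-statistic, Df Model, and Df Residuals as printed in the two summaries.
p_income = f_test_p_value(2.875, 1, 211)
p_alcohol = f_test_p_value(0.5930, 2, 210)

alpha = 0.05  # conventional level (assumed); both p-values also exceed 0.025
for name, p in [("income", p_income), ("alcohol use", p_alcohol)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

Because the inputs are rounded, the recomputed values agree with the printed `Prob (F-statistic)` entries only to about three decimal places.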
A further analysis, the post hoc test, was also conducted. It compares the individual pairs of category levels without inflating the cumulative (family-wise) Type I error rate. We use Tukey's Honestly Significant Difference (HSD) test for the post hoc comparison, and it agrees that the suicide rate does not differ significantly between the alcohol usage levels.
means for alcohol use by suicide status
            suicideper100th
Alcoholuse
0                  7.777891
1                  8.759692
2                  9.053453
standard deviations for alcohol use by suicide
            suicideper100th
Alcoholuse
0                  6.086994
1                  5.809631
2                  7.663338
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     0      1   0.9818  0.678 -1.8605 3.8241  False
     0      2   1.2756 0.5313 -1.5338 4.0849  False
     1      2   0.2938    0.9 -2.1713 2.7588  False
---------------------------------------------------
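As a quick consistency check on the output above, the `meandiff` column of the Tukey table should equal the pairwise differences between the printed group means. A small sketch using only the numbers shown in this post (standard library only):

```python
from itertools import combinations

# Group means and Tukey mean differences copied from the output above.
group_means = {0: 7.777891, 1: 8.759692, 2: 9.053453}
tukey_meandiff = {(0, 1): 0.9818, (0, 2): 1.2756, (1, 2): 0.2938}

# Difference "group2 minus group1" for every pair, rounded like the table.
diffs = {
    (a, b): round(group_means[b] - group_means[a], 4)
    for a, b in combinations(sorted(group_means), 2)
}

for pair, diff in diffs.items():
    matches = abs(diff - tukey_meandiff[pair]) < 1e-9
    print(pair, diff, "matches table" if matches else "differs from table")
```

The differences against group 0 are also the `C(Alcoholuse)[T.1]` and `C(Alcoholuse)[T.2]` coefficients in the regression summary, and the intercept is the group 0 mean, since the model uses category 0 as the reference level.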