omarelfarouk90 · 5 years ago
The significance of alcohol consumption and average income on suicide rate
I performed some data analysis on data gathered by the Gapminder Foundation in Stockholm to support UN development work. I used this data to analyze the effect of alcohol consumption and average income on the suicide rate per 100,000 people. The analysis was conducted using the OLS regression function from the Python statsmodels library.
The average income input was divided into two categories: greater than or equal to $2000 USD per month, and less than $2000 USD per month.
The alcohol consumption input was divided into categories according to the volume consumed:
category 0: below 3 L per month
category 1: from 3 L to 9 L per month
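The script below reads `Income` and `Alcoholuse` as ready-made categorical columns and does not show how they were built. One way such columns could be derived is sketched here. The raw column names `income_usd` and `alcohol_l` and the sample values are hypothetical, the choice of which income level gets code 1 is an assumption, and so are the third alcohol bin (the regression output further down lists a level 2) and the treatment of exactly 9 L.

```python
import numpy as np
import pandas as pd

# Synthetic values for illustration only; not taken from the Gapminder file.
demo = pd.DataFrame({
    "income_usd": [500.0, 1999.0, 2000.0, 7500.0],
    "alcohol_l": [0.5, 3.0, 8.9, 12.0],
})

# Income: 1 if >= 2000, else 0 (which level gets which code is an assumption).
demo["Income"] = np.where(demo["income_usd"] >= 2000, 1, 0)

# Alcohol: 0 below 3 L, 1 from 3 L up to (but excluding) 9 L, 2 from 9 L up.
# The third bin and the handling of exactly 9 L are assumptions.
demo["Alcoholuse"] = pd.cut(
    demo["alcohol_l"],
    bins=[0, 3, 9, np.inf],
    labels=[0, 1, 2],
    right=False,  # left-closed bins: [0, 3), [3, 9), [9, inf)
).astype(int)

print(demo)
```

With `right=False` a value of exactly 3 L falls into category 1, which matches "from 3L" in the description above.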
The code was as follows:
"""
Created on Sat Aug  8 10:27:52 2020

@author: omar.elfarouk
"""
import numpy
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)

# Subset data to income per person, alcohol consumption, suicide rate and employment.
# Replace blank (whitespace-only) entries with zero; the anchored pattern
# leaves values that merely contain a space, such as country names, untouched.
sub1 = data.replace(r'^\s*$', 0, regex=True)

# Setting the variables you will be working with to numeric
#sub1['suicideper100th'] = sub1['suicideper100th'].replace(0, numpy.nan)
sub1['suicideper100th'] = pd.to_numeric(sub1['suicideper100th'])
#sub1['Income'] = pd.to_numeric(sub1['Income'])
ct1 = sub1.groupby('suicideper100th').size()
print(ct1)

# Using the ols function to calculate the F-statistic and associated p-value
model1 = smf.ols(formula='suicideper100th ~ C(Income)', data=sub1).fit()
results1 = model1
print(results1.summary())

sub2 = sub1[['suicideper100th', 'Income']].dropna()

print('means for income by suicide status')
m1 = sub2.groupby('Income').mean()
print(m1)

print('standard deviations for income suicide status')
sd1 = sub2.groupby('Income').std()
print(sd1)

# I will call it sub3
sub3 = sub1[['suicideper100th', 'Alcoholuse']].dropna()

model2 = smf.ols(formula='suicideper100th ~ C(Alcoholuse)', data=sub3).fit()
print(model2.summary())

print('means for alcohol use by suicide status')
m2 = sub3.groupby('Alcoholuse').mean()
print(m2)

print('standard deviations for alcohol use by suicide')
sd2 = sub3.groupby('Alcoholuse').std()
print(sd2)

# Tukey HSD comparison for the post hoc test
mc1 = multi.MultiComparison(sub3['suicideper100th'], sub3['Alcoholuse'])
res1 = mc1.tukeyhsd()
print(res1.summary())
The null hypothesis is that the suicide rate does not differ between alcohol consumption levels, and likewise does not differ between income levels.
The alternative hypothesis is that the suicide rate differs significantly between alcohol consumption levels and between income levels.
The results are as follows:
                            OLS Regression Results
==============================================================================
Dep. Variable:        suicideper100th   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     2.875
Date:                Sun, 09 Aug 2020   Prob (F-statistic):             0.0914
Time:                        02:48:14   Log-Likelihood:                -703.84
No. Observations:                 213   AIC:                             1412.
Df Residuals:                     211   BIC:                             1418.
Df Model:                           1
Covariance Type:            nonrobust
The low F-statistic and a p-value (0.0914) above the 0.05 significance level mean that we fail to reject the null hypothesis: the data show no significant difference in suicide rate between the income levels.
                            OLS Regression Results
==============================================================================
Dep. Variable:        suicideper100th   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.5930
Date:                Sun, 09 Aug 2020   Prob (F-statistic):              0.554
Time:                        02:48:14   Log-Likelihood:                -704.69
No. Observations:                 213   AIC:                             1415.
Df Residuals:                     210   BIC:                             1425.
Df Model:                           2
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              7.7779      0.942      8.254      0.000       5.920       9.635
C(Alcoholuse)[T.1]     0.9818      1.204      0.815      0.416      -1.392       3.355
C(Alcoholuse)[T.2]     1.2756      1.190      1.072      0.285      -1.071       3.622
The low F-statistic and a p-value (0.554) above the 0.05 significance level mean that we fail to reject the null hypothesis: the data show no significant difference in suicide rate between the alcohol consumption levels.
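As a quick cross-check, the Prob (F-statistic) entries in the two summaries can be recomputed from the printed F-statistics and degrees of freedom. This is a minimal sketch that assumes SciPy is installed (statsmodels depends on it); the variable names are illustrative and the inputs are the rounded values copied from the output above.

```python
from scipy import stats

# (F statistic, (model df, residual df)) as printed in the two summaries
income_f, income_df = 2.875, (1, 211)
alcohol_f, alcohol_df = 0.5930, (2, 210)

# The p-value is the upper tail of the F distribution beyond the observed F
p_income = stats.f.sf(income_f, *income_df)
p_alcohol = stats.f.sf(alcohol_f, *alcohol_df)

print(f"income:  p = {p_income:.4f}")
print(f"alcohol: p = {p_alcohol:.3f}")
```

Both values come out above 0.05, in line with the conclusions drawn from the summaries.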
A further analysis, the post hoc test, was also conducted. It compares the levels of a categorical variable pairwise without letting the Type I error accumulate across comparisons. We used Tukey's honestly significant difference (HSD) test for the post hoc comparison, and it agrees that the suicide rate does not differ between the alcohol usage levels.
means for alcohol use by suicide status
            suicideper100th
Alcoholuse
0                  7.777891
1                  8.759692
2                  9.053453
standard deviations for alcohol use by suicide
            suicideper100th
Alcoholuse
0                  6.086994
1                  5.809631
2                  7.663338
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     0      1   0.9818  0.678 -1.8605 3.8241  False
     0      2   1.2756 0.5313 -1.5338 4.0849  False
     1      2   0.2938    0.9 -2.1713 2.7588  False
---------------------------------------------------
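To help read the Tukey table: the meandiff column is the difference between the printed group means, and the first two differences are the same numbers as the C(Alcoholuse)[T.1] and C(Alcoholuse)[T.2] coefficients in the regression summary, since the intercept there is the mean of group 0. The short sketch below recomputes those differences from the printed means. It also shows why a correction is needed at all, by computing the familywise error rate three separate tests at 0.05 would have under the simplifying assumption (made only for illustration) that the tests are independent.

```python
from itertools import combinations

# Group means copied from the printed output
means = {0: 7.777891, 1: 8.759692, 2: 9.053453}
pairs = list(combinations(sorted(means), 2))

# Pairwise mean differences (the meandiff column of the Tukey table)
diffs = {(a, b): means[b] - means[a] for a, b in pairs}
for (a, b), d in diffs.items():
    print(f"group {a} vs group {b}: meandiff = {d:.4f}")

# Familywise error rate if each pair were tested separately at alpha = 0.05,
# assuming for illustration that the tests are independent
alpha = 0.05
fwer = 1 - (1 - alpha) ** len(pairs)
print(f"uncorrected familywise error rate for {len(pairs)} tests: {fwer:.3f}")
```

Tukey's procedure keeps the familywise rate at the 0.05 shown in the table header (FWER=0.05) instead of letting it grow with the number of comparisons.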