shskpadhy
Assignment
14 posts
Data Science Assignment
shskpadhy · 5 years ago
Assignment 03-04 (Testing a Logistic Regression Model)
Dataset : Gapminder
Variables
The following derived variables (obtained by categorizing the provided variables) are used:
lifgrps (response variable) : derived from lifeexpectancy by setting the value to 1 when lifeexpectancy is greater than or equal to 65, else 0.
urbgrps (primary explanatory variable) : derived from urbanrate; for countries with urbanrate above the mean (urb_mean), the value is 1, else 0.
alcgrps : derived from alcconsumption; for countries with alcconsumption above the mean (alc_mean), the value is 1, else 0.
incgrps : derived from incomeperperson; for countries with incomeperperson above the mean (inc_mean), the value is 1, else 0.
relgrps : derived from relectricperperson; for countries with relectricperperson above the mean (rel_mean), the value is 1, else 0.
Explanations of the original variables were provided in the previous post.
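As a sketch of the mean-split categorization described above (using a toy frame with made-up values, not the actual Gapminder rows), the binary groups can be derived with vectorized pandas comparisons:

```python
import pandas as pd

# Toy frame standing in for the Gapminder data; the column names match
# the dataset, but these values are invented for illustration.
df = pd.DataFrame({'lifeexpectancy': [50.0, 70.0, 65.0, 80.0],
                   'urbanrate': [20.0, 60.0, 40.0, 90.0]})

# lifgrps: 1 when life expectancy is at least 65, else 0
df['lifgrps'] = (df['lifeexpectancy'] >= 65).astype(int)

# urbgrps: 1 when urbanrate exceeds its mean (urb_mean), else 0
urb_mean = df['urbanrate'].mean()
df['urbgrps'] = (df['urbanrate'] > urb_mean).astype(int)
```

The same pattern applies to alcgrps, incgrps and relgrps, each split at its own mean.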
Research Question
H0 : There is no association between urbanrate and lifeexpectancy
H1 : Life expectancy increases with urbanrate
Here we test the research question as:
H1 : The number of countries with lifgrps = 1 is higher in the urbgrps = 1 category than in the urbgrps = 0 category.
H0 : No such association exists.
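The count comparison behind H1 can be made concrete with a crosstab before any model is fitted; the 0/1 labels below are hypothetical, chosen only to illustrate the shape of the table:

```python
import pandas as pd

# Hypothetical group labels for eight countries (not real Gapminder data)
df = pd.DataFrame({'lifgrps': [0, 0, 1, 1, 1, 0, 1, 1],
                   'urbgrps': [0, 0, 0, 1, 1, 1, 1, 1]})

# Rows: urbgrps category; columns: lifgrps outcome
table = pd.crosstab(df['urbgrps'], df['lifgrps'])
# H1 predicts the (urbgrps=1, lifgrps=1) cell to exceed the
# (urbgrps=0, lifgrps=1) cell, as it does in this toy table.
```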
Output
Rows
213
columns
16
===================================
Logistic Regression Modelling
===================================
lreg1 : lifgrps ~ urbgrps
Optimization terminated successfully.
        Current function value: 0.591261
        Iterations 5
                          Logit Regression Results
==============================================================================
Dep. Variable:                lifgrps   No. Observations:                  213
Model:                          Logit   Df Residuals:                      211
Method:                           MLE   Df Model:                            1
Date:                Fri, 24 Jul 2020   Pseudo R-squ.:                 0.07575
Time:                        23:20:52   Log-Likelihood:                -125.94
converged:                       True   LL-Null:                       -136.26
Covariance Type:            nonrobust   LLR p-value:                 5.534e-06
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.0202      0.201      0.101      0.920      -0.374       0.414
urbgrps[T.1L]     1.3552      0.308      4.400      0.000       0.751       1.959
=================================================================================
Odds Ratios
Intercept        1.020408
urbgrps[T.1L]    3.877391
dtype: float64
odd ratios with 95% confidence intervals
               Lower CI  Upper CI        OR
Intercept      0.688125  1.513145  1.020408
urbgrps[T.1L]  2.120087  7.091294  3.877391
--------------------------------
Optimization terminated successfully.
        Current function value: 0.625930
        Iterations 5
                          Logit Regression Results
==============================================================================
Dep. Variable:                lifgrps   No. Observations:                  213
Model:                          Logit   Df Residuals:                      211
Method:                           MLE   Df Model:                            1
Date:                Fri, 24 Jul 2020   Pseudo R-squ.:                 0.02155
Time:                        23:20:52   Log-Likelihood:                -133.32
converged:                       True   LL-Null:                       -136.26
Covariance Type:            nonrobust   LLR p-value:                   0.01537
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4055      0.179      2.265      0.024       0.055       0.756
alcgrps        0.7419      0.313      2.371      0.018       0.129       1.355
==============================================================================
alcgrps is significantly associated with lifgrps, though far more weakly than urbgrps
--------------------------------
Optimization terminated successfully.
        Current function value: 0.631670
        Iterations 5
                          Logit Regression Results
==============================================================================
Dep. Variable:                lifgrps   No. Observations:                  213
Model:                          Logit   Df Residuals:                      211
Method:                           MLE   Df Model:                            1
Date:                Fri, 24 Jul 2020   Pseudo R-squ.:                 0.01258
Time:                        23:20:53   Log-Likelihood:                -134.55
converged:                       True   LL-Null:                       -136.26
Covariance Type:            nonrobust   LLR p-value:                   0.06407
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.4841      0.175      2.772      0.006       0.142       0.826
incgrps[T.1L]     0.5788      0.318      1.819      0.069      -0.045       1.203
=================================================================================
Hence lifgrps is not associated with incgrps
--------------------------------
                          Logit Regression Results
==============================================================================
Dep. Variable:                lifgrps   No. Observations:                  213
Model:                          Logit   Df Residuals:                      211
Method:                           MLE   Df Model:                            1
Date:                Fri, 24 Jul 2020   Pseudo R-squ.:                 0.01258
Time:                        23:20:53   Log-Likelihood:                -134.55
converged:                       True   LL-Null:                       -136.26
Covariance Type:            nonrobust   LLR p-value:                   0.06407
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.4841      0.175      2.772      0.006       0.142       0.826
relgrps[T.1L]     0.5788      0.318      1.819      0.069      -0.045       1.203
=================================================================================
Hence lifgrps is not associated with relgrps
--------------------------------
Summary
The logistic regression model with lifgrps as the response variable and urbgrps as the explanatory variable shows that lifgrps is strongly associated with urbgrps. The following statistics were obtained from the summary:
p-value : less than 0.0001
odds ratio : 3.877391, with 95% confidence interval (2.120087, 7.091294)
The association of lifgrps with alcgrps, incgrps and relgrps was then tested individually. incgrps and relgrps showed no significant association (p > 0.05), while alcgrps was significant at the 0.05 level but with a far smaller pseudo R-squared than urbgrps.
Regarding the research question, the null hypothesis can be rejected, as the significant p-value provides enough evidence against it. Thus there is an association between lifgrps and urbgrps.
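The reported odds ratio is just the exponentiated logit coefficient. A quick check against the coefficient and confidence bounds copied from the summary table above:

```python
import math

# Coefficient and 95% CI bounds for urbgrps, taken from the summary output
coef, lo, hi = 1.3552, 0.751, 1.959

odds_ratio = math.exp(coef)        # about 3.88, matching the reported OR
ci = (math.exp(lo), math.exp(hi))  # about (2.12, 7.09), matching the reported CI

# Interpretation: countries in the high-urbanrate group have roughly
# 3.9 times the odds of a life expectancy of 65 or more.
```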
Finally The Code
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats

# importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# printing number of rows and columns
print('Rows')
print(len(data))
print('columns')
print(len(data.columns))

# ------- Variables under consideration ------ #
# alcconsumption
# urbanrate
# lifeexpectancy
# incomeperperson
# relectricperperson

# Setting values to numeric (convert_objects is deprecated; to_numeric replaces it)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
# note: this previously copied incomeperperson by mistake
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')

data2 = data

# Categorizing lifeexpectancy as lifgrps
def lifgrps(row):
    if row['lifeexpectancy'] >= 65:
        return 1
    else:
        return 0

data2['lifgrps'] = data2.apply(lambda row: lifgrps(row), axis=1)

# Logistic Regression Modelling
print('===================================')
print('Logistic Regression Modelling')
print('===================================')

# Categorizing urbanrate as urbgrps
urb_mean = data2['urbanrate'].mean()

def urbgrps(row):
    if row['urbanrate'] <= urb_mean:
        return 0
    else:
        return 1

data2['urbgrps'] = data2.apply(lambda row: urbgrps(row), axis=1)
data2['urbgrps'] = data2['urbgrps'].astype('category')

print('lreg1 : lifgrps ~ urbgrps')
lreg1 = smf.logit(formula='lifgrps ~ urbgrps', data=data2).fit()
print(lreg1.summary())

# odds ratios
print('Odds Ratios')
print(numpy.exp(lreg1.params))

# odds ratios with 95% confidence intervals
print('odd ratios with 95% confidence intervals')
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

print('--------------------------------')

# Categorizing alcconsumption as alcgrps
alc_mean = data2['alcconsumption'].mean()

def alcgrps(row):
    if row['alcconsumption'] >= alc_mean:
        return 1
    else:
        return 0

data2['alcgrps'] = data2.apply(lambda row: alcgrps(row), axis=1)

lreg2_1 = smf.logit(formula='lifgrps ~ alcgrps', data=data2).fit()
print(lreg2_1.summary())
print('alcgrps is significantly associated with lifgrps, though far more weakly than urbgrps')
print('--------------------------------')

# Categorizing incomeperperson as incgrps
inc_mean = data2['incomeperperson'].mean()

def incgrps(row):
    if row['incomeperperson'] <= inc_mean:
        return 0
    else:
        return 1

data2['incgrps'] = data2.apply(lambda row: incgrps(row), axis=1)
data2['incgrps'] = data2['incgrps'].astype('category')

lreg3_1 = smf.logit(formula='lifgrps ~ incgrps', data=data2).fit()
print(lreg3_1.summary())
print('Hence lifgrps is not associated with incgrps')
print('--------------------------------')

# Categorizing relectricperperson as relgrps
rel_mean = data2['relectricperperson'].mean()

def relgrps(row):
    if row['relectricperperson'] <= rel_mean:
        return 0
    else:
        return 1

data2['relgrps'] = data2.apply(lambda row: relgrps(row), axis=1)
data2['relgrps'] = data2['relgrps'].astype('category')

lreg4_1 = smf.logit(formula='lifgrps ~ relgrps', data=data2).fit()
print(lreg4_1.summary())
print('Hence lifgrps is not associated with relgrps')
print('--------------------------------')
Assignment 03_03 (Test A Multiple Regression Model)
Dataset 
GapMinder
Variables
·      alcconsumption (response variable) : per-capita (age 15+ years) alcohol consumption of a country
·      urbanrate (primary explanatory variable) : percentage of the country's population settled in urban areas
·      internetuserate : number of people per 100 who have access to the World Wide Web
·      incomeperperson : Gross Domestic Product per capita in constant 2000 US$
Primary Research Question
H0 : there is no association between alcconsumption and urbanrate
HA : alcconsumption increases with urbanrate
Summary
·      Considering the research question, we have enough evidence against H0 and hence conclude that alcconsumption increases with urbanrate. This can be seen from the summary of reg1 (the specification with urbanrate as the only explanatory variable and alcconsumption as the response variable), which has p-value 0.000171 and r² value 0.075. The regression equation is y = 6.8453 + 0.0591*urb_c (urb_c is urbanrate centered at its mean). The standardized-residuals plot suggests the model is acceptable, as fewer than 5% of residuals fall outside |y| = 2. Still, the r² value is quite low.
·      In reg2, int_c (internetuserate centered at its mean) was added, seeking a greater r². It was found that internetuserate confounded urbanrate, so urb_c was dropped. r² = 0.303 and p-value = 1.80e-14. The equation is y = 7.7316 + 0.1077*int_c - 0.0010*int_c**2 (note the negative quadratic coefficient in the summary). From the regression diagnostic plots it is also clear that the model is acceptable.
·      Taking the quest ahead, in reg3, inc_c (incomeperperson centered at its mean) was taken into account. Now int_c**2 was confounded by inc_c, hence we remove it. r² = 0.338 and p-value = 3.96e-16. Regression line: y = 6.7842 + 0.1484*int_c - 0.0002*inc_c (the inc_c coefficient is negative per the summary). Again reg3 can be accepted, as is evident from the diagnostic plots (<5% of standardized residuals fall beyond |y| = 2).
·      Hence the final model is reg3; note that it does not include the primary explanatory variable.
·      Considering the regression diagnostic plots:
o   QQ-plots for reg2 and reg3 are almost similar.
o   The QQ-plot for reg1 shows a bit more deviation, indicating larger errors.
o   Likewise, the standardized-residuals plots for reg2 and reg3 are similar, with only 3 (1.71%) observations falling beyond |y| = 2, against 4 (2.29%) for reg1. As both percentages are below 5%, the models are acceptable.
o   The leverage plots of reg3 show that the residuals decrease as inc_c increases and also as int_c increases. This means the residuals are not independent of the values of the explanatory variables; thus there must be another explanatory variable that is also associated with the response variable alcconsumption.
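The mean-centering used throughout (urb_c, int_c, inc_c) only shifts the intercept; it does not change the fitted slopes. A minimal numpy sketch on made-up data (not the Gapminder columns) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 90, 50)                  # stand-in explanatory variable
y = 2.0 + 0.05 * x + rng.normal(0, 0.5, 50)  # stand-in response

def ols_fit(x, y):
    # Simple one-predictor least squares: returns (intercept, slope)
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

b0, b1 = ols_fit(x, y)
x_c = x - x.mean()                           # mean-centering, as with urb_c
b0_c, b1_c = ols_fit(x_c, y)

# The slope is unchanged; the new intercept is the predicted response
# at the mean of x, i.e. simply mean(y).
```

This is why the centered intercepts in the summaries (6.8453, 7.7316, 6.7842) can be read directly as predicted alcconsumption at the mean of each explanatory variable.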
  Output
==================================
Multiple Regression Modelling
==================================
Starting with primary explanatory variable : urbanrate
                            OLS Regression Results
==============================================================================
Dep. Variable:         alcconsumption   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.070
Method:                 Least Squares   F-statistic:                     14.73
Date:                Mon, 20 Jul 2020   Prob (F-statistic):           0.000171
Time:                        17:50:32   Log-Likelihood:                -543.98
No. Observations:                 183   AIC:                             1092.
Df Residuals:                     181   BIC:                             1098.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      6.8453      0.353     19.409      0.000       6.149       7.541
urb_c          0.0591      0.015      3.838      0.000       0.029       0.089
==============================================================================
Omnibus:                       10.025   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.257
Skew:                           0.573   Prob(JB):                      0.00592
Kurtosis:                       3.178   Cond. No.                         23.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
Including internetuserate
==============================================================================
Dep. Variable:         alcconsumption   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     38.13
Date:                Mon, 20 Jul 2020   Prob (F-statistic):           1.80e-14
Time:                        17:52:17   Log-Likelihood:                -504.45
No. Observations:                 178   AIC:                             1015.
Df Residuals:                     175   BIC:                             1024.
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         7.7316      0.485     15.937      0.000       6.774       8.689
int_c             0.1077      0.013      8.497      0.000       0.083       0.133
I(int_c ** 2)    -0.0010      0.000     -2.116      0.036      -0.002   -6.72e-05
==============================================================================
Omnibus:                        6.860   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                7.862
Skew:                           0.301   Prob(JB):                       0.0196
Kurtosis:                       3.836   Cond. No.                     1.67e+03
==============================================================================
As the r-square value for int_c is greater than that for urb_c, we remove urb_c from our model
Current r-square : 0.303, p-value : 1.80e-14
------------------------------------------------------------------------------------
Considering incomeperperson
==============================================================================
Dep. Variable:         alcconsumption   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     43.90
Date:                Mon, 20 Jul 2020   Prob (F-statistic):           3.96e-16
Time:                        17:59:34   Log-Likelihood:                -491.08
No. Observations:                 175   AIC:                             988.2
Df Residuals:                     172   BIC:                             997.7
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      6.7842      0.311     21.815      0.000       6.170       7.398
int_c          0.1484      0.018      8.063      0.000       0.112       0.185
inc_c         -0.0002   5.02e-05     -3.614      0.000      -0.000   -8.23e-05
==============================================================================
Omnibus:                        5.758   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                6.258
Skew:                           0.273   Prob(JB):                       0.0438
Kurtosis:                       3.749   Cond. No.                     1.05e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+04. This might indicate that there are strong multicollinearity or other numerical problems.
As int_c**2 gets confounded by inc_c, we remove int_c**2
Current r-square : 0.338, p-value : 3.96e-16
----------------------------------------------------------
----------------------------------------------------------
============================================
Regression Diagnostic Plots
============================================
QQ-plots for all stages
[QQ-plots for reg1, reg2 and reg3]
-------------------------------------
Standard Residuals for all stages
reg1
[Standardized residuals plot for reg1]
4 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem
reg2
[Standardized residuals plot for reg2]
3 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem
reg3
[Standardized residuals plot for reg3]
3 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem
For reg2 and reg3, there is one extreme outlier each (falling above y=3)
-----------------------------------------------------
Leverage Plot
[Leverage plots for reg3: int_c and inc_c]
Finally The Code
# Assignment 03-03
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# printing number of rows and columns
print('Rows')
print(len(data))
print('columns')
print(len(data.columns))

# ------- Variables under consideration ------ #
# alcconsumption (response variable)
# urbanrate (primary explanatory variable)
# internetuserate
# incomeperperson

# Setting values to numeric (convert_objects is deprecated; to_numeric replaces it)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')

# Multiple Regression Modelling
print('==================================')
print('Multiple Regression Modelling')
print('==================================')

# urbanrate and alcconsumption
print('Starting with primary explanatory variable : urbanrate')
data['urb_c'] = data['urbanrate'] - data['urbanrate'].mean()
reg1 = smf.ols(formula='alcconsumption ~ urb_c', data=data).fit()
print(reg1.summary())
print('Current r-square : 0.075, p-value : 0.000171')
print('--------------------------------------')

# Including internetuserate (centered, with a quadratic term)
print('Including internetuserate')
data['int_c'] = data['internetuserate'] - data['internetuserate'].mean()
reg2 = smf.ols(formula='alcconsumption ~ int_c + I(int_c**2)', data=data).fit()
print(reg2.summary())
print('As the r-square value for int_c is greater than that for urb_c, we remove urb_c from our model')
print('Current r-square : 0.303, p-value : 1.80e-14')
print('------------------------------------------------------------------------------------')

# Considering incomeperperson
print('Considering incomeperperson')
data['inc_c'] = data['incomeperperson'] - data['incomeperperson'].mean()
reg3 = smf.ols(formula='alcconsumption ~ int_c + inc_c', data=data).fit()
print(reg3.summary())
print('As int_c**2 gets confounded by inc_c, we remove int_c**2')
print('Current r-square : 0.338, p-value : 3.96e-16')
print('----------------------------------------------------------')
print('----------------------------------------------------------')

# Regression Diagnostic Plots
print('============================================')
print('Regression Diagnostic Plots')
print('============================================')

print('QQ-plots for all stages')
err_qq_1 = sm.qqplot(reg1.resid, line='r')
err_qq_2 = sm.qqplot(reg2.resid, line='r')
err_qq_3 = sm.qqplot(reg3.resid, line='r')
print('-------------------------------------')

print('Standardized residuals for all stages')

print('reg1')
std_res_1 = pandas.DataFrame(reg1.resid_pearson)
plt.plot(std_res_1, 'o', ls='none')
plt.axhline(y=0, color='r')
plt.axhline(y=2, color='g')
plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print('4 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem')

print('reg2')
std_res_2 = pandas.DataFrame(reg2.resid_pearson)
plt.plot(std_res_2, 'o', ls='none')
plt.axhline(y=0, color='r')
plt.axhline(y=2, color='g')
plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print('3 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem')

print('reg3')
std_res_3 = pandas.DataFrame(reg3.resid_pearson)
plt.plot(std_res_3, 'o', ls='none')
plt.axhline(y=0, color='r')
plt.axhline(y=2, color='g')
plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print('3 residuals fall above y=2; less than 5% (since the total # of obs > 100), hence no problem')
print('For reg2 and reg3, there is one extreme outlier each (falling above y=3)')
print('-----------------------------------------------------')

# Leverage Plots for reg3
print('Leverage Plot')
err_lev_1 = plt.figure()
err_lev_1 = sm.graphics.plot_regress_exog(reg3, 'int_c', fig=err_lev_1)

err_lev_2 = plt.figure()
err_lev_2 = sm.graphics.plot_regress_exog(reg3, 'inc_c', fig=err_lev_2)
Assignment 03-02 (Testing A Basic Linear Regression Model)
Dataset : GapMinder
Variables
urbanrate (primary explanatory variable) : centered at 0 and stored in urb_c; this urb_c is used to fit the linear regression model
lifeexpectancy (response variable)
Summary
slope : 0.2628
intercept : 69.6752
Hence y = 69.6752 + 0.2628x, where x = urbanrate - 56.7693596059 (the mean urban rate) and y = the expected life expectancy.
Value of r-squared = 0.375, F-statistic = 104.6, p-value = 1.61e-19 and correlation coefficient = 0.6127112161764898.
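As a quick sanity check of these figures (the 70%-urban country below is hypothetical), the fitted line can produce a point prediction, and r² can be recomputed from the correlation coefficient:

```python
# Numbers copied from the regression summary and correlation test above
mean_urb = 56.7693596059
intercept, slope = 69.6752, 0.2628
r = 0.6127112161764898

def predict_life_expectancy(urbanrate):
    # Prediction from the centered line: y = intercept + slope * (x - mean)
    return intercept + slope * (urbanrate - mean_urb)

pred = predict_life_expectancy(70.0)  # a hypothetical country, 70% urban
r_squared = r ** 2                    # should match the reported R-squared of 0.375
```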
Output
Rows
213
columns
16
==================================
Regression Modelling
==================================
Centring explanatory variable urbanrate and storing it into variable urb_c
mean of urbanrate is
56.7693596059
Describing urb_c
count   203.000000
mean      0.000000
std      23.844933
min     -46.369360
25%     -19.939360
50%       1.170640
75%      17.440640
max      43.230640
Name: urb_c, dtype: float64
Finally the regression model
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.375
Model:                            OLS   Adj. R-squared:                  0.372
Method:                 Least Squares   F-statistic:                     104.6
Date:                Fri, 17 Jul 2020   Prob (F-statistic):           1.61e-19
Time:                        17:46:01   Log-Likelihood:                -610.02
No. Observations:                 176   AIC:                             1224.
Df Residuals:                     174   BIC:                             1230.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     69.6752      0.589    118.201      0.000      68.512      70.839
urb_c          0.2628      0.026     10.227      0.000       0.212       0.313
==============================================================================
Omnibus:                       11.099   Durbin-Watson:                   1.866
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               12.104
Skew:                          -0.637   Prob(JB):                      0.00235
Kurtosis:                       2.827   Cond. No.                         23.0
==============================================================================
Running Pearson's correlation test for the value of the correlation coefficient
(0.6127112161764898, 1.607809724025055e-19)
[Scatter plot of lifeexpectancy against urb_c with the fitted regression line]
Finally The Code
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 17 16:01:24 2020

@author: ASUS
"""

import numpy
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

# importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# printing number of rows and columns
print('Rows')
print(len(data))
print('columns')
print(len(data.columns))

# ------- Variables under consideration ------ #
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; to_numeric replaces it)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# Regression Modelling
print('==================================')
print('Regression Modelling')
print('==================================')

# centring explanatory variable urbanrate and storing it into variable urb_c
print('Centring explanatory variable urbanrate and storing it into variable urb_c')
urb_m = data['urbanrate'].mean()
print('mean of urbanrate is')
print(urb_m)

data['urb_c'] = data['urbanrate'] - urb_m

print('Describing urb_c')
print(data['urb_c'].describe())

data_c = data.dropna()

print('Finally the regression model')
reg1 = smf.ols(formula='lifeexpectancy ~ urb_c', data=data_c).fit()
print(reg1.summary())

scat3 = seaborn.regplot(x='urb_c', y='lifeexpectancy', fit_reg=True, data=data)
plt.xlabel('Urban Rate centered at 0')
plt.ylabel('Life-expectancy')

print("Running Pearson's correlation test for the value of the correlation coefficient")
print(scipy.stats.pearsonr(data_c['urb_c'], data_c['lifeexpectancy']))
Assignment 03-01 (Writing About Your Data)
Sample
The sample is taken from the GapMinder dataset.
As collective information is taken about each country (observation), i.e the individuals are the countries, it is an aggregate level analysis.
Data Collection Procedure
The goal of the GapMinder Foundation is to fight devastating ignorance with a fact-based world view that everyone can understand, using inferences from the collected data.
The dataset contains data on variables like life-expectancy, urban-rate, per-capita alcohol consumption (for people above 15 years), income per person, HIV rate, number of breast-cancer cases per 100 thousand women, etc. for all 192 UN members, with Serbia and Montenegro reported as a single aggregate. Additionally, it includes data for 24 other areas, for a total of 215 areas.
The data is collected by GapMinder from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, and the World Bank.
As the GapMinder Foundation has neither carried out a survey or experiment of its own nor observed the population directly, the study design generating the data is best described as data reporting.
The data on the different variables were taken from different sources at different times; there is no information on when the dataset as a whole was collected.
Clearly this is not experimental data, as no explanatory variable was manipulated; rather, it is observational data.
Variables
[Image: table describing the dataset variables]
shskpadhy · 5 years ago
Assignment 03-01 (Writing About Your Data)
Sample
The GapMinder dataset is used.
Data Collection Procedure
The dataset contains data on variables like life-expectancy, urban-rate, per-capita alcohol consumption (for people above 15 years), income per person, HIV rate, number of breast-cancer cases per 100 thousand women, etc. for all 192 UN members, with Serbia and Montenegro reported as a single aggregate. Additionally, it includes data for 24 other areas, for a total of 215 areas.
The data is collected by GapMinder from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, and the World Bank.
Clearly this is not experimental data, as no explanatory variable was manipulated; rather, it is observational data.
Variables
alcconsumption : the amount of pure alcohol consumed per person (age 15+), in litres per year.
shskpadhy · 5 years ago
Assignment 02_04 (Course-02, Week-04 : Testing A Potential Moderator)
Dataset : GapMinder
Variables
urbanrate (explanatory variable)
lifeexpectancy (response variable)
alcgrps : alcconsumption collapsed into 4 groups containing 1st, 2nd, 3rd and 4th quartiles 
Summary
[Image: table of per-group correlation coefficients and p-values]
The last row shows the correlation coefficient and p-value for the Pearson Correlation Coefficient Test where the whole dataset is considered.
The direction of the association does not appear to change with the moderation variable but the strength appears to change slightly.
Thus the moderator does not strongly alter the association between the two variables.
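The subgroup analysis described above (re-running the Pearson test within each level of the moderator) can be sketched with a single `groupby` loop instead of four copy-pasted blocks. A minimal sketch on synthetic data; the column names mirror the Gapminder frame, but the values are made up:

```python
import numpy as np
import pandas as pd
from scipy import stats

# synthetic stand-in for the Gapminder frame (column names assumed to match)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'urbanrate': rng.uniform(10, 100, 200),
    'alcconsumption': rng.uniform(0, 15, 200),
})
df['lifeexpectancy'] = 50 + 0.2 * df['urbanrate'] + rng.normal(0, 3, 200)

# collapse the moderator into quartile groups, then test within each group
df['alcgrps'] = pd.qcut(df['alcconsumption'], 4, labels=[1, 2, 3, 4])
for grp, sub in df.dropna().groupby('alcgrps', observed=True):
    r, p = stats.pearsonr(sub['urbanrate'], sub['lifeexpectancy'])
    print('alcgrps=%s: r=%.3f, p=%.3g' % (grp, r, p))
```

`pd.qcut` also removes the need to hard-code the quartile cut points taken from `describe()`.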
Output
Rows
213
columns
16
===============================================================
urbanrate vs lifeexpectancy without moderator
===============================================================
Overall analysis
[Image: scatterplot of urbanrate vs lifeexpectancy with fitted line]
association between urbanrate and lifeexpectancy
(0.6075222955616916, 1.2247333760171806e-05)
===============================================================
urbanrate vs lifeexpectancy with alcconsumption as moderator
===============================================================
alcgrps=1 : up to 25%ile from bottom
[Image: scatterplot of urbanrate vs lifeexpectancy for alcgrps=1]
(0.6235967761519358, 3.6639464319432514e-06) --------------------------------------------------------
alcgrps=2 : between 25%ile and 50%ile from bottom
[Image: scatterplot of urbanrate vs lifeexpectancy for alcgrps=2]
(0.45570230600060746, 0.0018802338429809984) ---------------------------------------
alcgrps=3 : between 50%ile and 75%ile from bottom
[Image: scatterplot of urbanrate vs lifeexpectancy for alcgrps=3]
(0.5786487797163785, 5.9669496046994585e-05) ---------------------------------------
alcgrps=4 : between 75%ile and 100%ile from bottom
[Image: scatterplot of urbanrate vs lifeexpectancy for alcgrps=4]
(0.6075222955616916, 1.2247333760171806e-05) ---------------------------------------
Finally The Code
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; use pandas.to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# urbanrate vs. lifeexpectancy
print ('===============================================================')
print ('urbanrate vs lifeexpectancy without moderator')
print ('===============================================================')

# fix: this previously read data_sub.dropna(), but data_sub is not defined until later
data_c = data.dropna()

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('Scatterplot for the Association Between Urban-rate and Life-expectancy')

print ('Overall analysis')
print ('association between urbanrate and lifeexpectancy')
print (scipy.stats.pearsonr(data_c['urbanrate'], data_c['lifeexpectancy']))

# urbanrate vs. lifeexpectancy with alcconsumption as moderator
print ('===============================================================')
print ('urbanrate vs lifeexpectancy with alcconsumption as moderator')
print ('===============================================================')

# collapsing alcconsumption into quartile groups (cut points from describe())
def alcgrps(row):
    if row['alcconsumption'] <= 2.625000:
        return 1
    elif row['alcconsumption'] <= 5.920000:
        return 2
    elif row['alcconsumption'] <= 9.925000:
        return 3
    else:
        return 4

data['alcgrps'] = data.apply(lambda row: alcgrps(row), axis=1)

# alcgrps = 1
print ('alcgrps=1 : up to 25%ile from bottom')
data_sub = data[(data['alcgrps'] == 1)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=1')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('--------------------------------------------------------')

# alcgrps = 2 (fix: the label previously said alcgrps=1)
print ('alcgrps=2 : between 25%ile and 50%ile from bottom')
data_sub = data[(data['alcgrps'] == 2)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=2')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')

# alcgrps = 3 (fix: the label previously said alcgrps=1)
print ('alcgrps=3 : between 50%ile and 75%ile from bottom')
data_sub = data[(data['alcgrps'] == 3)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=3')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')

# alcgrps = 4 (fix: the label previously said alcgrps=1)
print ('alcgrps=4 : between 75%ile and 100%ile from bottom')
data_sub = data[(data['alcgrps'] == 4)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=4')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')
shskpadhy · 5 years ago
Assignment 02_03 (Course-02, Week-03)
Generating Correlation Coefficient
Dataset : GapMinder
Variables : 
urbanrate : % of population of the country living in urban areas
alcconsumption : per capita (age : 15+) alcohol consumption in a year
lifeexpectancy : the average number of years a newborn would be expected to live under current mortality conditions
Research Question (Hypothesis)
Research Question 1
H0 : urbanrate and alcconsumption are independent of each other
HA : alcconsumption increases with urbanrate
Research Question 2
H0 : urbanrate and lifeexpectancy are independent of each other
HA : lifeexpectancy increases with urbanrate
Summary
Research Question 1
Value of correlation coefficient = 0.27446605904089333, thus there exists a positive correlation between urbanrate and alcconsumption, i.e. alcconsumption increases with urbanrate.
p-value = 0.00022753282212695448, which is less than 0.05
From points 1 and 2 above, it is concluded that H0 (the null hypothesis) is rejected in favour of HA (the alternative hypothesis)
However, knowing the value of urbanrate accounts for only 7.5331% of the variability in alcconsumption.
Research Question 2
Value of correlation coefficient = 0.6127112161764898, thus there exists a positive correlation between urbanrate and lifeexpectancy, i.e. lifeexpectancy increases with urbanrate.
p-value = 1.607809724025055e-19, which is very much less than 0.05
From points 1 and 2 above, it is concluded that H0 (the null hypothesis) is rejected in favour of HA (the alternative hypothesis)
However, knowing the value of urbanrate accounts for only 37.5415% of the variability in lifeexpectancy.
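The "variability explained" figures quoted above are simply the square of the Pearson correlation coefficient (r²). A quick self-contained sketch with synthetic data (the names and coefficients below are made up for illustration):

```python
import numpy as np
from scipy import stats

# synthetic data standing in for urbanrate and lifeexpectancy
rng = np.random.default_rng(42)
urbanrate = rng.uniform(10, 100, 150)
lifeexpectancy = 50 + 0.2 * urbanrate + rng.normal(0, 5, 150)

r, p = stats.pearsonr(urbanrate, lifeexpectancy)
r_squared = r ** 2  # fraction of variability in the response "explained"
print('r=%.4f  p=%.3g  r^2=%.4f' % (r, p, r_squared))
```

So for research question 2, 0.6127² ≈ 0.3754, which is where the 37.5415% comes from.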
Output
Rows
213
columns
16
===================================
urbanrate vs alcconsumption
===================================
Descriptive analysis
[Image: scatterplot of urbanrate vs alcconsumption with fitted line]
----------------------------
Inferential analysis
association between urbanrate and alcconsumption
(0.27446605904089333, 0.00022753282212695448)
===================================
urbanrate vs lifeexpectancy
===================================
Descriptive analysis
[Image: scatterplot of urbanrate vs lifeexpectancy with fitted line]
----------------------------
Inferential analysis
association between urbanrate and lifeexpectancy
(0.6127112161764898, 1.607809724025055e-19)
Finally The Code
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; use pandas.to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

data_clean = data.dropna()

# urbanrate vs. alcconsumption
print ('===================================')
print ('urbanrate vs alcconsumption')
print ('===================================')

print ('Descriptive analysis')
scat1 = seaborn.regplot(x="urbanrate", y="alcconsumption", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Per Capita Alcohol Consumption')
plt.title('Scatterplot for the Association Between Urban Rate and Alcohol Consumption')

print ('----------------------------')
print ('Inferential analysis')
print ('association between urbanrate and alcconsumption')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['alcconsumption']))

# urbanrate vs. lifeexpectancy
print ('===================================')
print ('urbanrate vs lifeexpectancy')
print ('===================================')

print ('Descriptive analysis')
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('Scatterplot for the Association Between Urban-rate and Life-expectancy')

print ('----------------------------')
print ('Inferential analysis')
print ('association between urbanrate and lifeexpectancy')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
shskpadhy · 5 years ago
Assignment-02_02 (Course-02, Module-02)
Running Chi-Square Independence Test
Dataset : GapMinder
Variables
urbanrate (urbgrps after collapsing into categories)
lifeexpectancy ( lifgrps after collapsing into categories)
urbgrps : urbanrate is collapsed into categories of width 20%, hence having 5 categories (1 for 0%-20%, 2 for 20% to 40%, .... and 5 for 80% to 100%)
lifgrps : a two-valued variable, equal to 1 if the country has a life-expectancy greater than or equal to 65 years, and 0 otherwise.
Research Question : Hypothesis
H0 : proportion of countries in each category of urbgrps having life-expectancy of 65+ years is equal
HA : The proportions are unequal for at least two groups
Summary
The p-value for the Chi-square test is 0.000008, which is significantly less than 0.05. Hence we can reject the null hypothesis.
The following table shows the p-values of chi-square tests for each pair of categories.
[Image: table of pairwise chi-square p-values]
There are 10 comparisons, hence our threshold p-value must be 0.05/10 = 0.005
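The Bonferroni arithmetic and the pairwise loop can be sketched generically; everything below is synthetic stand-in data (not the Gapminder counts), and the column names simply mirror the ones used in this post:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import scipy.stats

# synthetic stand-in: lifgrps (0/1) against urbgrps (categories 1..5)
rng = np.random.default_rng(1)
urbgrps = rng.integers(1, 6, 300)
lifgrps = (rng.random(300) < urbgrps / 6).astype(int)  # higher groups -> more 1s
df = pd.DataFrame({'urbgrps': urbgrps, 'lifgrps': lifgrps})

pairs = list(combinations(sorted(df['urbgrps'].unique()), 2))
threshold = 0.05 / len(pairs)  # Bonferroni adjustment: 0.05 / 10 = 0.005 here

for a, b in pairs:
    sub = df[df['urbgrps'].isin([a, b])]
    ct = pd.crosstab(sub['lifgrps'], sub['urbgrps'])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print('%d vs %d: chi2=%.3f, p=%.4f%s' % (a, b, chi2, p,
                                             ' *' if p < threshold else ''))
```

With 5 categories there are C(5,2) = 10 pairs, so the per-comparison threshold is 0.05/10 = 0.005, matching the value used above.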
It is clear that category-4 (countries with urban-rate between 60% and 80%) differs significantly from categories 1, 2 and 3 in the proportion of countries with life-expectancy of 65+ years.
Output
Rows
213
columns
16
=====================================
Descriptive Statistical Analysis
=====================================
[Image: bar chart of proportion of countries with life-expectancy 65+ by urbgrps]
=====================================
Inferential Statistical Analysis
=====================================
urbgrps  1   2   3   4   5
lifgrps
0        9  24  17   6  16
1        4  22  29  52  34
urbgrps        1        2        3        4        5
lifgrps
0       0.692308 0.521739 0.369565 0.103448 0.320000
1       0.307692 0.478261 0.630435 0.896552 0.680000
chi-square value, p value, expected counts
(28.770249676716467, 8.704036143643654e-06, 4, array([[ 4.3943662 , 15.54929577, 15.54929577, 19.6056338 , 16.90140845],
       [ 8.6056338 , 30.45070423, 30.45070423, 38.3943662 , 33.09859155]]))
=============================
Post-hoc test
New threshold for p-value :
0.005
1 vs 2
comp     1   2
lifgrps
0        9  24
1        4  22
comp           1        2
lifgrps
0       0.692308 0.521739
1       0.307692 0.478261
chi-square value, p value, expected counts
(0.6044210109845561, 0.43689616446201507, 1, array([[ 7.27118644, 25.72881356],
       [ 5.72881356, 20.27118644]]))
1 vs 3
comp     1   3
lifgrps
0        9  17
1        4  29
comp           1        3
lifgrps
0       0.692308 0.369565
1       0.307692 0.630435
chi-square value, p value, expected counts
(3.073965958790373, 0.07955517016836518, 1, array([[ 5.72881356, 20.27118644],
       [ 7.27118644, 25.72881356]]))
1 vs 4
comp     1   4
lifgrps
0        9   6
1        4  52
comp           1        4
lifgrps
0       0.692308 0.103448
1       0.307692 0.896552
chi-square value, p value, expected counts
(18.70646985916383, 1.524642950087753e-05, 1, array([[ 2.74647887, 12.25352113],
       [10.25352113, 45.74647887]]))
1 vs 5
comp     1   5
lifgrps
0        9  16
1        4  34
comp           1        5
lifgrps
0       0.692308 0.320000
1       0.307692 0.680000
chi-square value, p value, expected counts
(4.520721862348178, 0.03348669970950359, 1, array([[ 5.15873016, 19.84126984],
       [ 7.84126984, 30.15873016]]))
2 vs 3
comp      2   3
lifgrps
0        24  17
1        22  29
comp           2        3
lifgrps
0       0.521739 0.369565
1       0.478261 0.630435
chi-square value, p value, expected counts
(1.5839311334289814, 0.20819535296566563, 1, array([[20.5, 20.5],
       [25.5, 25.5]]))
2 vs 4
comp      2   4
lifgrps
0        24   6
1        22  52
comp           2        4
lifgrps
0       0.521739 0.103448
1       0.478261 0.896552
chi-square value, p value, expected counts
(19.878233856044954, 8.253469598112272e-06, 1, array([[13.26923077, 16.73076923],
       [32.73076923, 41.26923077]]))
2 vs 5
comp      2   5
lifgrps
0        24  16
1        22  34
comp           2        5
lifgrps
0       0.521739 0.320000
1       0.478261 0.680000
chi-square value, p value, expected counts
(3.2246459627329176, 0.07253748022033618, 1, array([[19.16666667, 20.83333333],
       [26.83333333, 29.16666667]]))
3 vs 4
comp      3   4
lifgrps
0        17   6
1        29  52
comp           3        4
lifgrps
0       0.369565 0.103448
1       0.630435 0.896552
chi-square value, p value, expected counts
(9.059129050611572, 0.0026138645525417546, 1, array([[10.17307692, 12.82692308],
       [35.82692308, 45.17307692]]))
3 vs 5
comp      3   5
lifgrps
0        17  16
1        29  34
comp           3        5
lifgrps
0       0.369565 0.320000
1       0.630435 0.680000
chi-square value, p value, expected counts
(0.08745341614906833, 0.7674399219661803, 1, array([[15.8125, 17.1875],
       [30.1875, 32.8125]]))
4 vs 5
comp      4   5
lifgrps
0         6  16
1        52  34
comp           4        5
lifgrps
0       0.103448 0.320000
1       0.896552 0.680000
chi-square value, p value, expected counts
(6.485275205948824, 0.01087716976280061, 1, array([[11.81481481, 10.18518519],
       [46.18518519, 39.81481481]]))
Finally The Code
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats
from itertools import combinations

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; use pandas.to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

# fix: data2 was used below without ever being defined
data2 = data.copy()

# Categorizing urbanrate as urbgrps
def urbgrps(row):
    if row['urbanrate'] <= 20:
        return 1
    elif row['urbanrate'] <= 40:
        return 2
    elif row['urbanrate'] <= 60:
        return 3
    elif row['urbanrate'] <= 80:
        return 4
    else:
        return 5

data2['urbgrps'] = data2.apply(lambda row: urbgrps(row), axis=1)
data2['urbgrps'] = data2['urbgrps'].astype('category')

# Categorizing lifeexpectancy as lifgrps
def lifgrps(row):
    if row['lifeexpectancy'] >= 65:
        return 1
    else:
        return 0

data2['lifgrps'] = data2.apply(lambda row: lifgrps(row), axis=1)

# Descriptive Analysis
print ('=====================================')
print ('Descriptive Statistical Analysis')
print ('=====================================')
seaborn.factorplot(x='urbgrps', y='lifgrps', data=data2, kind="bar", ci=None)
plt.xlabel('Urban rate')
plt.ylabel('Proportion of countries in group with life-expectancy of 65+ years')

# Inferential Statistics
print ('=====================================')
print ('Inferential Statistical Analysis')
print ('=====================================')

# contingency table of observed counts
ct1 = pandas.crosstab(data2['lifgrps'], data2['urbgrps'])
print (ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print (colpct)

# chi-square value, p value, expected counts
print ('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print (cs1)

# Post-hoc Test: pairwise chi-square tests with a Bonferroni-adjusted threshold
print ('=============================')
print ('Post-hoc test')
print ('New threshold for p-value : ')
print (0.05 / 10)

# the ten near-identical pairwise blocks collapse into one loop
for a, b in combinations([1, 2, 3, 4, 5], 2):
    print ('%d vs %d' % (a, b))
    data2['comp'] = data2['urbgrps'].map({a: a, b: b})
    ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
    print (ct)
    colsum = ct.sum(axis=0)
    colpct = ct / colsum
    print (colpct)
    print ('chi-square value, p value, expected counts')
    cs = scipy.stats.chi2_contingency(ct)
    print (cs)
shskpadhy · 5 years ago
Assignment-2.1 (Course-02 : Week-01)
Course – 02, Week - 01
Assignment – 2_1
 Dataset : Gapminder
Variables under consideration :
urbanrate (explanatory variable for both research questions) : collapsed into groups of [0-10], [10-20], … , [80-90] and [90-100]
alcconsumption (response variable for the 1st research question)
lifeexpectancy (response variable for the 2nd research question)
Redefining Research Questions and Related Hypothesis
Research Question – 01
The per capita alcohol consumption of a country depends on its urban rate.
H0 : For all groups of the urban-rate, the group-wise mean per-capita alcohol consumptions are equal
HA : The group-wise mean per-capita alcohol consumptions are not all equal
Research Question – 02
The average life-expectancy of new born baby in a country depends on its urban rate.
H0 : For all groups of the urban-rate, the group-wise mean life expectancies are equal
HA : The group-wise mean life expectancies are not all equal
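The ANOVA F-test logic behind these hypotheses can be sketched in a few lines with `scipy.stats.f_oneway`, which tests the same H0 (all group means equal) that `smf.ols` with `C(urbgrps)` tests below. The groups here are synthetic stand-ins, not the Gapminder bins:

```python
import numpy as np
from scipy import stats

# three synthetic urban-rate groups with different mean alcohol consumption
rng = np.random.default_rng(7)
low = rng.normal(4.0, 1.5, 40)
mid = rng.normal(6.0, 1.5, 40)
high = rng.normal(8.0, 1.5, 40)

# one-way ANOVA: H0 says all group means are equal
f_stat, p = stats.f_oneway(low, mid, high)
print('F=%.2f  p=%.3g' % (f_stat, p))
```

A significant F only says *some* means differ; the Tukey HSD post-hoc test used later is what identifies *which* pairs differ.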
 Code
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 19 16:35:22 2020
 @author: ASUS
"""
 # Assignment_3
  import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
 #importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)
 #Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
 # bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
 #printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))
 #------- Variables under consideration------#
# alcconsumption
# urbanrate
#  lifeexpectancy
# Setting values to numeric (convert_objects is deprecated; use pandas.to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data2 = data
 # urbanrate
 data2['urbgrps']=pandas.cut(data2.urbanrate,[0,10,20,30,40,50,60,70,80,90,100])
data2["urbgrps"] = data2["urbgrps"].astype('category')
 # ------Analysis -------#
 #------Urbgrps vs. alcconsumption---------#
print ('Urbgrps vs. alcconsumption')
#-----Descriptive analysis----#
print ('Descriptive data analysis : ')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='alcconsumption', data=data2, kind="bar", ci=None)
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Bar chart for the Association Between Urban Rate and Per-capita Alcohol Consumption')
print ('Seems H0 is to be rejected')
#----Inferential statistics-----#
print ('ANOVA-F test')
sub1_1 = data2[['alcconsumption', 'urbgrps']].dropna()
model1 = smf.ols(formula='alcconsumption ~ C(urbgrps)', data=sub1_1).fit()
print (model1.summary())
m1_1= sub1_1.groupby('urbgrps').mean()
print (m1_1)
#-----Post hoc----#
print ('Post-hoc test')
post_h_1 = multi.MultiComparison(sub1_1['alcconsumption'], sub1_1['urbgrps'])
res1 = post_h_1.tukeyhsd()
print(res1.summary())
# ------- urbgrps vs. lifeexpectancy -------#
print ('urbgrps vs. lifeexpectancy')
print ('Descriptive statistical analysis')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='lifeexpectancy', data=data2, kind="bar", ci=None)
plt.ylabel('Life Expectancy')
plt.title('Bar chart for the Association Between Urban Rate and Life Expectancy')
#----Inferential statistics-----#
print ('ANOVA-F test')
sub1_2 = data2[['lifeexpectancy', 'urbgrps']].dropna()
model2 = smf.ols(formula='lifeexpectancy ~ C(urbgrps)', data=sub1_2).fit()
print (model2.summary())
m1_2= sub1_2.groupby('urbgrps').mean()
print (m1_2)
#-----Post hoc----#
print ('Post-hoc test')
post_h_2 = multi.MultiComparison(sub1_2['lifeexpectancy'], sub1_2['urbgrps'])
res2 = post_h_2.tukeyhsd()
print(res2.summary())
 Output
Rows
213
columns
16
Urbgrps vs. alcconsumption
Descriptive data analysis :
C->Q bar graph
[Image: bar chart of mean per-capita alcohol consumption by urban-rate group]
Seems H0 is to be rejected
ANOVA-F test
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         alcconsumption   R-squared:                       0.143
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     3.625
Date:               Tue, 07 Jul 2020   Prob (F-statistic):           0.000634
Time:                       15:10:23   Log-Likelihood:                -537.03
No. Observations:                 183   AIC:                             1092.
Df Residuals:                     174   BIC:                             1121.
Df Model:                           8                                        
Covariance Type:           nonrobust                                        
===================================================================================================================
                                                    coef    std err          t     P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept                                          5.8516      0.328     17.843      0.000       5.204       6.499
C(urbgrps)[T.Interval(10, 20, closed='right')]     -0.5483      1.249     -0.439      0.661      -3.014       1.917
C(urbgrps)[T.Interval(20, 30, closed='right')]     -1.9921     0.949     -2.100      0.037     -3.865      -0.119
C(urbgrps)[T.Interval(30, 40, closed='right')]     -1.1990      0.930     -1.289      0.199      -3.035       0.637
C(urbgrps)[T.Interval(40, 50, closed='right')]      0.1864      0.990     0.188      0.851     -1.767       2.140
C(urbgrps)[T.Interval(50, 60, closed='right')]      1.6110      0.930     1.731      0.085      -0.225       3.447
C(urbgrps)[T.Interval(60, 70, closed='right')]      3.0216      0.819     3.691      0.000       1.406       4.637
C(urbgrps)[T.Interval(70, 80, closed='right')]      2.0307      0.949     2.140      0.034       0.158       3.903
C(urbgrps)[T.Interval(80, 90, closed='right')]      3.2649      0.990     3.299      0.001       1.312       5.218
C(urbgrps)[T.Interval(90, 100, closed='right')]    -0.5236     1.361     -0.385      0.701     -3.209       2.162
==============================================================================
Omnibus:                        8.172   Durbin-Watson:                   1.832
Prob(Omnibus):                  0.017   Jarque-Bera (JB):                8.039
Skew:                           0.500   Prob(JB):                       0.0180
Kurtosis:                       3.233   Cond. No.                     1.37e+16
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.1e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
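The singularity warning above is consistent with the empty (0, 10] bucket produced by pandas.cut (per the frequency distribution below, no country has an urban rate under 10.4). A minimal sketch, using made-up urban-rate values rather than the Gapminder data, of how an unused bin arises and how it could be dropped before model fitting:

```python
import pandas as pd

# Made-up urban-rate values; none falls in (0, 10], mimicking the Gapminder case
urbanrate = pd.Series([15.0, 25.0, 65.0, 95.0])
urbgrps = pd.cut(urbanrate, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(len(urbgrps.cat.categories))   # 10: all bins defined, including the empty (0, 10]

# Dropping unused categories avoids all-zero dummy columns in the design matrix
urbgrps = urbgrps.cat.remove_unused_categories()
print(len(urbgrps.cat.categories))   # 4: only the occupied bins remain
```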
          alcconsumption
urbgrps                
(0, 10]               nan
(10, 20]         5.303333
(20, 30]         3.859545
(30, 40]         4.652609
(40, 50]         6.038000
(50, 60]         7.462609
(60, 70]         8.873226
(70, 80]         7.882273
(80, 90]         9.116500
(90, 100]       5.328000
Post-hoc test
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
========================================================
group1    group2 meandiff p-adj   lower  upper reject
--------------------------------------------------------
(10, 20]  (20, 30]  -1.4438   0.9 -6.7067 3.8191  False
(10, 20]  (30, 40]  -0.6507   0.9 -5.8731 4.5716  False
(10, 20]  (40, 50]   0.7347   0.9 -4.6203 6.0896  False
(10, 20]  (50, 60]   2.1593   0.9 -3.0631 7.3816  False
(10, 20]  (60, 70]   3.5699 0.381 -1.4161 8.5559  False
(10, 20]  (70, 80]   2.5789 0.8146  -2.684 7.8418 False
(10, 20]  (80, 90]   3.8132 0.389 -1.5418 9.1681  False
(10, 20] (90, 100]   0.0247    0.9 -6.2546 6.3039  False
(20, 30]  (30, 40]   0.7931   0.9 -3.5803 5.1665  False
(20, 30]  (40, 50]   2.1785 0.8319 -2.3525 6.7094  False
(20, 30]  (50, 60]   3.6031 0.1993 -0.7703 7.9765  False
(20, 30]  (60, 70]   5.0137 0.0051  0.9255 9.1019   True
(20, 30]  (70, 80]   4.0227 0.1067  -0.399 8.4445 False
(20, 30]  (80, 90]    5.257 0.0104   0.726 9.7879   True
(20, 30] (90, 100]   1.4685    0.9 -4.1246 7.0615  False
(30, 40]  (40, 50]   1.3854   0.9 -3.0984 5.8692  False
(30, 40]  (50, 60]     2.81 0.5145 -1.5145 7.1345  False
(30, 40]  (60, 70]   4.2206 0.0329  0.1847 8.2565   True
(30, 40]  (70, 80]   3.2297 0.3364 -1.1437 7.6031  False
(30, 40]  (80, 90]   4.4639 0.052 -0.0199 8.9477  False
(30, 40] (90, 100]   0.6754    0.9 -4.8796 6.2304  False
(40, 50]  (50, 60]   1.4246   0.9 -3.0592 5.9084  False
(40, 50]  (60, 70]   2.8352 0.4672 -1.3709 7.0413  False
(40, 50]  (70, 80]   1.8443   0.9 -2.6866 6.3752  False
(40, 50]  (80, 90]   3.0785 0.4877  -1.559 7.716  False
(40, 50] (90, 100]   -0.71    0.9 -6.3898 4.9698  False
(50, 60]  (60, 70]   1.4106   0.9 -2.6253 5.4465  False
(50, 60]  (70, 80]   0.4197   0.9 -3.9537 4.7931  False
(50, 60]  (80, 90]   1.6539   0.9 -2.8299 6.1377  False
(50, 60] (90, 100] -2.1346    0.9 -7.6896 3.4204  False
(60, 70]  (70, 80]   -0.991   0.9 -5.0792 3.0973  False
(60, 70]  (80, 90]   0.2433   0.9 -3.9628 4.4494  False
(60, 70] (90, 100] -3.5452 0.4859 -8.8786 1.7881 False
(70, 80]  (80, 90]   1.2342   0.9 -3.2967 5.7651  False
(70, 80] (90, 100] -2.5543 0.8772 -8.1474 3.0388 False
(80, 90] (90, 100]  -3.7885 0.4813 -9.4683 1.8913  False
urbgrps vs. lifeexpectancy
Descriptive statistical analysis
C->Q bar graph
ANOVA-F test
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.406
Model:                            OLS   Adj. R-squared:                  0.380
Method:                 Least Squares   F-statistic:                     15.32
Date:               Tue, 07 Jul 2020   Prob (F-statistic):           4.76e-17
Time:                       15:25:26   Log-Likelihood:                -644.54
No. Observations:                 188   AIC:                             1307.
Df Residuals:                     179   BIC:                             1336.
Df Model:                           8                                        
Covariance Type:           nonrobust                                        
===================================================================================================================
                                                    coef    std err          t     P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept                                         62.4438      0.521    119.810     0.000      61.415      63.472
C(urbgrps)[T.Interval(10, 20, closed='right')]     -1.7713      2.041     -0.868      0.387      -5.800       2.257
C(urbgrps)[T.Interval(20, 30, closed='right')]      0.2834      1.548     0.183      0.855      -2.772       3.338
C(urbgrps)[T.Interval(30, 40, closed='right')]     -1.7116      1.548     -1.106      0.270      -4.767       1.343
C(urbgrps)[T.Interval(40, 50, closed='right')]      4.5547      1.615     2.820      0.005       1.367       7.742
C(urbgrps)[T.Interval(50, 60, closed='right')]      7.0269      1.518     4.629      0.000       4.031      10.022
C(urbgrps)[T.Interval(60, 70, closed='right')]     10.4459      1.283     8.140      0.000       7.914      12.978
C(urbgrps)[T.Interval(70, 80, closed='right')]     13.4776      1.548     8.706      0.000      10.423      16.533
C(urbgrps)[T.Interval(80, 90, closed='right')]     13.9092      1.694     8.212      0.000      10.567      17.252
C(urbgrps)[T.Interval(90, 100, closed='right')]    16.2290     1.841      8.816      0.000     12.597      19.861
==============================================================================
Omnibus:                        8.034   Durbin-Watson:                   1.953
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.379
Skew:                         -0.515   Prob(JB):                       0.0152
Kurtosis:                       2.902   Cond. No.                     7.24e+15
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.02e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
          lifeexpectancy
urbgrps                
(0, 10]               nan
(10, 20]       60.672500
(20, 30]       62.727136
(30, 40]       60.732182
(40, 50]       66.998450
(50, 60]       69.470652
(60, 70]       72.889647
(70, 80]       75.921364
(80, 90]       76.353000
(90, 100]       78.672733
Post-hoc test
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
=========================================================
group1    group2 meandiff p-adj   lower   upper reject
---------------------------------------------------------
(10, 20]  (20, 30]   2.0546   0.9   -6.56 10.6693  False
(10, 20]  (30, 40]   0.0597   0.9  -8.555  8.6743 False
(10, 20]  (40, 50]   6.3259 0.3698 -2.4394 15.0913  False
(10, 20]  (50, 60]   8.7982 0.0384  0.2499 17.3464   True
(10, 20]  (60, 70]  12.2171 0.001  4.1569 20.2774   True
(10, 20]  (70, 80]  15.2489 0.001  6.6342 23.8635   True
(10, 20]  (80, 90]  15.6805 0.001  6.7344 24.6266   True
(10, 20] (90, 100] 18.0002  0.001  8.7032 27.2973   True
(20, 30]  (30, 40]   -1.995   0.9 -9.2327  5.2428  False
(20, 30]  (40, 50]   4.2713 0.6536 -3.1452 11.6878  False
(20, 30]  (50, 60]   6.7435 0.0826 -0.4151 13.9022  False
(20, 30]  (60, 70]  10.1625 0.001  3.5944 16.7307   True
(20, 30]  (70, 80]  13.1942 0.001  5.9565  20.432   True
(20, 30]  (80, 90]  13.6259 0.001  5.9966 21.2551   True
(20, 30] (90, 100] 15.9456  0.001  7.9077 23.9835   True
(30, 40]  (40, 50]   6.2663 0.1729 -1.1502 13.6828  False
(30, 40]  (50, 60]   8.7385 0.0054  1.5798 15.8971   True
(30, 40]  (60, 70]  12.1575 0.001  5.5893 18.7256   True
(30, 40]  (70, 80]  15.1892 0.001  7.9514 22.4269   True
(30, 40]  (80, 90]  15.6208 0.001  7.9916 23.2501   True
(30, 40] (90, 100] 17.9406  0.001  9.9026 25.9785   True
(40, 50]  (50, 60]   2.4722   0.9 -4.8671  9.8115  False
(40, 50]  (60, 70]   5.8912 0.1431 -0.8734 12.6558  False
(40, 50]  (70, 80]   8.9229 0.0065  1.5064 16.3394   True
(40, 50]  (80, 90]   9.3546 0.0068  1.5555 17.1536   True
(40, 50] (90, 100] 11.6743  0.001   3.475 19.8735   True
(50, 60]  (60, 70]    3.419 0.7444 -3.0619  9.8999 False
(50, 60]  (70, 80]   6.4507 0.1144 -0.7079 13.6094  False
(50, 60]  (80, 90]   6.8823 0.1057 -0.6719 14.4366  False
(50, 60] (90, 100]   9.2021  0.011  1.2353 17.1689   True
(60, 70]  (70, 80]   3.0317 0.8683 -3.5364  9.5999 False
(60, 70]  (80, 90]   3.4634 0.8057 -3.5339 10.4606  False
(60, 70] (90, 100]   5.7831 0.2689 -1.6576 13.2238 False
(70, 80]  (80, 90]   0.4316   0.9 -7.1976  8.0609  False
(70, 80] (90, 100]   2.7514    0.9 -5.2866 10.7893  False
(80, 90] (90, 100]   2.3197    0.9 -6.0725 10.7119  False
---------------------------------------------------------
Result
Research Question 01
We see that the F-value of the ANOVA-F test is 3.625 and the corresponding p-value is 0.000634, which is less than 0.05. Thus the chance of wrongly rejecting the null hypothesis (H0) is very small. Hence we can reject the null hypothesis and accept the alternate hypothesis, i.e. we conclude that alcconsumption depends on urban-rate.
From the results of the post-hoc test, we see that the mean per-capita alcohol consumption is unequal for the following groups:
·       [20-30] and [60-70]
·       [30-40] and [60-70]
·       [20-30] and [80-90]
Research Question 02
We see that the F-value of the ANOVA-F test is 15.32 and the corresponding p-value is 4.76e-17, which is less than 0.05. Thus the chance of wrongly rejecting the null hypothesis (H0) is very small. Hence we can reject the null hypothesis and accept the alternate hypothesis, i.e. we conclude that life-expectancy depends on urban-rate.
From the results of the post-hoc test, we see that the mean life expectancy is unequal for the following groups:
·       [50-60] and [90-100]
·       [20-30] and [90-100]
·       [20-30] and [80-90]
·       And many more
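The ANOVA-F plus Tukey HSD workflow interpreted above can be sketched as follows. This is not the assignment's own script (which is not reproduced here); it is a minimal stand-alone illustration on synthetic data, reusing the assignment's column names only for familiarity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Synthetic stand-in for the Gapminder variables (made-up numbers, three buckets)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'urbgrps': np.repeat(['(10, 20]', '(50, 60]', '(90, 100]'], 30),
    'lifeexpectancy': np.concatenate([
        rng.normal(60, 3, 30),   # low urban rate: lower life expectancy
        rng.normal(70, 3, 30),
        rng.normal(78, 3, 30),
    ]),
})

# ANOVA-F test: OLS of the quantitative response on the categorical explanatory variable
model = smf.ols('lifeexpectancy ~ C(urbgrps)', data=df).fit()
print('F =', model.fvalue, ' p =', model.f_pvalue)

# Tukey HSD post-hoc test over all pairs of groups, as in the tables above
mc = multi.MultiComparison(df['lifeexpectancy'], df['urbgrps'])
print(mc.tukeyhsd().summary())
```

A significant F only says that at least one group mean differs; the Tukey table is what identifies which pairs differ, which is why the assignment reports both.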
0 notes
shskpadhy · 5 years ago
Text
Assignment_4
Program :
# Assignment_4

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration ------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric
# (convert_objects is deprecated in newer pandas; pandas.to_numeric(col, errors='coerce') is the replacement)
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data2 = data

# alcconsumption

# initial F.D.
print ('alcconsumption : alcohol consumption per adult (age 15+) in litres')
print ('Description and F.D. : histogram')
d1_1 = data['alcconsumption'].describe()
print (d1_1)
seaborn.distplot(data["alcconsumption"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Per Capita Alcohol Consumption in a year')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['alcgrps'] = pandas.cut(data2.alcconsumption, [0,2.5,5.0,7.5,10.0,12.5,15.0,17.5,20.0,22.5,25])
data2["alcgrps"] = data2["alcgrps"].astype('category')
d1_2 = data2['alcgrps'].describe()
print (d1_2)
seaborn.countplot(x="alcgrps", data=data2)
plt.xlabel('Countries')
plt.title('Per Capita Alcohol Consumption in a year')

# urbanrate

print ('---------------------------')
print ('urbanrate : Percentage of population living in urban areas')
print ('Description and F.D. : histogram')
d2_1 = data['urbanrate'].describe()
print (d2_1)
seaborn.distplot(data["urbanrate"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Urbanrate')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['urbgrps'] = pandas.cut(data2.urbanrate, [0,10,20,30,40,50,60,70,80,90,100])
data2["urbgrps"] = data2["urbgrps"].astype('category')
d2_2 = data2['urbgrps'].describe()
print (d2_2)
seaborn.countplot(x="urbgrps", data=data2)
plt.xlabel('Countries')
plt.title('Urbanrate')

# lifeexpectancy

print ('---------------------------')
print ('lifeexpectancy : years in avg a new born baby would live in current situation')
seaborn.distplot(data["lifeexpectancy"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Life expectancy')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['lifgrps'] = pandas.cut(data2.lifeexpectancy, [0,50,60,70,80,90,100])
data2["lifgrps"] = data2["lifgrps"].astype('category')
seaborn.countplot(x="lifgrps", data=data2)
plt.xlabel('Countries')
plt.title('Life Expectancy')

# ---- Plotting bi-variate graphs -------#

print ('----------------------')
print ('x=urbanrate, y=alcconsumption')
print ('Q->Q scatter chart')
scat1 = seaborn.regplot(x="urbanrate", y="alcconsumption", data=data2)
plt.xlabel('Urban Rate')
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Scatterplot for the Association Between Urban Rate and per-capita alcohol consumption')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='alcconsumption', data=data2, kind="bar", ci=None)
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Bar graph for the Association Between Urban Rate and per-capita alcohol consumption')

print ('x=urbanrate, y=lifeexpectancy')
print ('Q->Q scatter chart')
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", data=data2)
plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the Association Between Urban Rate and life expectancy')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='lifeexpectancy', data=data2, kind="bar", ci=None)
plt.ylabel('Life Expectancy')
plt.title('Bar graph for the Association Between Urban Rate and life expectancy')
Output and Description 
Univariate Graphs
Alcconsumption:
alcconsumption : alcohol consumption per adult (age 15+) in litres
Description and F.D. : histogram
count   187.000000
mean      6.689412
std       4.899617
min       0.030000
25%       2.625000
50%       5.920000
75%       9.925000
max      23.010000
Name: alcconsumption, dtype: float64
Description and F.D. after collapsing into groups : bargraph
count            187
unique             9
top       (0.0, 2.5]
freq              45
Name: alcgrps, dtype: object
Description:
Centre : 6.689412 (mean); the modal bucket is (0, 2.5].
Spread : 4.899617
As the graph shows, the number of countries decreases as per-capita alcohol consumption increases.
Urbanrate:
urbanrate : Percentage of population living in urban areas
Description and F.D. : histogram
count   203.000000
mean     56.769360
std      23.844933
min      10.400000
25%      36.830000
50%      57.940000
75%      74.210000
max     100.000000
Name: urbanrate, dtype: float64
Description and F.D. after collapsing into groups : bargraph
count          203
unique           9
top       (60, 70]
freq            34
Name: urbgrps, dtype: object
Description:
Centre : 56.769360 (mean); the modal bucket is (60, 70].
Spread : 23.844933
Countries seem to be roughly equally spread across the buckets below and above the mean urban rate.
Lifeexpectancy:
Description:
Skewed left
Centre : between 70 and 80
Bivariate Graphs
x=urbanrate, y=alcconsumption
Q->Q scatter chart
C->Q bar graph
Description:
It can be clearly seen that the per-capita alcohol consumption of a country increases with its urban rate.
Hence the dataset supports the hypothesis posed at the beginning of the course (assignment for module-1).
The hypothesis was as follows: the higher a country's urban rate, the higher its per-capita alcohol consumption.
Output:
x=urbanrate, y=lifeexpectancy
Q->Q scatter chart
C->Q bar graph
Description:
It is clear from the graphs that life expectancy increases with urban rate.
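The visual trend could also be quantified with a Pearson correlation, a step beyond this assignment. A hedged sketch on made-up data (the positive slope here is assumed, not taken from Gapminder):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in: urban rate with an assumed positive effect on life expectancy
rng = np.random.default_rng(1)
urbanrate = rng.uniform(10, 100, 200)
lifeexpectancy = 55 + 0.2 * urbanrate + rng.normal(0, 4, 200)

# r > 0 with small p indicates a positive linear association
r, p = pearsonr(urbanrate, lifeexpectancy)
print('r =', round(r, 2), ' p =', p)
```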
0 notes
shskpadhy · 5 years ago
Text
Assignment_3
Week 03
Making Data Management Decisions
Dataset : GapMinder
Variables chosen : ‘alcconsumption’, ‘urbanrate’ and ‘lifeexpectancy’
Program
# Assignment_3

import pandas
import numpy

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data2 = data

# alcconsumption

# initial F.D.
print ('alcconsumption : alcohol consumption per adult (age 15+) in litres')
c1_min = data['alcconsumption'].min()
c1_max = data['alcconsumption'].max()
print ('min and max value of alcconsumption : ')
print (c1_min)
print (c1_max)
# Step 1 : Setting aside missing data (not required)
# Step 2 : coding missing data (not required)
# Step 3 : creating secondary variables (not required)
# Step 4 : Grouping Values within individual variables
data2['alcgrps']=pandas.cut(data2.alcconsumption,[0,2.5,5.0,7.5,10.0,12.5,15.0,17.5,20.0,22.5, 25])
print ('F.D. of groups of values of variable alcconsumption :')
c1_grp = data2['alcgrps'].value_counts(sort=False, dropna=False)
print (c1_grp)
print ('Percentage (with NaN set aside) :')
p1_grp = data2['alcgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p1_grp)

# urbanrate

# Initial F.D.
print ('---------------------------')
print ('urbanrate : Percentage of population living in urban areas')
c2_min = data2['urbanrate'].min()
c2_max = data2['urbanrate'].max()
print ('min and max values : ')
print (c2_min)
print (c2_max)
data2['urbgrps']=pandas.cut(data2.urbanrate,[0,10,20,30,40,50,60,70,80,90,100])
c2_grp = data2['urbgrps'].value_counts(sort=False, dropna=False)
print ('F.D. of groups of values of variable urbanrate :')
print (c2_grp)
print ('Percentage (with NaN set aside) : ')
p2_grp = data2['urbgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p2_grp)

# lifeexpectancy

# Initial F.D.
print ('---------------------------')
print ('lifeexpectancy : years in avg a new born baby would live in current situation')
c3_min = data2['lifeexpectancy'].min()
c3_max = data2['lifeexpectancy'].max()
print ('min and max values : ')
print (c3_min)
print (c3_max)
data2['lifgrps']=pandas.cut(data2.lifeexpectancy,[0,50,60,70,80,90,100])
c3_grp = data2['lifgrps'].value_counts(sort=False, dropna=False)
print ('F.D. of groups of values of variable lifeexpectancy :')
print (c3_grp)
print ('Percentage (with NaN set aside) : ')
p3_grp = data2['lifgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p3_grp)
 Output
Rows
213
columns
16
alcconsumption : alcohol consumption per adult (age 15+) in litres
min and max value of alcconsumption :
0.03
23.01
F.D. of groups of values of variable alcconsumption :
(0.0, 2.5]      45
(2.5, 5.0]      36
(5.0, 7.5]      31
(7.5, 10.0]     30
(10.0, 12.5]    21
(12.5, 15.0]    13
(15.0, 17.5]     8
(17.5, 20.0]     2
(20.0, 22.5]     0
(22.5, 25.0]     1
NaN             26
Name: alcgrps, dtype: int64
Percentage (with NaN set aside) :
(0.0, 2.5]     0.240642
(2.5, 5.0]     0.192513
(5.0, 7.5]     0.165775
(7.5, 10.0]    0.160428
(10.0, 12.5]   0.112299
(12.5, 15.0]   0.069519
(15.0, 17.5]   0.042781
(17.5, 20.0]   0.010695
(20.0, 22.5]   0.000000
(22.5, 25.0]   0.005348
Name: alcgrps, dtype: float64
---------------------------
urbanrate : Percentage of population living in urban areas
min and max values :
10.4
100.0
F.D. of groups of values of variable urbanrate :
(0.0, 10.0]       0
(10.0, 20.0]     13
(20.0, 30.0]     22
(30.0, 40.0]     24
(40.0, 50.0]     22
(50.0, 60.0]     24
(60.0, 70.0]     34
(70.0, 80.0]     24
(80.0, 90.0]     21
(90.0, 100.0]    19
NaN              10
Name: urbgrps, dtype: int64
Percentage (with NaN set aside) :
(0, 10]     0.000000
(10, 20]    0.064039
(20, 30]    0.108374
(30, 40]    0.118227
(40, 50]    0.108374
(50, 60]    0.118227
(60, 70]    0.167488
(70, 80]    0.118227
(80, 90]    0.103448
(90, 100]   0.093596
Name: urbgrps, dtype: float64
---------------------------
lifeexpectancy : years in avg a new born baby would live in current situation
min and max values :
47.794
83.39399999999999
F.D. of groups of values of variable lifeexpectancy :
(0.0, 50.0]       9
(50.0, 60.0]     29
(60.0, 70.0]     38
(70.0, 80.0]     92
(80.0, 90.0]     23
(90.0, 100.0]     0
NaN              22
Name: lifgrps, dtype: int64
Percentage (with NaN set aside) :
(0, 50]     0.047120
(50, 60]    0.151832
(60, 70]    0.198953
(70, 80]    0.481675
(80, 90]    0.120419
(90, 100]   0.000000
Name: lifgrps, dtype: float64
Description
1.      The values of the variables were grouped and the F.D. (Frequency Distribution) of each is provided in the output.
2.      Frequency distributions:
a.      alcconsumption : There are 26 unknown (NaN) values. The number (distribution) of countries (individuals) decreases consistently as per-capita alcohol consumption increases.
b.      urbanrate : There are 10 missing (NaN) values. No country has an urban rate below 10%, and 9.3596% of countries have an urban rate greater than 90% and less than or equal to 100%. Unlike alcconsumption, the F.D. does not follow a consistently increasing or decreasing curve.
c.       lifeexpectancy : There are 22 missing values. About 4.7% of countries have a life expectancy of 50 years or less. The number of countries increases with life expectancy until the 70-80 year interval, after which it decreases.
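The "Percentage (with NaN set aside)" distinction in the output above comes from the dropna flag of value_counts. A minimal sketch on made-up values (one missing observation out of four):

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with one missing value
urbanrate = pd.Series([15.0, 25.0, 65.0, np.nan])
urbgrps = pd.cut(urbanrate, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

counts = urbgrps.value_counts(sort=False, dropna=False)                 # NaN bucket counted
props = urbgrps.value_counts(sort=False, dropna=True, normalize=True)   # NaN set aside
print(counts.sum())   # 4: all observations, missing included
print(props.sum())    # 1.0: fractions of the 3 non-missing observations
```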
0 notes
shskpadhy · 5 years ago
Text
Assignment_2
0 notes
shskpadhy · 5 years ago
Text
 Assignment_2 (Writing Your First Program)  
0 notes
shskpadhy · 5 years ago
Text
Assignment_1
Week 01
Choosing a Research Question
Dataset :
I have chosen the GapMinder dataset
After much research and juggling with different topics and their associations, I finally settled on the following topics and research question:
Topics:
1.      Alcohol consumption per adult (15+ years)
2.      Urban population percentage
Research Question:
Do the countries with more people living in the urban areas report more per capita alcohol consumption?
Hypothesis:
The higher the percentage of a country's population settled in urban areas, the higher the per-capita alcohol consumption in that country.
This hypothesis will be tested to determine whether it is true or false.
 Codebook:
It contains information on the two variables, alcconsumption and urbanrate.
Research Works Readings
1.      As per ‘Rural, Suburban, and Urban Variations in Alcohol Consumption in the United States: Findings From the National Epidemiologic Survey on Alcohol and Related Conditions’: Abstinence is particularly common in the rural South, whereas alcohol disorders and excessive drinking are more problematic in the urban and rural Midwest. Health policies and interventions should be further targeted toward those places with higher risks of problem drinking.
2.      As per the ‘Patterns and trends of alcohol consumption in rural and urban areas of China: findings from the China Kadoorie Biobank’: At baseline, 33% of men drank alcohol at least weekly (i.e., current regular), compared to only 2% of women. In men, current regular drinking was more common in urban (38%) than in rural (29%) areas at baseline. Among men, the proportion of current regular drinkers slightly decreased at resurvey (33% baseline vs. 29% resurvey), while the proportion of ex-regular drinkers slightly increased (4% vs. 6%), particularly among older men, with more than half of ex-regular drinkers stopping for health reasons. Among current regular drinkers, the proportion engaging in heavy episodic drinking (i.e., > 60 g/session) increased (30% baseline vs. 35% resurvey) in both rural (29% vs. 33%) and urban (31% vs. 36%) areas, particularly among younger men born in the 1970s (41% vs. 47%). Alcohol intake involved primarily spirits, at both baseline and resurvey. Those engaging in heavy drinking episodes tended to have multiple other health-related risk factors (e.g., regular smoking, low fruit intake, low physical activity and hypertension).
Conclusions from research readings:
In China, the urban population is somewhat ahead of the rural population in drinking, whereas in the USA the trend is less uniform (i.e. in the South the urban population consumes more alcohol than the rural population, whereas in the Midwest there is no difference between the two in terms of alcohol consumption).
1 note · View note