Assignment 03-04 (Testing a Logistic Regression Model)
Dataset : Gapminder
Variables
The following derived variables (obtained by categorizing the provided variables) are used:
lifgrps (response variable) : derived from lifeexpectancy; set to 1 when lifeexpectancy is greater than or equal to 65, else 0.
urbgrps (primary explanatory variable) : derived from urbanrate; 1 for countries with urbanrate above the mean (urb_mean), else 0.
alcgrps : derived from alcconsumption; 1 for countries with alcconsumption above the mean (alc_mean), else 0.
incgrps : derived from incomeperperson; 1 for countries with incomeperperson above the mean (inc_mean), else 0.
relgrps : derived from relectricperperson; 1 for countries with relectricperperson above the mean (rel_mean), else 0.
The explanation of the variables was provided in the previous post.
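Each of these indicators is a mean-split of a numeric column. As a minimal sketch of the pattern (using a made-up column `x` rather than the actual Gapminder variables):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data for one Gapminder column
df = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0]})

# Mean-split indicator: 1 when the value exceeds the column mean, else 0
x_mean = df['x'].mean()          # 25.0 for this toy column
df['xgrps'] = np.where(df['x'] > x_mean, 1, 0)

print(df['xgrps'].tolist())      # [0, 0, 1, 1]
```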
Research Question
H0 : There is no association between urbanrate and lifeexpectancy.
H1 : Life expectancy increases with urbanrate.
Here we test the research question as:
H1 : The number of countries with lifgrps = 1 in the urbgrps = 1 category is greater than in the urbgrps = 0 category.
H0 : No such association exists.
Output
Rows
213
columns
16
===================================
Logistic Regression Modelling
===================================
lreg1 : lifgrps ~ urbgrps
Optimization terminated successfully.
         Current function value: 0.591261
         Iterations 5

Logit Regression Results
Dep. Variable: lifgrps         No. Observations: 213
Model: Logit                   Df Residuals: 211
Method: MLE                    Df Model: 1
Date: Fri, 24 Jul 2020         Pseudo R-squ.: 0.07575
Time: 23:20:52                 Log-Likelihood: -125.94
converged: True                LL-Null: -136.26
Covariance Type: nonrobust     LLR p-value: 5.534e-06

                   coef    std err      z      P>|z|    [0.025    0.975]
Intercept         0.0202    0.201     0.101    0.920    -0.374     0.414
urbgrps[T.1L]     1.3552    0.308     4.400    0.000     0.751     1.959

Odds Ratios
Intercept        1.020408
urbgrps[T.1L]    3.877391
dtype: float64

odd ratios with 95% confidence intervals
                 Lower CI    Upper CI    OR
Intercept        0.688125    1.513145    1.020408
urbgrps[T.1L]    2.120087    7.091294    3.877391
--------------------------------
Optimization terminated successfully.
         Current function value: 0.625930
         Iterations 5

Logit Regression Results
Dep. Variable: lifgrps         No. Observations: 213
Model: Logit                   Df Residuals: 211
Method: MLE                    Df Model: 1
Date: Fri, 24 Jul 2020         Pseudo R-squ.: 0.02155
Time: 23:20:52                 Log-Likelihood: -133.32
converged: True                LL-Null: -136.26
Covariance Type: nonrobust     LLR p-value: 0.01537

              coef    std err      z      P>|z|    [0.025    0.975]
Intercept    0.4055    0.179     2.265    0.024     0.055     0.756
alcgrps      0.7419    0.313     2.371    0.018     0.129     1.355

Hence lifgrps is not associated with alcgrps
--------------------------------
Optimization terminated successfully.
         Current function value: 0.631670
         Iterations 5

Logit Regression Results
Dep. Variable: lifgrps         No. Observations: 213
Model: Logit                   Df Residuals: 211
Method: MLE                    Df Model: 1
Date: Fri, 24 Jul 2020         Pseudo R-squ.: 0.01258
Time: 23:20:53                 Log-Likelihood: -134.55
converged: True                LL-Null: -136.26
Covariance Type: nonrobust     LLR p-value: 0.06407

                   coef    std err      z      P>|z|    [0.025    0.975]
Intercept         0.4841    0.175     2.772    0.006     0.142     0.826
incgrps[T.1L]     0.5788    0.318     1.819    0.069    -0.045     1.203

Hence lifgrps is not associated with incgrps
--------------------------------
Logit Regression Results
Dep. Variable: lifgrps         No. Observations: 213
Model: Logit                   Df Residuals: 211
Method: MLE                    Df Model: 1
Date: Fri, 24 Jul 2020         Pseudo R-squ.: 0.01258
Time: 23:20:53                 Log-Likelihood: -134.55
converged: True                LL-Null: -136.26
Covariance Type: nonrobust     LLR p-value: 0.06407

                   coef    std err      z      P>|z|    [0.025    0.975]
Intercept         0.4841    0.175     2.772    0.006     0.142     0.826
relgrps[T.1L]     0.5788    0.318     1.819    0.069    -0.045     1.203

Hence lifgrps is not associated with relgrps
--------------------------------
Summary
The logistic regression model with lifgrps as the response variable and urbgrps as the explanatory variable indicates that lifgrps is strongly associated with urbgrps. The following statistics were obtained from the summary:
p-value : less than 0.0001
odds ratio : 3.877391, with 95% confidence interval (2.120087, 7.091294)
The association of lifgrps with alcgrps, incgrps and relgrps was then tested individually, but the results showed that no association exists, as can be interpreted from the higher p-values.
Regarding the research question, the null hypothesis can be rejected as there is enough evidence against it, as seen from the significant p-value. Thus there is an association between lifgrps and urbgrps.
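To make the odds ratio concrete, the fitted coefficients can be pushed through the logistic link to get predicted probabilities of lifgrps = 1 at the two urbgrps levels. A small sketch using only the coefficients reported in the lreg1 summary above:

```python
import numpy as np

# Coefficients from the lreg1 summary above
intercept, b_urb = 0.0202, 1.3552

def prob_lifgrps1(urbgrps):
    # inverse logit: p = 1 / (1 + exp(-(b0 + b1 * x)))
    return 1.0 / (1.0 + np.exp(-(intercept + b_urb * urbgrps)))

print(round(prob_lifgrps1(0), 3))   # 0.505 : below-mean urbanrate
print(round(prob_lifgrps1(1), 3))   # 0.798 : above-mean urbanrate
```

With an odds ratio of about 3.88, above-mean-urbanrate countries go from roughly even odds to about an 80% chance of lifeexpectancy >= 65 under this model.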
Finally The Code
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats
#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy
# incomeperperson
# relectricperperson

# Setting values to numeric (pandas.to_numeric replaces the removed convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')

data2 = data.copy()
# Categorizing lifeexpectancy as lifgrps
def lifgrps (row):
    if row['lifeexpectancy'] >= 65 :
        return 1
    else :
        return 0
data2['lifgrps'] = data2.apply (lambda row: lifgrps (row), axis=1)
# Logistic Regression Modelling
print ('===================================')
print ('Logistic Regression Modelling')
print ('===================================')
# Categorizing urbanrate as urbgrps
urb_mean = data2['urbanrate'].mean()
def urbgrps (row):
    if row['urbanrate'] <= urb_mean :
        return 0
    else :
        return 1
data2['urbgrps'] = data2.apply (lambda row: urbgrps (row), axis=1)
data2["urbgrps"] = data2["urbgrps"].astype('category')
print ('lreg1 : lifgrps ~ urbgrps')
lreg1 = smf.logit(formula = 'lifgrps ~ urbgrps', data = data2).fit()
print (lreg1.summary())

# odds ratios
print ("Odds Ratios")
print (numpy.exp(lreg1.params))

# odd ratios with 95% confidence intervals
print ('odd ratios with 95% confidence intervals')
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
print ('--------------------------------')
# categorizing alcconsumption into grps
alc_mean = data2['alcconsumption'].mean()
def alcgrps (row):
    if row['alcconsumption'] >= alc_mean :
        return 1
    else :
        return 0
data2['alcgrps'] = data2.apply (lambda row: alcgrps (row), axis=1)

lreg2_1 = smf.logit(formula = 'lifgrps ~ alcgrps', data = data2).fit()
print (lreg2_1.summary())
print ('Hence lifgrps is not associated with alcgrps')
print ('--------------------------------')
# categorizing incomeperperson
inc_mean = data2['incomeperperson'].mean()
def incgrps (row):
    if row['incomeperperson'] <= inc_mean :
        return 0
    else :
        return 1
data2['incgrps'] = data2.apply (lambda row: incgrps (row), axis=1)
data2["incgrps"] = data2["incgrps"].astype('category')

lreg3_1 = smf.logit(formula = 'lifgrps ~ incgrps', data = data2).fit()
print (lreg3_1.summary())

print ('Hence lifgrps is not associated with incgrps')
print ('--------------------------------')
# Categorizing relectricperperson as relgrps
rel_mean = data2['relectricperperson'].mean()
def relgrps (row):
    if row['relectricperperson'] <= rel_mean :
        return 0
    else :
        return 1
data2['relgrps'] = data2.apply (lambda row: relgrps (row), axis=1)
data2["relgrps"] = data2["relgrps"].astype('category')

lreg4_1 = smf.logit(formula = 'lifgrps ~ relgrps', data = data2).fit()
print (lreg4_1.summary())

print ('Hence lifgrps is not associated with relgrps')
print ('--------------------------------')
Assignment 03_03 (Test A Multiple Regression Model)
Dataset
GapMinder
Variables
· alcconsumption (response variable) : per capita (age : 15+ years) alcohol consumption of a country
· urbanrate (primary explanatory variable) : percentage of population of the country settled in urban areas
· internetuserate : number of people per 100 who have access to the world wide web
· incomeperperson : Gross Domestic Product per capita in constant 2000 US$
Primary Research Question
H0 : there is no association between alcconsumption and urbanrate
HA : alcconsumption increases with urbanrate
Summary
· Considering the research question, we get enough evidence against H0 and hence conclude that alcconsumption increases with urbanrate. This can be seen from the summary of reg1 (the specification with urbanrate as the only explanatory variable and alcconsumption as the response variable), which has p-value 0.000171 and r2 value 0.075. The regression equation is y = 6.8453 + 0.0591*urb_c (urb_c is urbanrate with the mean centred at 0). The standardized residuals plot shows that the model is acceptable, as fewer than 5% of the residuals fall outside |y| = 2. Still, the r2 value is very low.
· In reg2, int_c (internetuserate with the mean centred at 0) was added, seeking a greater r2. It was found that internetuserate confounded urbanrate. r2 = 0.303 and p-value = 1.80e-14. The equation is y = 7.7316 + 0.1077*int_c - 0.0010*int_c**2 (the quadratic coefficient is negative in the summary). From the regression diagnostic plots it is also clear that the model is acceptable.
· Taking the quest ahead, in reg3, inc_c (incomeperperson with the mean centred at 0) was taken into account. Now int_c**2 got confounded by inc_c, hence we removed it. r2 = 0.338 and p-value = 3.96e-16. Regression line: y = 6.7842 + 0.1484*int_c - 0.0002*inc_c (the inc_c coefficient is negative in the summary). Again reg3 can be accepted, as is evident from the diagnostic plots (<5% of standardized residuals fall beyond |y| = 2).
· Hence the final model is reg3; note that it does not include the primary explanatory variable.
· Considering the regression diagnostic plots:
o QQ-plots for reg2 and reg3 are almost similar
o The QQ-plot for reg1 shows a bit more deviation, indicating larger errors
o Again, the standardized residuals plots for reg2 and reg3 are similar, with only 3 (1.71%) observations falling beyond |y| = 2, against 4 (2.29%) for reg1. As both percentages are below 5%, the models are acceptable.
o The leverage plot of reg3 shows that the residuals decrease with increase in inc_c and also with increase in int_c. This means the residuals are not independent of the values of the explanatory variables. Thus there must be another explanatory variable that is also associated with the response variable alcconsumption.
Output
==================================
Multiple Regression Modelling
==================================
Starting with primary explanatory variable : urbanrate

OLS Regression Results
Dep. Variable: alcconsumption    R-squared: 0.075
Model: OLS                       Adj. R-squared: 0.070
Method: Least Squares            F-statistic: 14.73
Date: Mon, 20 Jul 2020           Prob (F-statistic): 0.000171
Time: 17:50:32                   Log-Likelihood: -543.98
No. Observations: 183            AIC: 1092.
Df Residuals: 181                BIC: 1098.
Df Model: 1
Covariance Type: nonrobust

              coef    std err      t       P>|t|    [0.025    0.975]
Intercept    6.8453    0.353    19.409    0.000     6.149     7.541
urb_c        0.0591    0.015     3.838    0.000     0.029     0.089

Omnibus: 10.025            Durbin-Watson: 1.958
Prob(Omnibus): 0.007       Jarque-Bera (JB): 10.257
Skew: 0.573                Prob(JB): 0.00592
Kurtosis: 3.178            Cond. No. 23.0

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
Including internetuserate

OLS Regression Results
Dep. Variable: alcconsumption    R-squared: 0.303
Model: OLS                       Adj. R-squared: 0.296
Method: Least Squares            F-statistic: 38.13
Date: Mon, 20 Jul 2020           Prob (F-statistic): 1.80e-14
Time: 17:52:17                   Log-Likelihood: -504.45
No. Observations: 178            AIC: 1015.
Df Residuals: 175                BIC: 1024.
Df Model: 2
Covariance Type: nonrobust

                   coef    std err      t       P>|t|    [0.025      0.975]
Intercept         7.7316    0.485    15.937    0.000     6.774       8.689
int_c             0.1077    0.013     8.497    0.000     0.083       0.133
I(int_c ** 2)    -0.0010    0.000    -2.116    0.036    -0.002    -6.72e-05

Omnibus: 6.860             Durbin-Watson: 2.042
Prob(Omnibus): 0.032       Jarque-Bera (JB): 7.862
Skew: 0.301                Prob(JB): 0.0196
Kurtosis: 3.836            Cond. No. 1.67e+03

As r-square value for int_c is greater than urb_c, we remove urb_c from our model
Current r-square : 0.303, p-value : 1.80e-14
------------------------------------------------------------------------------------
Considering incomeperperson

OLS Regression Results
Dep. Variable: alcconsumption    R-squared: 0.338
Model: OLS                       Adj. R-squared: 0.330
Method: Least Squares            F-statistic: 43.90
Date: Mon, 20 Jul 2020           Prob (F-statistic): 3.96e-16
Time: 17:59:34                   Log-Likelihood: -491.08
No. Observations: 175            AIC: 988.2
Df Residuals: 172                BIC: 997.7
Df Model: 2
Covariance Type: nonrobust

              coef      std err       t       P>|t|    [0.025      0.975]
Intercept    6.7842     0.311     21.815     0.000     6.170       7.398
int_c        0.1484     0.018      8.063     0.000     0.112       0.185
inc_c       -0.0002     5.02e-05  -3.614     0.000    -0.000    -8.23e-05

Omnibus: 5.758             Durbin-Watson: 2.004
Prob(Omnibus): 0.056       Jarque-Bera (JB): 6.258
Skew: 0.273                Prob(JB): 0.0438
Kurtosis: 3.749            Cond. No. 1.05e+04

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+04. This might indicate that there are strong multicollinearity or other numerical problems.

As int_c**2 gets confounded by inc_c, we remove int_c**2
Current r-square : 0.338, p-value : 3.96e-16
----------------------------------------------------------
============================================
Regression Diagnostic Plots
============================================
QQ-plots for all stages
-------------------------------------
Standard Residuals for all stages
reg1

4 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem
reg2
3 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem
reg3
3 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem
For reg2 and reg3, there is one extreme outlier each (falls above y=3)
-----------------------------------------------------
Leverage Plot
Finally The Code
# Assignment 03-03
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption (response variable)
# urbanrate (primary explanatory variable)
# internetuserate
# incomeperperson

# Setting values to numeric (pandas.to_numeric replaces the removed convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
# Multiple Regression Modelling
print ('==================================')
print ('Multiple Regression Modelling')
print ('==================================')

# urbanrate and alcconsumption
print ('Starting with primary explanatory variable : urbanrate')

data['urb_c'] = data['urbanrate'] - data['urbanrate'].mean()

reg1 = smf.ols(formula = 'alcconsumption ~ urb_c', data=data).fit()
print (reg1.summary())

print ('--------------------------------------')
# Including internetuserate
print ('Including internetuserate')

data['int_c'] = data['internetuserate'] - data['internetuserate'].mean()

reg2 = smf.ols(formula = 'alcconsumption ~ int_c + I(int_c**2)', data=data).fit()
print (reg2.summary())

print ('As r-square value for int_c is greater than urb_c, we remove urb_c from our model')
print ('Current r-square : 0.303, p-value : 1.80e-14')

print ('------------------------------------------------------------------------------------')
# Considering incomeperperson
print ('Considering incomeperperson')

data['inc_c'] = data['incomeperperson'] - data['incomeperperson'].mean()

reg3 = smf.ols(formula = 'alcconsumption ~ int_c + inc_c', data=data).fit()
print (reg3.summary())

print ('As int_c**2 gets confounded by inc_c, we remove int_c**2')
print ('Current r-square : 0.338, p-value : 3.96e-16')

print ('----------------------------------------------------------')
print ('----------------------------------------------------------')
# Regression Diagnostic Plots
print ('============================================')
print ('Regression Diagnostic Plots')
print ('============================================')

print ('QQ-plots for all stages')
err_qq_1 = sm.qqplot(reg1.resid, line='r')
err_qq_2 = sm.qqplot(reg2.resid, line='r')
err_qq_3 = sm.qqplot(reg3.resid, line='r')
print ('-------------------------------------')

print ('Standardized residuals for all stages')

print ('reg1')
std_res_1 = pandas.DataFrame(reg1.resid_pearson)
err_std_1 = plt.plot(std_res_1, 'o', ls='none')
l = plt.axhline(y=0, color='r')
l = plt.axhline(y=2, color='g')
l = plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print ('4 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem')

print ('reg2')
std_res_2 = pandas.DataFrame(reg2.resid_pearson)
err_std_2 = plt.plot(std_res_2, 'o', ls='none')
l = plt.axhline(y=0, color='r')
l = plt.axhline(y=2, color='g')
l = plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print ('3 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem')

print ('reg3')
std_res_3 = pandas.DataFrame(reg3.resid_pearson)
err_std_3 = plt.plot(std_res_3, 'o', ls='none')
l = plt.axhline(y=0, color='r')
l = plt.axhline(y=2, color='g')
l = plt.axhline(y=-2, color='g')
plt.ylabel('Std Res')
plt.xlabel('# of obs')
print ('3 residuals fall above y=2; less than 5% (since total # of obs > 100), hence no problem')

print ('For reg2 and reg3, there is one extreme outlier each (falls above y=3)')

print ('-----------------------------------------------------')

# Leverage Plot
print ('Leverage Plot')
err_lev_1 = plt.figure()
err_lev_1 = sm.graphics.plot_regress_exog(reg3, 'int_c', fig=err_lev_1)

err_lev_2 = plt.figure()
err_lev_2 = sm.graphics.plot_regress_exog(reg3, 'inc_c', fig=err_lev_2)
Assignment 03-02 (Testing A Basic Linear Regression Model)
Dataset : GapMinder
Variables
urbanrate (primary explanatory variable) : centered at 0 and stored in urb_c; this urb_c is used to fit the linear regression model
lifeexpectancy (response variable)
Summary
slope : 0.2628
intercept : 69.6752
Hence y = 69.6752 + 0.2628x, where x = urbanrate - 56.7693596059 (the mean urbanrate) and y = expected life expectancy
The value of r-squared = 0.375, F-statistic = 104.6, p-value = 1.61e-19 and correlation coefficient = 0.6127112161764898
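As a quick worked example, the fitted line can be used to predict life expectancy for any urban rate (using only the coefficients and mean reported above):

```python
# Coefficients and mean taken from the regression summary above
urb_mean = 56.7693596059
intercept, slope = 69.6752, 0.2628

def predicted_lifeexpectancy(urbanrate):
    # x is urbanrate centred at the mean, as in the model
    return intercept + slope * (urbanrate - urb_mean)

print(round(predicted_lifeexpectancy(80), 2))   # 75.78
```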
Output
Rows
213
columns
16
==================================
Regression Modelling
==================================
Centring explanatory variable urbanrate and storing it into variable urb_c
mean of urbanrate is 56.7693596059
Describing urb_c
count    203.000000
mean       0.000000
std       23.844933
min      -46.369360
25%      -19.939360
50%        1.170640
75%       17.440640
max       43.230640
Name: urb_c, dtype: float64
Finally the regression model

OLS Regression Results
Dep. Variable: lifeexpectancy    R-squared: 0.375
Model: OLS                       Adj. R-squared: 0.372
Method: Least Squares            F-statistic: 104.6
Date: Fri, 17 Jul 2020           Prob (F-statistic): 1.61e-19
Time: 17:46:01                   Log-Likelihood: -610.02
No. Observations: 176            AIC: 1224.
Df Residuals: 174                BIC: 1230.
Df Model: 1
Covariance Type: nonrobust

               coef    std err       t       P>|t|    [0.025    0.975]
Intercept    69.6752    0.589    118.201    0.000    68.512    70.839
urb_c         0.2628    0.026     10.227    0.000     0.212     0.313

Omnibus: 11.099            Durbin-Watson: 1.866
Prob(Omnibus): 0.004       Jarque-Bera (JB): 12.104
Skew: -0.637               Prob(JB): 0.00235
Kurtosis: 2.827            Cond. No. 23.0

Running Pearsons Correlation Test for getting value of correlation coefficient
(0.6127112161764898, 1.607809724025055e-19)
Finally The Code
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 17 16:01:24 2020

@author: ASUS
"""

import numpy
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import seaborn
import scipy
import matplotlib.pyplot as plt
#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (pandas.to_numeric replaces the removed convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
# Regression Modelling
print ('==================================')
print ('Regression Modelling')
print ('==================================')

# centring explanatory variable urbanrate and storing it into variable urb_c
print ('Centring explanatory variable urbanrate and storing it into variable urb_c')
urb_m = data['urbanrate'].mean()
print ('mean of urbanrate is ')
print (urb_m)

def urb_c (row):
    return row['urbanrate'] - urb_m

data['urb_c'] = data.apply (lambda row: urb_c (row), axis=1)

print ('Describing urb_c')
temp = data['urb_c'].describe()
print (temp)
data_c=data.dropna()
print ('Finally the regression model')
reg1 = smf.ols(formula = 'lifeexpectancy ~ urb_c', data=data_c).fit()
print (reg1.summary())

scat3 = seaborn.regplot(x="urb_c", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate centered at 0')
plt.ylabel('Life-expectancy')

print ('Running Pearsons Correlation Test for getting value of correlation coefficient')
print (scipy.stats.pearsonr(data_c['urb_c'], data_c['lifeexpectancy']))
Assignment 03-01 (Writing About Your Data)
Sample
The sample is taken from the GapMinder dataset.
As collective information is taken about each country (observation), i.e., the individuals are the countries, this is an aggregate-level analysis.
Data Collection Procedure
The goal of the GapMinder Foundation is to fight devastating ignorance with a fact-based world view that everyone can understand, using inferences from the data collected.
The dataset contains data on variables like life expectancy, urban rate, per-capita alcohol consumption (for people above 15 years), income per person, HIV rate, number of breast-cancer cases per 100 thousand women, etc. for all 192 UN members (aggregating data for Serbia and Montenegro). Additionally, it includes data for 24 other areas, giving a total of 215 areas.
The data is collected by GapMinder from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, and the World Bank.
As the GapMinder Foundation has not carried out any survey or experiment on its own, nor has it observed the population, it can be concluded that the study design generating the data is data reporting.
The data for the different variables of the dataset was taken from different sources at different times; there is no information regarding when the whole dataset was collected.
Clearly the data is not experimental, as no explanatory variable is manipulated; rather, it is observational.
Variables
Assignment 03-01 (Writing About Your Data)
Sample
The GapMinder dataset is used.
Data Collection Procedure
The dataset contains data on variables like life expectancy, urban rate, per-capita alcohol consumption (for people above 15 years), income per person, HIV rate, number of breast-cancer cases per 100 thousand women, etc. for all 192 UN members (aggregating data for Serbia and Montenegro). Additionally, it includes data for 24 other areas, giving a total of 215 areas.
The data is collected by GapMinder from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, and the World Bank.
Clearly the data is not experimental, as no explanatory variable is manipulated; rather, it is observational.
Variables
alcconsumption : measures the amount of pure alcohol (in litres) consumed per individual (age 15+) per year.
Assignment 02_04 (Course-02, Week-04 : Testing A Potential Moderator)
Dataset : GapMinder
Variables
urbanrate (explanatory variable)
lifeexpectancy (response variable)
alcgrps : alcconsumption collapsed into 4 groups corresponding to the 1st, 2nd, 3rd and 4th quartiles
Summary
The last row shows the correlation coefficient and p-value of the Pearson correlation test in which the whole dataset is considered.
The direction of the association does not appear to change with the moderation variable but the strength appears to change slightly.
Thus the moderator does not strongly alter the association between the two variables.
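The quartile collapsing and the per-group correlation tests can be written more compactly with `pandas.qcut` and `groupby`; a sketch on simulated stand-in data (not the Gapminder file; the effect built into `lifeexpectancy` is made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Simulated stand-in columns with a positive urbanrate effect built in
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'urbanrate': rng.uniform(10, 100, n),
                   'alcconsumption': rng.uniform(0, 25, n)})
df['lifeexpectancy'] = 55 + 0.25 * df['urbanrate'] + rng.normal(0, 5, n)

# Quartile groups 1..4 in one call, instead of a hand-written if/elif chain
df['alcgrps'] = pd.qcut(df['alcconsumption'], 4, labels=[1, 2, 3, 4])

# Pearson correlation within each moderator level
for grp, sub in df.groupby('alcgrps', observed=True):
    r, p = pearsonr(sub['urbanrate'], sub['lifeexpectancy'])
    print(grp, round(r, 3), round(p, 5))
```

`qcut` guarantees the four groups split the sample at the 25/50/75 percentiles, so the hard-coded cut points (2.625, 5.92, 9.925) are no longer needed.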
Output
Rows
213
columns
16
===============================================================
urbanrate vs lifeexpectancy without moderator
===============================================================
Overall analysis
association between urbanrate and lifeexpectancy
(0.6075222955616916, 1.2247333760171806e-05)
===============================================================
urbanrate vs lifeexpectancy with alcconsumption as moderator
===============================================================
alcgrps=1 : upto 25%ile from bottom
(0.6235967761519358, 3.6639464319432514e-06)
--------------------------------------------------------
alcgrps=2 : between 25%ile and 50%ile from bottom
(0.45570230600060746, 0.0018802338429809984)
---------------------------------------
alcgrps=3 : between 50%ile and 75%ile from bottom
(0.5786487797163785, 5.9669496046994585e-05)
---------------------------------------
alcgrps=4 : between 75%ile and 100%ile from bottom
(0.6075222955616916, 1.2247333760171806e-05)
---------------------------------------
Finally The Code
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (pandas.to_numeric replaces the removed convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
# urbanrate vs. lifeexpectancy
print ('===============================================================')
print ('urbanrate vs lifeexpectancy without moderator')
print ('===============================================================')

data_c = data.dropna()

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('Scatterplot for the Association Between Urban-rate and Life-expectancy')

print ('Overall analysis')
print ('association between urbanrate and lifeexpectancy')
print (scipy.stats.pearsonr(data_c['urbanrate'], data_c['lifeexpectancy']))
# urbanrate vs. lifeexpectancy with alcconsumption as moderator
print ('===============================================================')
print ('urbanrate vs lifeexpectancy with alcconsumption as moderator')
print ('===============================================================')

# temp = data['alcconsumption'].describe()
# print (temp)

# collapsing alcconsumption into groups
def alcgrps(row):
    if row['alcconsumption'] <= 2.625000:
        return 1
    elif row['alcconsumption'] <= 5.920000:
        return 2
    elif row['alcconsumption'] <= 9.925000:
        return 3
    else:
        return 4

data['alcgrps'] = data.apply(lambda row: alcgrps(row), axis=1)
# alcgrps = 1
print ('alcgrps=1 : up to 25%ile from bottom')
data_sub = data[(data['alcgrps'] == 1)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=1')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('--------------------------------------------------------')
# alcgrps = 2
print ('alcgrps=2 : between 25%ile and 50%ile from bottom')
data_sub = data[(data['alcgrps'] == 2)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=2')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')
# alcgrps = 3
print ('alcgrps=3 : between 50%ile and 75%ile from bottom')
data_sub = data[(data['alcgrps'] == 3)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=3')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')
# alcgrps = 4
print ('alcgrps=4 : between 75%ile and 100%ile from bottom')
data_sub = data[(data['alcgrps'] == 4)]
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('alcgrps=4')
data_clean = data_sub.dropna()
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
print ('---------------------------------------')
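The four near-identical subgroup blocks above could be collapsed into a single loop. A minimal sketch with synthetic stand-in data (the column names mirror the Gapminder ones, but the values here are made up; the real script reads Dataset_gapminder.csv):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the Gapminder subset (values are made up)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'urbanrate': rng.uniform(10, 100, n),
                   'alcconsumption': rng.uniform(0, 15, n)})
df['lifeexpectancy'] = 50 + 0.3 * df['urbanrate'] + rng.normal(0, 5, n)

# Quartile-based groups replace the hard-coded cut points in alcgrps()
df['alcgrps'] = pd.qcut(df['alcconsumption'], 4, labels=[1, 2, 3, 4])

# One loop instead of four copy-pasted blocks
for grp in [1, 2, 3, 4]:
    sub = df[df['alcgrps'] == grp].dropna()
    r, p = stats.pearsonr(sub['urbanrate'], sub['lifeexpectancy'])
    print(f'alcgrps={grp}: r={r:.3f}, p={p:.3g}')
```

The seaborn plotting calls were left out to keep the sketch self-contained; the regplot/xlabel/title lines from the blocks above would slot into the loop body unchanged.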
Assignment 02_03 (Course-02, Week-03)
Generating Correlation Coefficient
Dataset : GapMinder
Variables :
urbanrate : % of population of the country living in urban areas
alcconsumption : per capita (age : 15+) alcohol consumption in a year
lifeexpectancy : how many years a normal new born baby would live in current situation
Research Question (Hypothesis)
Research Question 1
H0 : urbanrate and alcconsumption are independent of each other
HA : alcconsumption increases with urbanrate
Research Question 2
H0 : urbanrate and lifeexpectancy are independent of each other
HA : lifeexpectancy increases with urbanrate
Summary
Research Question 1
Value of correlation coefficient = 0.27446605904089333, thus there exists a positive correlation between urbanrate and alcconsumption, i.e. alcconsumption increases with urbanrate.
p-value = 0.00022753282212695448, which is less than 0.05
From points 1 and 2 above, it is concluded that H0 (null hypothesis) is rejected and HA (alternate hypothesis) is accepted
However, knowing the value of urbanrate explains only about 7.53% (r² = 0.2745² ≈ 0.0753) of the variability in alcconsumption.
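The 7.5331% figure is the coefficient of determination, i.e. the square of the Pearson r reported above, expressed as a percentage:

```python
r = 0.27446605904089333      # Pearson r for urbanrate vs alcconsumption
r_squared = r ** 2           # coefficient of determination
print(round(r_squared * 100, 4))   # ≈ 7.53% of the variability in alcconsumption
```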
Research Question 2
Value of correlation coefficient = 0.6127112161764898, thus there exists a positive correlation between urbanrate and lifeexpectancy, i.e. lifeexpectancy increases with urbanrate.
p-value = 1.607809724025055e-19, which is very much less than 0.05
From points 1 and 2 above, it is concluded that H0 (null hypothesis) is rejected and HA (alternate hypothesis) is accepted
However, knowing the value of urbanrate explains only about 37.54% (r² = 0.6127² ≈ 0.3754) of the variability in lifeexpectancy.
Output
Rows
213
columns
16
===================================
urbanrate vs alcconsumption
===================================
Descriptive analysis
----------------------------
Inferential analysis
association between urbanrate and alcconsumption
(0.27446605904089333, 0.00022753282212695448)
===================================
urbanrate vs lifeexpectancy
===================================
Descriptive analysis
----------------------------
Inferential analysis
association between urbanrate and lifeexpectancy
(0.6127112161764898, 1.607809724025055e-19)
Finally The Code
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

# importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

# ------- Variables under consideration ------- #
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; use to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data_clean = data.dropna()

# urbanrate vs. alcconsumption
print ('===================================')
print ('urbanrate vs alcconsumption')
print ('===================================')

print ('Descriptive analysis')
scat1 = seaborn.regplot(x="urbanrate", y="alcconsumption", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Per Capita Alcohol Consumption')
plt.title('Scatterplot for the Association Between Urban Rate and Alcohol Consumption')

print ('----------------------------')
print ('Inferential analysis')
print ('association between urbanrate and alcconsumption')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['alcconsumption']))

# urbanrate vs. lifeexpectancy
print ('===================================')
print ('urbanrate vs lifeexpectancy')
print ('===================================')

print ('Descriptive analysis')
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life-expectancy')
plt.title('Scatterplot for the Association Between Urban-rate and Life-expectancy')

print ('----------------------------')
print ('Inferential analysis')
print ('association between urbanrate and lifeexpectancy')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
Assignment-02_02 (Course-02, Module-02)
Running Chi-Square Independence Test
Dataset : GapMinder
Variables
urbanrate (urbgrps after collapsing into categories)
lifeexpectancy ( lifgrps after collapsing into categories)
urbgrps : urbanrate is collapsed into categories of width 20%, hence having 5 categories (1 for 0%-20%, 2 for 20% to 40%, .... and 5 for 80% to 100%)
lifgrps : It is a two-valued variable, with value 1 if the country has life-expectancy greater than or equal to 65 years and 0 otherwise.
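The lifgrps recode can be sketched on a hypothetical handful of life-expectancy values (the real variable comes from the Gapminder CSV):

```python
import pandas as pd

life = pd.Series([52.8, 65.0, 81.3, 48.4, 73.1])  # hypothetical values
lifgrps = (life >= 65).astype(int)   # 1 if life expectancy is 65+ years, else 0
print(lifgrps.tolist())              # [0, 1, 1, 0, 1]
```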
Research Question : Hypothesis
H0 : proportion of countries in each category of urbgrps having life-expectancy of 65+ years is equal
HA : The proportion is unequal for at least two groups
Summary
The p-value for the Chi-square test is 0.000008, which is significantly less than 0.05. Hence we can reject the null hypothesis.
The following table shows the p-values of chi-square tests for each pair of categories.
There are 10 comparisons, hence our threshold p-value must be 0.05/10 = 0.005
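The 10 comparisons and the 0.005 threshold follow from the number of unordered pairs among the five urbgrps categories:

```python
from itertools import combinations

pairs = list(combinations([1, 2, 3, 4, 5], 2))  # all unordered category pairs
threshold = 0.05 / len(pairs)                   # Bonferroni-adjusted threshold
print(len(pairs), round(threshold, 5))          # 10 0.005
```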
It is clear that category-4 (countries with urban-rate between 60% and 80%) has significantly different life-expectancy
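The overall chi-square result can be reproduced directly from the observed counts shown in the Output section:

```python
import numpy as np
import scipy.stats

# Observed lifgrps x urbgrps counts from the contingency table in the output
observed = np.array([[9, 24, 17,  6, 16],    # lifgrps = 0
                     [4, 22, 29, 52, 34]])   # lifgrps = 1
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(round(chi2, 2), dof, p < 0.05)         # 28.77 4 True
```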
Output
Rows
213
columns
16
=====================================
Descriptive Statistical Analysis
=====================================
=====================================
Inferential Statistical Analysis
=====================================
urbgrps   1   2   3   4   5
lifgrps
0         9  24  17   6  16
1         4  22  29  52  34

urbgrps          1         2         3         4         5
lifgrps
0         0.692308  0.521739  0.369565  0.103448  0.320000
1         0.307692  0.478261  0.630435  0.896552  0.680000

chi-square value, p value, degrees of freedom
(28.770249676716467, 8.704036143643654e-06, 4)

=============================
Post-hoc test
New threshold for p-value : 0.005

1 vs 2 : chi-square = 0.604421, p = 0.436896
1 vs 3 : chi-square = 3.073966, p = 0.079555
1 vs 4 : chi-square = 18.706470, p = 1.524643e-05
1 vs 5 : chi-square = 4.520722, p = 0.033487
2 vs 3 : chi-square = 1.583931, p = 0.208195
2 vs 4 : chi-square = 19.878234, p = 8.253470e-06
2 vs 5 : chi-square = 3.224646, p = 0.072537
3 vs 4 : chi-square = 9.059129, p = 0.002614
3 vs 5 : chi-square = 0.087453, p = 0.767440
4 vs 5 : chi-square = 6.485275, p = 0.010877
Finally The Code
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import scipy.stats

# importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

# Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
# Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

# ------- Variables under consideration ------- #
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (convert_objects is deprecated; use to_numeric)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')

data2 = data.copy()

# Categorizing urbanrate as urbgrps
def urbgrps(row):
    if row['urbanrate'] <= 20:
        return 1
    elif row['urbanrate'] <= 40:
        return 2
    elif row['urbanrate'] <= 60:
        return 3
    elif row['urbanrate'] <= 80:
        return 4
    else:
        return 5

data2['urbgrps'] = data2.apply(lambda row: urbgrps(row), axis=1)
data2["urbgrps"] = data2["urbgrps"].astype('category')

# Categorizing lifeexpectancy as lifgrps
def lifgrps(row):
    if row['lifeexpectancy'] >= 65:
        return 1
    else:
        return 0

data2['lifgrps'] = data2.apply(lambda row: lifgrps(row), axis=1)
# Descriptive Analysis
print ('=====================================')
print ('Descriptive Statistical Analysis')
print ('=====================================')
seaborn.factorplot(x='urbgrps', y='lifgrps', data=data2, kind="bar", ci=None)
plt.xlabel('Urban rate group')
plt.ylabel('Proportion of countries with life expectancy of 65+ years')
# Inferential Statistics
print ('=====================================')
print ('Inferential Statistical Analysis')
print ('=====================================')

# contingency table of observed counts
ct1 = pandas.crosstab(data2['lifgrps'], data2['urbgrps'])
print (ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print (colpct)

# chi-square value, p value, expected counts
print ('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print (cs1)
# Post-hoc Test
print ('=============================')
print ('Post-hoc test')
print ('New threshold for p-value : ')
print (0.05/10)

print ('1 vs 2')
recode1_2 = {1:1, 2:2}
data2['comp1v2'] = data2['urbgrps'].map(recode1_2)
ct1_2 = pandas.crosstab(data2['lifgrps'], data2['comp1v2'])
print (ct1_2)
colsum1_2 = ct1_2.sum(axis=0)
colpct1_2 = ct1_2/colsum1_2
print (colpct1_2)
print ('chi-square value, p value, expected counts')
cs1_2 = scipy.stats.chi2_contingency(ct1_2)
print (cs1_2)

# 1 vs 3
print ('1 vs 3')
recode = {1:1, 3:3}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 1 vs 4
print ('1 vs 4')
recode = {1:1, 4:4}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 1 vs 5
print ('1 vs 5')
recode = {1:1, 5:5}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 2 vs 3
print ('2 vs 3')
recode = {2:2, 3:3}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 2 vs 4
print ('2 vs 4')
recode = {2:2, 4:4}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 2 vs 5
print ('2 vs 5')
recode = {2:2, 5:5}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 3 vs 4
print ('3 vs 4')
recode = {3:3, 4:4}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 3 vs 5
print ('3 vs 5')
recode = {3:3, 5:5}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)

# 4 vs 5
print ('4 vs 5')
recode = {4:4, 5:5}
data2['comp'] = data2['urbgrps'].map(recode)
ct = pandas.crosstab(data2['lifgrps'], data2['comp'])
print (ct)
colsum = ct.sum(axis=0)
colpct = ct/colsum
print (colpct)
print ('chi-square value, p value, expected counts')
cs = scipy.stats.chi2_contingency(ct)
print (cs)
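The ten copy-pasted post-hoc blocks above could be generated with itertools.combinations instead. A sketch that rebuilds the data from the observed counts in the output (so it runs standalone, without the CSV):

```python
from itertools import combinations
import pandas as pd
import scipy.stats

# Rebuild lifgrps/urbgrps rows from the observed counts in the output
counts = {1: (9, 4), 2: (24, 22), 3: (17, 29), 4: (6, 52), 5: (16, 34)}
rows = []
for grp, (n0, n1) in counts.items():
    rows += [(grp, 0)] * n0 + [(grp, 1)] * n1
data2 = pd.DataFrame(rows, columns=['urbgrps', 'lifgrps'])

pairs = list(combinations([1, 2, 3, 4, 5], 2))
threshold = 0.05 / len(pairs)                 # Bonferroni correction, as above
for a, b in pairs:
    sub = data2[data2['urbgrps'].isin([a, b])]
    ct = pd.crosstab(sub['lifgrps'], sub['urbgrps'])
    chi2, p, dof, exp = scipy.stats.chi2_contingency(ct)
    print(f'{a} vs {b}: p = {p:.6g}' + (' *' if p < threshold else ''))
```

With these counts, the starred pairs are 1 vs 4, 2 vs 4 and 3 vs 4, matching the conclusion above that category 4 stands out.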
Assignment-2.1 (Course-02 : Week-01)
Dataset : Gapminder
Variables under consideration :
· urbanrate (explanatory variable for both research questions) : it is collapsed into groups of [0-10], [10-20], …, [80-90] and [90-100].
· alcconsumption (response variable for the 1st research question)
· lifeexpectancy (response variable for the 2nd research question)
Redefining Research Questions and Related Hypothesis
Research Question – 01
The per capita alcohol consumption of a country depends on its urban rate.
H0 : For all groups of the urban-rate, the group-wise mean per-capita alcohol consumptions are equal
HA : The mean per-capita alcohol consumptions are not equal for at least two groups
Research Question – 02
The average life-expectancy of new born baby in a country depends on its urban rate.
H0 : For all groups of the urban-rate, the group-wise mean life expectancies are equal
HA : The mean life expectancies are not equal for at least two groups
Code
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 19 16:35:22 2020
@author: ASUS
"""
# Assignment_3
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))
#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy
# Setting values to numeric
# convert_objects is deprecated; use to_numeric
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data2 = data.copy()
# urbanrate
data2['urbgrps']=pandas.cut(data2.urbanrate,[0,10,20,30,40,50,60,70,80,90,100])
data2["urbgrps"] = data2["urbgrps"].astype('category')
# ------Analysis -------#
#------Urbgrps vs. alcconsumption---------#
print ('Urbgrps vs. alcconsumption')
#-----Descriptive analysis----#
print ('Descriptive data analysis : ')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='alcconsumption', data=data2, kind="bar", ci=None)
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Bar Graph for the Association Between Urban Rate and Per-capita Alcohol Consumption')
print ('Seems H0 is to be rejected')
#----Inferential statistics-----#
print ('ANOVA-F test')
sub1_1 = data2[['alcconsumption', 'urbgrps']].dropna()
model1 = smf.ols(formula='alcconsumption ~ C(urbgrps)', data=sub1_1).fit()
print (model1.summary())
m1_1= sub1_1.groupby('urbgrps').mean()
print (m1_1)
#-----Post hoc----#
print ('Post-hoc test')
post_h_1 = multi.MultiComparison(sub1_1['alcconsumption'], sub1_1['urbgrps'])
res1 = post_h_1.tukeyhsd()
print(res1.summary())
# ------- urbgrps vs. lifeexpectancy -------#
print ('urbgrps vs. lifeexpectancy')
print ('Descriptive statistical analysis')
print ('C->Q bar graph')
seaborn.factorplot(x='urbgrps', y='lifeexpectancy', data=data2, kind="bar", ci=None)
plt.ylabel('Life Expectancy')
plt.title('Bar Graph for the Association Between Urban Rate and Life Expectancy')
#----Inferential statistics-----#
print ('ANOVA-F test')
sub1_2 = data2[['lifeexpectancy', 'urbgrps']].dropna()
model2 = smf.ols(formula='lifeexpectancy ~ C(urbgrps)', data=sub1_2).fit()
print (model2.summary())
m1_2= sub1_2.groupby('urbgrps').mean()
print (m1_2)
#-----Post hoc----#
print ('Post-hoc test')
post_h_2 = multi.MultiComparison(sub1_2['lifeexpectancy'], sub1_2['urbgrps'])
res2 = post_h_2.tukeyhsd()
print(res2.summary())
Output
Rows
213
columns
16
Urbgrps vs. alcconsumption
Descriptive data analysis :
C->Q bar graph
Seems H0 is to be rejected
ANOVA-F test
OLS Regression Results
==============================================================================
Dep. Variable: alcconsumption R-squared: 0.143
Model: OLS Adj. R-squared: 0.103
Method: Least Squares F-statistic: 3.625
Date: Tue, 07 Jul 2020 Prob (F-statistic): 0.000634
Time: 15:10:23 Log-Likelihood: -537.03
No. Observations: 183 AIC: 1092.
Df Residuals: 174 BIC: 1121.
Df Model: 8
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept 5.8516 0.328 17.843 0.000 5.204 6.499
C(urbgrps)[T.Interval(10, 20, closed='right')] -0.5483 1.249 -0.439 0.661 -3.014 1.917
C(urbgrps)[T.Interval(20, 30, closed='right')] -1.9921 0.949 -2.100 0.037 -3.865 -0.119
C(urbgrps)[T.Interval(30, 40, closed='right')] -1.1990 0.930 -1.289 0.199 -3.035 0.637
C(urbgrps)[T.Interval(40, 50, closed='right')] 0.1864 0.990 0.188 0.851 -1.767 2.140
C(urbgrps)[T.Interval(50, 60, closed='right')] 1.6110 0.930 1.731 0.085 -0.225 3.447
C(urbgrps)[T.Interval(60, 70, closed='right')] 3.0216 0.819 3.691 0.000 1.406 4.637
C(urbgrps)[T.Interval(70, 80, closed='right')] 2.0307 0.949 2.140 0.034 0.158 3.903
C(urbgrps)[T.Interval(80, 90, closed='right')] 3.2649 0.990 3.299 0.001 1.312 5.218
C(urbgrps)[T.Interval(90, 100, closed='right')] -0.5236 1.361 -0.385 0.701 -3.209 2.162
==============================================================================
Omnibus: 8.172 Durbin-Watson: 1.832
Prob(Omnibus): 0.017 Jarque-Bera (JB): 8.039
Skew: 0.500 Prob(JB): 0.0180
Kurtosis: 3.233 Cond. No. 1.37e+16
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.1e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
alcconsumption
urbgrps
(0, 10] nan
(10, 20] 5.303333
(20, 30] 3.859545
(30, 40] 4.652609
(40, 50] 6.038000
(50, 60] 7.462609
(60, 70] 8.873226
(70, 80] 7.882273
(80, 90] 9.116500
(90, 100] 5.328000
Post-hoc test
Multiple Comparison of Means - Tukey HSD, FWER=0.05
========================================================
group1 group2 meandiff p-adj lower upper reject
--------------------------------------------------------
(10, 20] (20, 30] -1.4438 0.9 -6.7067 3.8191 False
(10, 20] (30, 40] -0.6507 0.9 -5.8731 4.5716 False
(10, 20] (40, 50] 0.7347 0.9 -4.6203 6.0896 False
(10, 20] (50, 60] 2.1593 0.9 -3.0631 7.3816 False
(10, 20] (60, 70] 3.5699 0.381 -1.4161 8.5559 False
(10, 20] (70, 80] 2.5789 0.8146 -2.684 7.8418 False
(10, 20] (80, 90] 3.8132 0.389 -1.5418 9.1681 False
(10, 20] (90, 100] 0.0247 0.9 -6.2546 6.3039 False
(20, 30] (30, 40] 0.7931 0.9 -3.5803 5.1665 False
(20, 30] (40, 50] 2.1785 0.8319 -2.3525 6.7094 False
(20, 30] (50, 60] 3.6031 0.1993 -0.7703 7.9765 False
(20, 30] (60, 70] 5.0137 0.0051 0.9255 9.1019 True
(20, 30] (70, 80] 4.0227 0.1067 -0.399 8.4445 False
(20, 30] (80, 90] 5.257 0.0104 0.726 9.7879 True
(20, 30] (90, 100] 1.4685 0.9 -4.1246 7.0615 False
(30, 40] (40, 50] 1.3854 0.9 -3.0984 5.8692 False
(30, 40] (50, 60] 2.81 0.5145 -1.5145 7.1345 False
(30, 40] (60, 70] 4.2206 0.0329 0.1847 8.2565 True
(30, 40] (70, 80] 3.2297 0.3364 -1.1437 7.6031 False
(30, 40] (80, 90] 4.4639 0.052 -0.0199 8.9477 False
(30, 40] (90, 100] 0.6754 0.9 -4.8796 6.2304 False
(40, 50] (50, 60] 1.4246 0.9 -3.0592 5.9084 False
(40, 50] (60, 70] 2.8352 0.4672 -1.3709 7.0413 False
(40, 50] (70, 80] 1.8443 0.9 -2.6866 6.3752 False
(40, 50] (80, 90] 3.0785 0.4877 -1.559 7.716 False
(40, 50] (90, 100] -0.71 0.9 -6.3898 4.9698 False
(50, 60] (60, 70] 1.4106 0.9 -2.6253 5.4465 False
(50, 60] (70, 80] 0.4197 0.9 -3.9537 4.7931 False
(50, 60] (80, 90] 1.6539 0.9 -2.8299 6.1377 False
(50, 60] (90, 100] -2.1346 0.9 -7.6896 3.4204 False
(60, 70] (70, 80] -0.991 0.9 -5.0792 3.0973 False
(60, 70] (80, 90] 0.2433 0.9 -3.9628 4.4494 False
(60, 70] (90, 100] -3.5452 0.4859 -8.8786 1.7881 False
(70, 80] (80, 90] 1.2342 0.9 -3.2967 5.7651 False
(70, 80] (90, 100] -2.5543 0.8772 -8.1474 3.0388 False
(80, 90] (90, 100] -3.7885 0.4813 -9.4683 1.8913 False
urbgrps vs. lifeexpectancy
Descriptive statistical analysis
C->Q bar graph
ANOVA-F test
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.406
Model: OLS Adj. R-squared: 0.380
Method: Least Squares F-statistic: 15.32
Date: Tue, 07 Jul 2020 Prob (F-statistic): 4.76e-17
Time: 15:25:26 Log-Likelihood: -644.54
No. Observations: 188 AIC: 1307.
Df Residuals: 179 BIC: 1336.
Df Model: 8
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept 62.4438 0.521 119.810 0.000 61.415 63.472
C(urbgrps)[T.Interval(10, 20, closed='right')] -1.7713 2.041 -0.868 0.387 -5.800 2.257
C(urbgrps)[T.Interval(20, 30, closed='right')] 0.2834 1.548 0.183 0.855 -2.772 3.338
C(urbgrps)[T.Interval(30, 40, closed='right')] -1.7116 1.548 -1.106 0.270 -4.767 1.343
C(urbgrps)[T.Interval(40, 50, closed='right')] 4.5547 1.615 2.820 0.005 1.367 7.742
C(urbgrps)[T.Interval(50, 60, closed='right')] 7.0269 1.518 4.629 0.000 4.031 10.022
C(urbgrps)[T.Interval(60, 70, closed='right')] 10.4459 1.283 8.140 0.000 7.914 12.978
C(urbgrps)[T.Interval(70, 80, closed='right')] 13.4776 1.548 8.706 0.000 10.423 16.533
C(urbgrps)[T.Interval(80, 90, closed='right')] 13.9092 1.694 8.212 0.000 10.567 17.252
C(urbgrps)[T.Interval(90, 100, closed='right')] 16.2290 1.841 8.816 0.000 12.597 19.861
==============================================================================
Omnibus: 8.034 Durbin-Watson: 1.953
Prob(Omnibus): 0.018 Jarque-Bera (JB): 8.379
Skew: -0.515 Prob(JB): 0.0152
Kurtosis: 2.902 Cond. No. 7.24e+15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.02e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
lifeexpectancy
urbgrps
(0, 10] nan
(10, 20] 60.672500
(20, 30] 62.727136
(30, 40] 60.732182
(40, 50] 66.998450
(50, 60] 69.470652
(60, 70] 72.889647
(70, 80] 75.921364
(80, 90] 76.353000
(90, 100] 78.672733
Post-hoc test
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=========================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------------
(10, 20] (20, 30] 2.0546 0.9 -6.56 10.6693 False
(10, 20] (30, 40] 0.0597 0.9 -8.555 8.6743 False
(10, 20] (40, 50] 6.3259 0.3698 -2.4394 15.0913 False
(10, 20] (50, 60] 8.7982 0.0384 0.2499 17.3464 True
(10, 20] (60, 70] 12.2171 0.001 4.1569 20.2774 True
(10, 20] (70, 80] 15.2489 0.001 6.6342 23.8635 True
(10, 20] (80, 90] 15.6805 0.001 6.7344 24.6266 True
(10, 20] (90, 100] 18.0002 0.001 8.7032 27.2973 True
(20, 30] (30, 40] -1.995 0.9 -9.2327 5.2428 False
(20, 30] (40, 50] 4.2713 0.6536 -3.1452 11.6878 False
(20, 30] (50, 60] 6.7435 0.0826 -0.4151 13.9022 False
(20, 30] (60, 70] 10.1625 0.001 3.5944 16.7307 True
(20, 30] (70, 80] 13.1942 0.001 5.9565 20.432 True
(20, 30] (80, 90] 13.6259 0.001 5.9966 21.2551 True
(20, 30] (90, 100] 15.9456 0.001 7.9077 23.9835 True
(30, 40] (40, 50] 6.2663 0.1729 -1.1502 13.6828 False
(30, 40] (50, 60] 8.7385 0.0054 1.5798 15.8971 True
(30, 40] (60, 70] 12.1575 0.001 5.5893 18.7256 True
(30, 40] (70, 80] 15.1892 0.001 7.9514 22.4269 True
(30, 40] (80, 90] 15.6208 0.001 7.9916 23.2501 True
(30, 40] (90, 100] 17.9406 0.001 9.9026 25.9785 True
(40, 50] (50, 60] 2.4722 0.9 -4.8671 9.8115 False
(40, 50] (60, 70] 5.8912 0.1431 -0.8734 12.6558 False
(40, 50] (70, 80] 8.9229 0.0065 1.5064 16.3394 True
(40, 50] (80, 90] 9.3546 0.0068 1.5555 17.1536 True
(40, 50] (90, 100] 11.6743 0.001 3.475 19.8735 True
(50, 60] (60, 70] 3.419 0.7444 -3.0619 9.8999 False
(50, 60] (70, 80] 6.4507 0.1144 -0.7079 13.6094 False
(50, 60] (80, 90] 6.8823 0.1057 -0.6719 14.4366 False
(50, 60] (90, 100] 9.2021 0.011 1.2353 17.1689 True
(60, 70] (70, 80] 3.0317 0.8683 -3.5364 9.5999 False
(60, 70] (80, 90] 3.4634 0.8057 -3.5339 10.4606 False
(60, 70] (90, 100] 5.7831 0.2689 -1.6576 13.2238 False
(70, 80] (80, 90] 0.4316 0.9 -7.1976 8.0609 False
(70, 80] (90, 100] 2.7514 0.9 -5.2866 10.7893 False
(80, 90] (90, 100] 2.3197 0.9 -6.0725 10.7119 False
---------------------------------------------------------
Result
Research Question 01
We see that the F-value of the ANOVA F-test is 3.625 and the corresponding p-value is 0.000634, which is less than 0.05. The chance of wrongly rejecting the null hypothesis (H0) is therefore very small, so we can reject the null hypothesis in favour of the alternative, i.e. we conclude that alcconsumption depends on urban rate.
From the results of the post-hoc test, we see that the mean per-capita alcohol consumption differs between the following groups:
· (20, 30] and (60, 70]
· (30, 40] and (60, 70]
· (20, 30] and (80, 90]
Research Question 02
We see that the F-value of the ANOVA F-test is 15.32 and the corresponding p-value is 4.76e-17, which is less than 0.05. The chance of wrongly rejecting the null hypothesis (H0) is therefore very small, so we can reject the null hypothesis in favour of the alternative, i.e. we conclude that life expectancy depends on urban rate.
From the results of the post-hoc test, we see that the mean life expectancy differs between the following groups:
· (50, 60] and (90, 100]
· (20, 30] and (90, 100]
· (20, 30] and (80, 90]
· and many more
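The two tests above follow the usual statsmodels pattern: an OLS fit with `C(...)` for the ANOVA F-test, then Tukey HSD for the pairwise post-hoc comparisons. A self-contained sketch with hypothetical data (the assignment runs the same calls on the Gapminder groups):

```python
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Hypothetical data: a response measured across three groups with
# different true means, so the ANOVA should detect a difference.
rng = numpy.random.default_rng(1)
df = pandas.DataFrame({
    'grp': numpy.repeat(['a', 'b', 'c'], 30),
    'y': numpy.concatenate([rng.normal(60, 5, 30),
                            rng.normal(65, 5, 30),
                            rng.normal(75, 5, 30)]),
})

# ANOVA F-test: does the mean of y differ across the groups at all?
model = smf.ols('y ~ C(grp)', data=df).fit()
print('F =', model.fvalue, 'p =', model.f_pvalue)

# Tukey HSD post-hoc test: which specific pairs of groups differ?
tukey = multi.MultiComparison(df['y'], df['grp']).tukeyhsd()
print(tukey.summary())
```

The ANOVA only says that some group means differ; the Tukey table (the same format as the output above) tells us which pairs, with the family-wise error rate held at 0.05.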
Assignment_4
Program :
# Assignment_4

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (to_numeric replaces the deprecated convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data2 = data

# alcconsumption

# initial F.D.
print ('alcconsumption : alcohol consumption per adult (age 15+) in litres')
print ('Description and F.D. : histogram')
d1_1 = data['alcconsumption'].describe()
print (d1_1)
# histplot replaces the deprecated distplot
seaborn.histplot(data["alcconsumption"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Per Capita Alcohol Consumption in a year')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['alcgrps'] = pandas.cut(data2.alcconsumption, [0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25])
data2["alcgrps"] = data2["alcgrps"].astype('category')
d1_2 = data2['alcgrps'].describe()
print (d1_2)
seaborn.countplot(x="alcgrps", data=data2)
plt.xlabel('Countries')
plt.title('Per Capita Alcohol Consumption in a year')

# urbanrate

print ('---------------------------')
print ('urbanrate : Percentage of population living in urban areas')
print ('Description and F.D. : histogram')
d2_1 = data['urbanrate'].describe()
print (d2_1)
seaborn.histplot(data["urbanrate"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Urbanrate')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['urbgrps'] = pandas.cut(data2.urbanrate, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
data2["urbgrps"] = data2["urbgrps"].astype('category')
d2_2 = data2['urbgrps'].describe()  # fixed typo: was data2['utbgrps']
print (d2_2)
seaborn.countplot(x="urbgrps", data=data2)
plt.xlabel('Countries')
plt.title('Urbanrate')

# lifeexpectancy

print ('---------------------------')
print ('lifeexpectancy : years in avg a new born baby would live in current situation')
seaborn.histplot(data["lifeexpectancy"].dropna(), kde=True)
plt.xlabel('Countries')
plt.title('Life expectancy')
print ('Description and F.D. after collapsing into groups : bargraph')
data2['lifgrps'] = pandas.cut(data2.lifeexpectancy, [0, 50, 60, 70, 80, 90, 100])
data2["lifgrps"] = data2["lifgrps"].astype('category')  # fixed: was assigned to "urbgrps"
seaborn.countplot(x="lifgrps", data=data2)
plt.xlabel('Countries')
plt.title('Life Expectancy')

# ---- Plotting bi-variate graphs -------#

print ('----------------------')
print ('x=urbanrate, y=alcconsumption')
print ('Q->Q scatter chart')
scat1 = seaborn.regplot(x="urbanrate", y="alcconsumption", data=data2)
plt.xlabel('Urban Rate')
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Scatterplot for the Association Between Urban Rate and per-capita alcohol consumption')
print ('C->Q bar graph')
# catplot replaces the deprecated factorplot; errorbar=None replaces ci=None
seaborn.catplot(x='urbgrps', y='alcconsumption', data=data2, kind="bar", errorbar=None)
plt.ylabel('Per capita alcohol consumption in a year')
plt.title('Bar graph for the Association Between Urban Rate and per-capita alcohol consumption')

print ('x=urbanrate, y=lifeexpectancy')
print ('Q->Q scatter chart')
scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", data=data2)
plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the Association Between Urban Rate and life expectancy')
print ('C->Q bar graph')
seaborn.catplot(x='urbgrps', y='lifeexpectancy', data=data2, kind="bar", errorbar=None)
plt.ylabel('Life Expectancy')
plt.title('Bar graph for the Association Between Urban Rate and life expectancy')
Output and Description
Univariate Graphs
Alcconsumption:
alcconsumption : alcohol consumption per adult (age 15+) in litres
Description and F.D. : histogram
count    187.000000
mean       6.689412
std        4.899617
min        0.030000
25%        2.625000
50%        5.920000
75%        9.925000
max       23.010000
Name: alcconsumption, dtype: float64
Description and F.D. after collapsing into groups : bargraph
count            187
unique             9
top       (0.0, 2.5]
freq              45
Name: alcgrps, dtype: object
Description:
Centre : 6.689412 (mean), mode in between 2.5 and 5
Spread : 4.899617
The number of countries with higher per-capita alcohol consumption goes on decreasing as can be inferred from the graph.
Urbanrate:
urbanrate : Percentage of population living in urban areas
Description and F.D. : histogram
count    203.000000
mean      56.769360
std       23.844933
min       10.400000
25%       36.830000
50%       57.940000
75%       74.210000
max      100.000000
Name: urbanrate, dtype: float64
Description and F.D. after collapsing into groups : bargraph
count          203
unique           9
top       (60, 70]
freq            34
Name: urbgrps, dtype: object
Description:
Centre : 56.769360 (mean), mode somewhere between 60 and 70.
Spread : 23.844933
The countries appear to be roughly equally spread between the buckets below and above the mean urban rate.
Lifeexpectancy:
Description:
Skewed left
Centre : between 70 and 80
Bivariate Graphs
x=urbanrate, y=alcconsumption
Q->Q scatter chart [figure: Scatterplot for the Association Between Urban Rate and per-capita alcohol consumption]
C->Q bar graph [figure: per-capita alcohol consumption by urban-rate group]
Description:
It can be clearly seen that the per-capita alcohol consumption of a country tends to increase with its urban rate.
Hence the dataset supports the hypothesis posed at the beginning of the course (assignment for module 1).
The hypothesis was as follows: the higher a country's urban rate, the higher its per-capita alcohol consumption.
Output:
x=urbanrate, y=lifeexpectancy
Q->Q scatter chart [figure: Scatterplot for the Association Between Urban Rate and life expectancy]
C->Q bar graph [figure: life expectancy by urban-rate group]
Description:
It is clear from the graphs that life expectancy increases with urban rate.
Assignment_3
Week 03
Making Data Management Decisions
Dataset : GapMinder
Variables chosen : 'alcconsumption', 'urbanrate' and 'lifeexpectancy'
Program
# Assignment_3

import pandas
import numpy

#importing data
data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#printing number of rows and columns
print ('Rows')
print (len(data))
print ('columns')
print (len(data.columns))

#------- Variables under consideration------#
# alcconsumption
# urbanrate
# lifeexpectancy

# Setting values to numeric (to_numeric replaces the deprecated convert_objects)
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data2 = data

# alcconsumption

# initial F.D.
print ('alcconsumption : alcohol consumption per adult (age 15+) in litres')
c1_min = data['alcconsumption'].min()
c1_max = data['alcconsumption'].max()
print ('min and max value of alcconsumption : ')
print (c1_min)
print (c1_max)
# Step 1 : Setting aside missing data (not required)
# Step 2 : coding missing data (not required)
# Step 3 : creating secondary variables (not required)
# Step 4 : Grouping Values within individual variables
data2['alcgrps'] = pandas.cut(data2.alcconsumption, [0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25])
print ('F.D. of groups of values of variable alcconsumption :')
c1_grp = data2['alcgrps'].value_counts(sort=False, dropna=False)
print (c1_grp)
print ('Percentage (with NaN set aside) :')
p1_grp = data2['alcgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p1_grp)

# urbanrate

# Initial F.D.
print ('---------------------------')
print ('urbanrate : Percentage of population living in urban areas')
c2_min = data2['urbanrate'].min()
c2_max = data2['urbanrate'].max()
print ('min and max values : ')
print (c2_min)
print (c2_max)
data2['urbgrps'] = pandas.cut(data2.urbanrate, [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
c2_grp = data2['urbgrps'].value_counts(sort=False, dropna=False)
print ('F.D. of groups of values of variable urbanrate :')
print (c2_grp)
print ('Percentage (with NaN set aside) : ')
p2_grp = data2['urbgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p2_grp)

# lifeexpectancy

# Initial F.D.
print ('---------------------------')
print ('lifeexpectancy : years in avg a new born baby would live in current situation')
c3_min = data2['lifeexpectancy'].min()
c3_max = data2['lifeexpectancy'].max()
print ('min and max values : ')
print (c3_min)
print (c3_max)
data2['lifgrps'] = pandas.cut(data2.lifeexpectancy, [0, 50, 60, 70, 80, 90, 100])
c3_grp = data2['lifgrps'].value_counts(sort=False, dropna=False)
print ('F.D. of groups of values of variable lifeexpectancy :')
print (c3_grp)
print ('Percentage (with NaN set aside) : ')
p3_grp = data2['lifgrps'].value_counts(sort=False, dropna=True, normalize=True)
print (p3_grp)
Output
Rows
213
columns
16
alcconsumption : alcohol consumption per adult (age 15+) in litres
min and max value of alcconsumption :
0.03
23.01
F.D. of groups of values of variable alcconsumption :
(0.0, 2.5] 45
(2.5, 5.0] 36
(5.0, 7.5] 31
(7.5, 10.0] 30
(10.0, 12.5] 21
(12.5, 15.0] 13
(15.0, 17.5] 8
(17.5, 20.0] 2
(20.0, 22.5] 0
(22.5, 25.0] 1
NaN 26
Name: alcgrps, dtype: int64
Percentage (with NaN set aside) :
(0.0, 2.5] 0.240642
(2.5, 5.0] 0.192513
(5.0, 7.5] 0.165775
(7.5, 10.0] 0.160428
(10.0, 12.5] 0.112299
(12.5, 15.0] 0.069519
(15.0, 17.5] 0.042781
(17.5, 20.0] 0.010695
(20.0, 22.5] 0.000000
(22.5, 25.0] 0.005348
Name: alcgrps, dtype: float64
---------------------------
urbanrate : Percentage of population living in urban areas
min and max values :
10.4
100.0
F.D. of groups of values of variable urbanrate :
(0.0, 10.0] 0
(10.0, 20.0] 13
(20.0, 30.0] 22
(30.0, 40.0] 24
(40.0, 50.0] 22
(50.0, 60.0] 24
(60.0, 70.0] 34
(70.0, 80.0] 24
(80.0, 90.0] 21
(90.0, 100.0] 19
NaN 10
Name: urbgrps, dtype: int64
Percentage (with NaN set aside) :
(0, 10] 0.000000
(10, 20] 0.064039
(20, 30] 0.108374
(30, 40] 0.118227
(40, 50] 0.108374
(50, 60] 0.118227
(60, 70] 0.167488
(70, 80] 0.118227
(80, 90] 0.103448
(90, 100] 0.093596
Name: urbgrps, dtype: float64
---------------------------
lifeexpectancy : years in avg a new born baby would live in current situation
min and max values :
47.794
83.39399999999999
F.D. of groups of values of variable lifeexpectancy :
(0.0, 50.0] 9
(50.0, 60.0] 29
(60.0, 70.0] 38
(70.0, 80.0] 92
(80.0, 90.0] 23
(90.0, 100.0] 0
NaN 22
Name: lifgrps, dtype: int64
Percentage (with NaN set aside) :
(0, 50] 0.047120
(50, 60] 0.151832
(60, 70] 0.198953
(70, 80] 0.481675
(80, 90] 0.120419
(90, 100] 0.000000
Name: lifgrps, dtype: float64
Description
1. The values of the variables were grouped, and their F.D.s (frequency distributions) are provided in the output.
2. Frequency distributions:
a. alcconsumption : There are 26 unknown (NaN) values. The number of countries (individuals) decreases consistently as per-capita alcohol consumption increases.
b. urbanrate : There are 10 missing (NaN) values. No country has an urban rate below 10%, while 9.36% of countries have an urban rate greater than 90% and at most 100%. Unlike alcconsumption, the F.D. does not follow a consistently increasing or decreasing curve.
c. lifeexpectancy : There are 22 missing values. About 4.7% of countries have a life expectancy of at most 50 years. The number of countries increases with life expectancy up to the 70-80 year interval, after which it decreases.
Assignment_1
Week 01
Choosing a Research Question
Dataset :
I have chosen the GapMinder dataset
After much research and juggling with different topics and their associations, I have finally settled on the following topics and research question:
Topics:
1. Alcohol consumption per adult (15+ years)
2. Urban population percentage
Research Question:
Do the countries with more people living in the urban areas report more per capita alcohol consumption?
Hypothesis:
The higher the percentage of a country's population settled in urban areas, the higher the per-capita alcohol consumption in that country.
This hypothesis will be tested against the data.
Codebook:
It contains information on the two variables, alcconsumption and urbanrate.
Research Works Readings
1. As per 'Rural, Suburban, and Urban Variations in Alcohol Consumption in the United States: Findings From the National Epidemiologic Survey on Alcohol and Related Conditions': Abstinence is particularly common in the rural South, whereas alcohol disorders and excessive drinking are more problematic in the urban and rural Midwest. Health policies and interventions should be further targeted toward those places with higher risks of problem drinking.
2. As per the ‘Patterns and trends of alcohol consumption in rural and urban areas of China: findings from the China Kadoorie Biobank’: At baseline, 33% of men drank alcohol at least weekly (i.e., current regular), compared to only 2% of women. In men, current regular drinking was more common in urban (38%) than in rural (29%) areas at baseline. Among men, the proportion of current regular drinkers slightly decreased at resurvey (33% baseline vs. 29% resurvey), while the proportion of ex-regular drinkers slightly increased (4% vs. 6%), particularly among older men, with more than half of ex-regular drinkers stopping for health reasons. Among current regular drinkers, the proportion engaging in heavy episodic drinking (i.e., > 60 g/session) increased (30% baseline vs. 35% resurvey) in both rural (29% vs. 33%) and urban (31% vs. 36%) areas, particularly among younger men born in the 1970s (41% vs. 47%). Alcohol intake involved primarily spirits, at both baseline and resurvey. Those engaging in heavy drinking episodes tended to have multiple other health-related risk factors (e.g., regular smoking, low fruit intake, low physical activity and hypertension).
Conclusions from research readings:
In China, the urban population is somewhat ahead of the rural population in drinking, whereas in the USA the trend is less uniform (i.e. in the South the urban population consumes more alcohol than the rural population, whereas in the Midwest there is no difference between the two groups in terms of alcohol consumption).