shskpadhy - Tumblr blog

shskpadhy · 5 years ago

Text

Assignment 03-04 (Testing a Logistic Regression Model)

Dataset : Gapminder

Variables

The following derived variabbles (obtained by categorizing provided variables) are used:

lifgrps (response variable) : derived from lifexpectancy by setting the calue for lifexpectancy greater than or equal to 65 as 1 else 0.

urbgrps (primary explanatory variable) : derived from urbanrate, for individuals with urbanrate more than mean (urb_mean), value is 1 else 0.

alcgrps : derived from alcconsumption, for individuals with alcconsumption more than mean (alc_mean), value is 1 else 0.

incgrps : derived from incomeperperson, for individuals with incomeperperson more than mean (inc_mean), value is 1 else 0.

relgrps : derived from relectricperperson, for individuals with relectricperperson more than mean (rel_mean), value is 1 else 0.

The explanation of the variables were provided in the previous post.

Research Question

H0 :There does not exist an association between urbanrate and lifexpectancy

H1 : Lifexpectancy increases with urbanrate

Here we would test the research question as:

H1 : Number of countries with lifgrps = 1 in urbgrps = 1 category is more than that of the category urbgrps = 0.

H0 : There does not exist such association

Output

Rows 213 columns 16 =================================== Logistic Regression Modelling =================================== lreg1 : lifgrps ~ urbgrps Optimization terminated successfully. Current function value: 0.591261 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: lifgrps No. Observations: 213 Model: Logit Df Residuals: 211 Method: MLE Df Model: 1 Date: Fri, 24 Jul 2020 Pseudo R-squ.: 0.07575 Time: 23:20:52 Log-Likelihood: -125.94 converged: True LL-Null: -136.26 Covariance Type: nonrobust LLR p-value: 5.534e-06 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 0.0202 0.201 0.101 0.920 -0.374 0.414 urbgrps[T.1L] 1.3552 0.308 4.400 0.000 0.751 1.959 ================================================================================= Odds Ratios Intercept 1.020408 urbgrps[T.1L] 3.877391 dtype: float64 odd ratios with 95% confidence intervals Lower CI Upper CI OR Intercept 0.688125 1.513145 1.020408 urbgrps[T.1L] 2.120087 7.091294 3.877391 -------------------------------- Optimization terminated successfully. Current function value: 0.625930 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: lifgrps No. Observations: 213 Model: Logit Df Residuals: 211 Method: MLE Df Model: 1 Date: Fri, 24 Jul 2020 Pseudo R-squ.: 0.02155 Time: 23:20:52 Log-Likelihood: -133.32 converged: True LL-Null: -136.26 Covariance Type: nonrobust LLR p-value: 0.01537 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.4055 0.179 2.265 0.024 0.055 0.756 alcgrps 0.7419 0.313 2.371 0.018 0.129 1.355 ============================================================================== Hence lifgrps is not associated with alcgrps -------------------------------- Optimization terminated successfully. Current function value: 0.631670 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: lifgrps No. Observations: 213 Model: Logit Df Residuals: 211 Method: MLE Df Model: 1 Date: Fri, 24 Jul 2020 Pseudo R-squ.: 0.01258 Time: 23:20:53 Log-Likelihood: -134.55 converged: True LL-Null: -136.26 Covariance Type: nonrobust LLR p-value: 0.06407 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 0.4841 0.175 2.772 0.006 0.142 0.826 incgrps[T.1L] 0.5788 0.318 1.819 0.069 -0.045 1.203 ================================================================================= Hence lifgrps is not associated with alcgrps --------------------------------

Logit Regression Results ============================================================================== Dep. Variable: lifgrps No. Observations: 213 Model: Logit Df Residuals: 211 Method: MLE Df Model: 1 Date: Fri, 24 Jul 2020 Pseudo R-squ.: 0.01258 Time: 23:20:53 Log-Likelihood: -134.55 converged: True LL-Null: -136.26 Covariance Type: nonrobust LLR p-value: 0.06407 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 0.4841 0.175 2.772 0.006 0.142 0.826 relgrps[T.1L] 0.5788 0.318 1.819 0.069 -0.045 1.203 ================================================================================= Hence lifgrps is not associated with alcgrps --------------------------------

Summary

The logistic regression model with lifgrps as response variable and urbgrps as explanatory variable depicts that lifgrps is well associated with urbgrps. The following statistics were obtained from summary:

p-value : less than 0.0001

odds : (2.120087, 7.091294, 3.877391) for lower, upper and OR. 95% confidence interval was taken

Then association of lifgrps with alcgrps, incgrps and relgrps was also tested individually but the results showed that no association exists as can be interpreted by higher p-values.

Regarding the research question, the null hypothesis can be rejected as there is enough evidence against it, as can be seen from significant p-values. Thus there is an association between lifgrps and urbgrps.

Finally The Code

import pandas import numpy import seaborn import matplotlib.pyplot as plt import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi import scipy.stats

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption # urbanrate # lifeexpectancy # incomeperperson

# Setting values to numeric data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True) data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True) data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True) data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True) data['relectricperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)

data2 = data

# Categorizing lifeexpectancy as lifgrps

def lifgrps (row): if row['lifeexpectancy'] >= 65 : return 1 else : return 0 data2['lifgrps'] = data2.apply (lambda row: lifgrps (row),axis=1) data2['lifgrps'] = data2['lifgrps'].convert_objects(convert_numeric=True)

# Logistic Regression Modelling

print ('===================================') print ('Logistic Regression Modelling') print ('===================================')

# Categorizing urbanrate as urbgrps

urb_mean = data2['urbanrate'].mean() def urbgrps (row): if row['urbanrate'] <= urb_mean : return 0 else : return 1 data2['urbgrps'] = data2.apply (lambda row: urbgrps (row),axis=1) data2["urbgrps"] = data2["urbgrps"].astype('category')

print ('lreg1 : lifgrps ~ urbgrps') lreg1 = smf.logit(formula = 'lifgrps ~ urbgrps', data = data2).fit() print (lreg1.summary())

# odds ratios print ("Odds Ratios") print (numpy.exp(lreg1.params))

# odd ratios with 95% confidence intervals print ('odd ratios with 95% confidence intervals') params = lreg1.params conf = lreg1.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))

print ('--------------------------------')

# categorizing alcconsumption into grps alc_mean = data2['alcconsumption'].mean() def alcgrps (row): if row['alcconsumption'] >= alc_mean : return 1 else : return 0 data2['alcgrps'] = data2.apply (lambda row: alcgrps (row),axis=1) data2['alcgrps'] = data2['alcgrps'].convert_objects(convert_numeric=True)

lreg2_1 = smf.logit(formula = 'lifgrps ~ alcgrps', data = data2).fit() print (lreg2_1.summary())

print ('Hence lifgrps is not associated with alcgrps')

print ('--------------------------------')

# categorizing incomeperperson inc_mean = data2['incomeperperson'].mean() def incgrps (row): if row['incomeperperson'] <= inc_mean : return 0 else : return 1 data2['incgrps'] = data2.apply (lambda row: incgrps (row),axis=1) data2["incgrps"] = data2["incgrps"].astype('category')

lreg3_1 = smf.logit(formula = 'lifgrps ~ incgrps', data = data2).fit() print (lreg3_1.summary())

print ('Hence lifgrps is not associated with alcgrps')

print ('--------------------------------')

# Categorizing urbanrate as urbgrps

rel_mean = data2['relectricperperson'].mean() def relgrps (row): if row['relectricperperson'] <= rel_mean : return 0 else : return 1 data2['relgrps'] = data2.apply (lambda row: relgrps (row),axis=1) data2["relgrps"] = data2["relgrps"].astype('category')

lreg4_1 = smf.logit(formula = 'lifgrps ~ relgrps', data = data2).fit() print (lreg4_1.summary())

print ('Hence lifgrps is not associated with alcgrps')

print ('--------------------------------')

0 notes

shskpadhy · 5 years ago

Text

Assignment 03_03 (Test A Multiple Regression Model)

Dataset

GapMinder

Variables

· alcconsumption (response variable) : per capita (age : 15+ years) alcohol consumption of a country

· urbanrate (primary explanatory variable) : percentage of population of the country settled in urban areas

· internetuserate : number of people per 100, who have access to the world wide web

· incomeperperson : Gross Domestic Product per capita in constant 2000 US$

Primary Research Question

H0 : there is no association between alcconsumption and urbanrate

HA : alcconsumption increases with urbanrate

Summary

· Considering the research question, we get enough evidence against H0 and hence conclude that alcconsumption increases with urbanrate. This can be seen from the summary of reg1 (specification with urbanrate as the only explanatory variable and alcconsumption as response variable), which has p-value 0.000171 and r2 value 0.075. The equation of regression is y = 6.8453 + 0.0591*xurb_c (urb_c is the urbanrate with mean centered at 0). The standardized residuals plot tells that the model is acceptable as less than 5% residuals fall outside |y|=2. Still the r2 value is very less.

· In reg2, int_c (internetuserate with mean centred at 0) was added, seeking greater r2. It was found that internetuserate confounded urbanrate. r2 = 0.303 and p-value = 1.8*10-14 Also equation : y = 7.7316 + 0.1077*xint_c + 0.0010*xint_c**2 From the regression diagnostic plots also it is clear that the model is acceptable.

· Taking the quest ahead, in reg3, inc_c (incomeperperson with mean centred at 0) was taken into account. Now int_c2 got confounded by inc_c, hence we remove it. r2= 0.338 and p-value= 3.96e-16 Regression line : y = 6.7842 + 0.1484*int_c + 0.0002*inc_c Again reg3 can be accepted as is evident from the diagnostic plots (<5% standard residuals fall beyond |y|=2).

· Hence the final model is reg3, it does not include the primary explanatory variable.

· Considering the regression diagnostic plots:

o QQ-plots for reg2 and reg3 are almost similar

o QQ-plot for reg1 shows bit more deviation indicating more errors

o Again standard residuals plot for reg2 and reg3 are similar with only 3 (1.71%) observations falling beyond |y|=2 , which is 4 (2.29%) for reg1. As both the percentages are below 5%, the models are acceptable.

o The leverage plot of reg3shows that the residuals decrease with increase in inc_c and also with increase in int_c. This means the residuals are not independent of the values of the explanatory variables. Thus there must be another explanatory variable that is also associated with the response variable alcconsumption.

Output

================================== Multiple Regression Modelling ================================== Starting with primary explanatory variavle : urbanrate OLS Regression Results ============================================================================== Dep. Variable: alcconsumption R-squared: 0.075 Model: OLS Adj. R-squared: 0.070 Method: Least Squares F-statistic: 14.73 Date: Mon, 20 Jul 2020 Prob (F-statistic): 0.000171 Time: 17:50:32 Log-Likelihood: -543.98 No. Observations: 183 AIC: 1092. Df Residuals: 181 BIC: 1098. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 6.8453 0.353 19.409 0.000 6.149 7.541 urb_c 0.0591 0.015 3.838 0.000 0.029 0.089 ============================================================================== Omnibus: 10.025 Durbin-Watson: 1.958 Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.257 Skew: 0.573 Prob(JB): 0.00592 Kurtosis: 3.178 Cond. No. 23.0 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. --------------------------------------

Including internetuserate

============================================================================== Dep. Variable: alcconsumption R-squared: 0.303 Model: OLS Adj. R-squared: 0.296 Method: Least Squares F-statistic: 38.13 Date: Mon, 20 Jul 2020 Prob (F-statistic): 1.80e-14 Time: 17:52:17 Log-Likelihood: -504.45 No. Observations: 178 AIC: 1015. Df Residuals: 175 BIC: 1024. Df Model: 2 Covariance Type: nonrobust ================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 7.7316 0.485 15.937 0.000 6.774 8.689 int_c 0.1077 0.013 8.497 0.000 0.083 0.133 I(int_c ** 2) -0.0010 0.000 -2.116 0.036 -0.002 -6.72e-05 ============================================================================== Omnibus: 6.860 Durbin-Watson: 2.042 Prob(Omnibus): 0.032 Jarque-Bera (JB): 7.862 Skew: 0.301 Prob(JB): 0.0196 Kurtosis: 3.836 Cond. No. 1.67e+03 ==============================================================================

As r-square value for int_c is greater than urb_c, we remove urb_c from our model Current r-square : 0.303, p-value : 1.80e-14 ------------------------------------------------------------------------------------ Considering incomeperperson

============================================================================== Dep. Variable: alcconsumption R-squared: 0.338 Model: OLS Adj. R-squared: 0.330 Method: Least Squares F-statistic: 43.90 Date: Mon, 20 Jul 2020 Prob (F-statistic): 3.96e-16 Time: 17:59:34 Log-Likelihood: -491.08 No. Observations: 175 AIC: 988.2 Df Residuals: 172 BIC: 997.7 Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 6.7842 0.311 21.815 0.000 6.170 7.398 int_c 0.1484 0.018 8.063 0.000 0.112 0.185 inc_c -0.0002 5.02e-05 -3.614 0.000 -0.000 -8.23e-05 ============================================================================== Omnibus: 5.758 Durbin-Watson: 2.004 Prob(Omnibus): 0.056 Jarque-Bera (JB): 6.258 Skew: 0.273 Prob(JB): 0.0438 Kurtosis: 3.749 Cond. No. 1.05e+04 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.05e+04. This might indicate that there are strong multicollinearity or other numerical problems.

As int_c**2 get confounded by inc_c, we remobe int_c**2 Current r-square : 0.338, p-value : 3.96e-16 ---------------------------------------------------------- ---------------------------------------------------------- ============================================ Regression Diagnostic Plots ============================================ QQ-plots for all stages

-------------------------------------

Standard Residuals for all stages

reg1

4 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem

reg2

3 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem

reg3

3 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem For reg2 amd reg3,there is one extreme outlier each (falls above y=3) ----------------------------------------------------- Leverage Plot

Finally The Code

# Assignment 03-03

import numpy import pandas import statsmodels.api as sm import seaborn import statsmodels.formula.api as smf import matplotlib.pyplot as plt

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption (response variable) # urbanrate (primary explanatory variable) # Internetuserate # incomeperperson

# Setting values to numeric data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True) data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True) data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True) data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)

# Multiple Regression Modelling

print ('==================================') print ('Multiple Regression Modelling') print ('==================================')

# urbanrate and alcconsumption print ('Starting with primary explanatory variavle : urbanrate')

data['urb_c'] = data['urbanrate'] - data['urbanrate'].mean()

reg1 = smf.ols(formula = 'alcconsumption ~ urb_c', data=data).fit() print (reg1.summary())

print ('--------------------------------------')

# Including internetuserate

print ('Including internetuserate')

data['int_c'] = data['internetuserate'] - data['internetuserate'].mean()

reg2 = smf.ols(formula = 'alcconsumption ~ int_c + I(int_c**2)', data=data).fit() print (reg2.summary())

print ('Current r-square : 0.075, p-value : 0.000171')

print ('--------------------------------------')

# Considering internetuserate

print ('Considering internetuserate')

data['int_c'] = data['internetuserate'] - data['internetuserate'].mean()

reg2 = smf.ols(formula = 'alcconsumption ~ int_c + I(int_c**2)', data=data).fit() print (reg2.summary())

print ('As r-square value for int_c is greater than urb_c, we remove urb_c from our model') print ('Current r-square : 0.303, p-value : 1.80e-14')

print ('------------------------------------------------------------------------------------')

# Considering incomeperperson

print ('Considering incomeperperson')

data['inc_c'] = data['incomeperperson'] - data['incomeperperson'].mean()

reg3 = smf.ols(formula = 'alcconsumption ~ int_c + inc_c', data=data).fit() print (reg3.summary())

print ('As int_c**2 get confounded by inc_c, we remobe int_c**2') print ('Current r-square : 0.338, p-value : 3.96e-16')

print ('----------------------------------------------------------') print ('----------------------------------------------------------')

# Regression Diagnostic Plots

print ('============================================') print ('Regression Diagnostic Plots') print ('============================================')

print ('QQ-plots for all stages') err_qq_1 = sm.qqplot(reg1.resid, line='r') err_qq_2 = sm.qqplot(reg2.resid, line='r') err_qq_3 = sm.qqplot(reg3.resid, line='r') print ('-------------------------------------')

print ('Standardized residuals for all stages')

print ('reg1') std_res_1 = pandas.DataFrame(reg1.resid_pearson) err_std_1 = plt.plot(std_res_1, 'o', ls='none') I = plt.axhline(y=0, color='r') I = plt.axhline(y=2, color='g') I = plt.axhline(y=-2, color='g') plt.ylabel('Std Res') plt.xlabel('# of ovs') print (err_std_1) print ('4 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem')

print ('reg2') std_res_2 = pandas.DataFrame(reg2.resid_pearson) err_std_2 = plt.plot(std_res_2, 'o', ls='none') I = plt.axhline(y=0, color='r') I = plt.axhline(y=2, color='g') I = plt.axhline(y=-2, color='g') plt.ylabel('Std Res') plt.xlabel('# of ovs') print (err_std_2) print ('3 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem')

print ('reg3') std_res_3 = pandas.DataFrame(reg3.resid_pearson) err_std_3 = plt.plot(std_res_3, 'o', ls='none') I = plt.axhline(y=0, color='r') I = plt.axhline(y=2, color='g') I = plt.axhline(y=-2, color='g') plt.ylabel('Std Res') plt.xlabel('# of ovs') print (err_std_3) print ('3 residuals fall above y=2; less than 5% (since total $ of obs>100), hence no problem')

print ('For reg2 amd reg3,there is one extreme outlier each (falls above y=3)')

print ('-----------------------------------------------------')

# Leverage Plot print ('Leverage Plot') err_lev_1 = plt.figure() err_lev_1 = sm.graphics.plot_regress_exog(reg3, 'int_c', fig=err_lev_1) err_lev_1

err_lev_2 = plt.figure() err_lev_2 = sm.graphics.plot_regress_exog(reg3, 'inc_c', fig=err_lev_2) err_lev_2

0 notes

shskpadhy · 5 years ago

Text

Assignment 03-02 (Testing A Basic Linear Regression Model)

Dataset : GapMinder

Variables

urbanrate (primary explanatory variable) : centered at 0 and stored in urb_c, this urb_c is used for modelling the linear regression model

lifeexpectancy (response variable)

Summary

slope : 0.2628

intercept : 69.6752

Hence y = 69.6752 + 0.2628x, where x = urban rate - 56.7693596059(mean urban rate) and y = expected life-expectancy

Value of r-squared = 0.375, F-statistic = 104.6 and p-value = 1.61e-19 and correlation coefficien t = 0.6127112161764898

Output

Rows 213 columns 16 ================================== Regression Modelling ================================== Centring explanatory variable urbanrate and storing it into variable urb_c mean of urbanrate is 56.7693596059 Describing urb_c count 203.000000 mean 0.000000 std 23.844933 min -46.369360 25% -19.939360 50% 1.170640 75% 17.440640 max 43.230640 Name: urb_c, dtype: float64 Finally the regression model OLS Regression Results ============================================================================== Dep. Variable: lifeexpectancy R-squared: 0.375 Model: OLS Adj. R-squared: 0.372 Method: Least Squares F-statistic: 104.6 Date: Fri, 17 Jul 2020 Prob (F-statistic): 1.61e-19 Time: 17:46:01 Log-Likelihood: -610.02 No. Observations: 176 AIC: 1224. Df Residuals: 174 BIC: 1230. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 69.6752 0.589 118.201 0.000 68.512 70.839 urb_c 0.2628 0.026 10.227 0.000 0.212 0.313 ============================================================================== Omnibus: 11.099 Durbin-Watson: 1.866 Prob(Omnibus): 0.004 Jarque-Bera (JB): 12.104 Skew: -0.637 Prob(JB): 0.00235 Kurtosis: 2.827 Cond. No. 23.0 ==============================================================================

Running Pearsons Correlation Test for getting value of correlation coefficient (0.6127112161764898, 1.607809724025055e-19)

Finally The Code

# -*- coding: utf-8 -*- """ Created on Fri Jul 17 16:01:24 2020

@author: ASUS """

import numpy as numpyp import pandas as pandas import statsmodels.api import statsmodels.formula.api as smf import seaborn import scipy

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption # urbanrate # lifeexpectancy

# Regression Modelling

print ('==================================') print ('Regression Modelling') print ('==================================')

# centring explanatory variable urbanrate and storing it into variable urb_c print ('Centring explanatory variable urbanrate and storing it into variable urb_c') urb_m = data['urbanrate'].mean() print ('mean of urbanrate is ') print (urb_m)

def urb_c (row): return row['urbanrate']-urb_m

data['urb_c'] = data.apply (lambda row: urb_c (row), axis=1)

print ('Describing urb_c') temp = data['urb_c'].describe() print (temp)

data_c=data.dropna()

print ('Finally the regression model') reg1 = smf.ols(formula = 'lifeexpectancy ~ urb_c', data=data_c).fit() print (reg1.summary())

scat3 = seaborn.regplot(x="urb_c", y="lifeexpectancy", fit_reg=True, data=data) plt.xlabel('Urban Rate centered at 0') plt.ylabel('Life-expectancy')

print ('Running Pearsons Correlation Test for getting value of correlation coefficient') print (scipy.stats.pearsonr(data_c['urb_c'], data_c['lifeexpectancy']))

0 notes

shskpadhy · 5 years ago

Text

Assignment 03-01 (Writing About Your Data)

Sample

The sample is taken from the GapMinder dataset.

As collective information is taken about each country (observation), i.e the individuals are the countries, it is an aggregate level analysis.

Data Collection Procedure

The goal of GapMinder Foundations is to fight devastating ignorance with a fact based world view that everyone can understand, using inferences from the data collected.

The dataset contains data on variables like life-expectancy, urban-rate, per-capita alcohol consumption (for people above 15 years), income per person, HIV rate, number of breast-cancer cases per 100 thousand women, etc for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas.

The data is collected by GapMinder from a handful of sources, including the Institute for Health Metrics and Evaulation, US Census Bureau’s International Database, United Nations Statistics Division, and the World Bank.

As the GapMinder foundations has not carried out any survey or experiment on its own, nor has it observed the population, it can be concluded that the study design generating the data is data-reporting.

The data regarding different variables of the dataset was taken from different sources at different times, there is no information regarding when the whole data was collected.

Clearly the data is not an experimental data as no explanatory variable is manipulated. Rather it is an observational data.

Variables

0 notes

shskpadhy · 5 years ago

Text

Assignment 03-01 (Writing About Your Data)

Sample

The GapMinder dataset is used.

Data Collection Procedure

Clearly the data is not an experimental data as no explanatory variable is manipulated. Rather it is an observational data.

Variables

alcconsumption : it measures the amount of pure alcohol consumed by an individual (age 15+) in litres in a year.

0 notes

shskpadhy · 5 years ago

Text

Assignment 02_04 (Course-02, Week-04 : Testing A Potential Moderator)

Dataset : GapMinder

Variiables

urbanrate (explanatory variable)

lifeexpectancy (response variable)

alcgrps : alcconsumption collapsed into 4 groups containing 1st, 2nd, 3rd and 4th quartiles

Summary

The last row shows the correlation coefficient and p-value for the Pearson Correlation Coefficient Test where the whole dataset is considered.

The direction of the association does not appear to change with the moderation variable but the strength appears to change slightly.

Thus the moderator does not strongly alter the association between the two variables.

Output

Rows 213 columns 16 =============================================================== urbanrate vs lifeexpectancy without moderator ===============================================================

Overall analysis

association between urbanrate and lifeexpectancy (0.6075222955616916, 1.2247333760171806e-05)

=============================================================== urbanrate vs lifeexpectancy with alcconsumption as moderator =============================================================== alcgrps=1 : upto 25%ile from bottom

(0.6235967761519358, 3.6639464319432514e-06) --------------------------------------------------------

alcgrps=1 : between 25%ile and 50%ile from bottom

(0.45570230600060746, 0.0018802338429809984) ---------------------------------------

alcgrps=1 : between 50%ile and 75%ile from bottom

(0.5786487797163785, 5.9669496046994585e-05) ---------------------------------------

alcgrps=1 : between 75%ile and 100%ile from bottom

(0.6075222955616916, 1.2247333760171806e-05) ---------------------------------------

Finally The Code

import pandas import numpy import seaborn import scipy import matplotlib.pyplot as plt

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption # urbanrate # lifeexpectancy

# urbanrate vs. lifeexpectancy

print ('===============================================================') print ('urbanrate vs lifeexpectancy without moderator') print ('===============================================================')

data_c=data_sub.dropna()

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('Scatterplot for the Association Between Urban-rate amd Life-expectancy')

print ('Overall analysis') print ('association between urbanrate and lifeexpectancy') print (scipy.stats.pearsonr(data_c['urbanrate'], data_c['lifeexpectancy']))

# urbanrate vs. lifeexpectancy with alcconsumption as moderator

print ('===============================================================') print ('urbanrate vs lifeexpectancy with alcconsumption as moderator') print ('===============================================================')

#temp = data['alcconsumption'].describe() #print (temp)

# collapsing alcconsumption into groups

def alcgrps (row): if row['alcconsumption'] <= 2.625000: return 1 elif row['alcconsumption'] <= 5.920000: return 2 elif row['alcconsumption'] <= 9.925000: return 3 else: return 4

data['alcgrps'] = data.apply (lambda row: alcgrps (row), axis=1)

# alcgrpups = 1

print ('alcgrps=1 : upto 25%ile from bottom')

data_sub = data[(data['alcgrps']==1)] #print (data_sub['alcgrps'])#.describe())

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('alcgrps=1')

data_clean=data_sub.dropna() print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

print ('--------------------------------------------------------')

# alcgrpups = 2

print ('alcgrps=1 : between 25%ile and 50%ile from bottom')

data_sub = data[(data['alcgrps']==2)] #print (data_sub['alcgrps'])#.describe())

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('alcgrps=2')

data_clean=data_sub.dropna() print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

print ('---------------------------------------')

# alcgrpups = 3

print ('alcgrps=1 : between 50%ile and 75%ile from bottom')

data_sub = data[(data['alcgrps']==3)] #print (data_sub['alcgrps'])#.describe())

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('alcgrps=3')

data_clean=data_sub.dropna() print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

print ('---------------------------------------')

# alcgrpups = 4

print ('alcgrps=1 : between 75%ile and 100%ile from bottom')

data_sub = data[(data['alcgrps']==4)] #print (data_sub['alcgrps'])#.describe())

scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data_sub) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('alcgrps=4')

data_clean=data_sub.dropna() print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

print ('---------------------------------------')

0 notes

shskpadhy · 5 years ago

Text

Assignment 02_03 (Course-02, Week-03)

Generating Correlation Coefficient

Dataset : GapMinder

Variables :

urbanrate : % of population of the country living in urban areas

alcconsumption : per capita (age : 15+) alcohol consumption in a year

lifeexpectancy : how many years a normal new born baby would live in current situation

Research Question (Hypothesis)

Research Question 1

H0 : urbanrate and alcconsumption are independent of each other

HA : alcconsumption increases with urbanrate

Research Question 2

H0 : urbanrate and lifeexpectancy are independent of each other

HA : lifeexpectancy increases with urbanrate

Summary

Research Question 1

Value of correlation coefficient = 0.27446605904089333, thus there exists a positive correlation between urbanrate and alcconsumption, i.e. alcconsumption increases with urbanrate.

p-value = 0.00022753282212695448, which is less than 0.05

From points 1 and 2 above, it is concluded that H0 (null hypothesis) is rejected and HA (alternate hypothesis) is accepted

Although knowing the value of urbanrate, we can determine only 7.5331% variability in alcconsumption.

Research Question 2

Value of correlation coefficient = 0.6127112161764898, thus there exists a positive correlation between urbanrate and lifeexpectancy, i.e. lifeexpectamcy increases with urbanrate.

p-value = 1.607809724025055e-19, which is very much less than 0.05

From points 1 and 2 above, it is concluded that H0 (null hypothesis) is rejected and HA (alternate hypothesis) is accepted

Although knowing the value of urbanrate, we can determine only 37.5415% ariability in lifeexpectancy.

Output

Rows 213 columns 16 =================================== urbanrate vs alcconsumption =================================== Descriptive analysis

---------------------------- Inferencial analysis association between urbanrate and alcconsumption (0.27446605904089333, 0.00022753282212695448)

=================================== urbanrate vs lifeexpectancy =================================== Descriptive analysis

---------------------------- Inferencial analysis association between urbanrate and lifeexpectancy (0.6127112161764898, 1.607809724025055e-19)

Finally The Code

import pandas import numpy import seaborn import scipy import matplotlib.pyplot as plt

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption # urbanrate # lifeexpectancy

data_clean=data.dropna()

# urbanrate vs. alcconsumption

print ('===================================') print ('urbanrate vs alcconsumption') print ('===================================')

print ('Descriptive analysis') scat1 = seaborn.regplot(x="urbanrate", y="alcconsumption", fit_reg=True, data=data) plt.xlabel('Urban Rate') plt.ylabel('Per Capita Alcohol Consumption') plt.title('Scatterplot for the Association Between Urban Rate and Alcohol Consumption')

print ('----------------------------') print ('Inferencial analysis') print ('association between urbanrate and alcconsumption') print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['alcconsumption']))

# urbanrate vs. lifeexpectancy

print ('===================================') print ('urbanrate vs lifeexpectancy') print ('===================================')

print ('Descriptive analysis') scat2 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data) plt.xlabel('Urban Rate') plt.ylabel('Life-expectancy') plt.title('Scatterplot for the Association Between Urban-rate amd Life-expectancy')

print ('----------------------------') print ('Inferencial analysis') print ('association between urbanrate and lifeexpectancy') print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))

0 notes

shskpadhy · 5 years ago

Text

Assignment-02_02 (Course-02, Module-02)

Running Chi-Square Independence Test

Dataset : GapMinder

Variables

urbanrate (urbgrps after collapsing into categories)

lifeexpectancy ( lifgrps after collapsing into categories)

urbgrps : urbanrate is collapsed into categories of width 20%, hence having 5 categories (1 for 0%-20%, 2 for 20% to 40%, .... and 5 for 80% to 100%)

lifgrps : It is a two-valued variable, with value 1 if the country has life-expectancy greater than or equal ti 65% and 0 otherwise.

Research Question : Hypothesis

H0 : proportion of countries in each category of urbgrps having life-expectancy of 65+ years is equal

HA : The proportion is unequal for atleast two groups

Summary

The p-value for the Chi-square test is 0.000008, which is significantly less than 0.05. Hence we can reject the null hypothesis.

The following table shows the p-values of chi-square tests for each pair of categories.

There are 10 comparisons, hence our threshold p-value must be 0.05/10 = 0.005

It is clear that category-4 (countries with urban-rate between 60% and 80%) has significantly different life-expectancy

Output

Rows 213 columns 16 ===================================== Descriptive Statistical Analysis =====================================

===================================== Inferencial Statistical Analysis ===================================== urbgrps 1 2 3 4 5 lifgrps 0 9 24 17 6 16 1 4 22 29 52 34 G:/Data Science/Data Analysis and Interpretation (Coursera)/Codes/02-02-Assignment.py:46: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects() For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True) G:/Data Science/Data Analysis and Interpretation (Coursera)/Codes/02-02-Assignment.py:47: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects() For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True) G:/Data Science/Data Analysis and Interpretation (Coursera)/Codes/02-02-Assignment.py:48: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects() For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True) G:/Data Science/Data Analysis and Interpretation (Coursera)/Codes/02-02-Assignment.py:75: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects() For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. return 0 urbgrps 1 2 3 4 5 lifgrps 0 0.692308 0.521739 0.369565 0.103448 0.320000 1 0.307692 0.478261 0.630435 0.896552 0.680000 chi-square value, p value, expected counts (28.770249676716467, 8.704036143643654e-06, 4L, array([[ 4.3943662 , 15.54929577, 15.54929577, 19.6056338 , 16.90140845], [ 8.6056338 , 30.45070423, 30.45070423, 38.3943662 , 33.09859155]])) ============================= Post-hoc test New threshold for p-value : 0.005 1 vs 2 comp1v2 1.000000 2.000000 lifgrps 0 9 24 1 4 22 comp1v2 1.000000 2.000000 lifgrps 0 0.692308 0.521739 1 0.307692 0.478261 chi-square value, p value, expected counts (0.6044210109845561, 0.43689616446201507, 1L, array([[ 7.27118644, 25.72881356], [ 5.72881356, 20.27118644]])) 1 vs 3 comp 1.000000 3.000000 lifgrps 0 9 17 1 4 29 comp 1.000000 3.000000 lifgrps 0 0.692308 0.369565 1 0.307692 0.630435 chi-square value, p value, expected counts (3.073965958790373, 0.07955517016836518, 1L, array([[ 5.72881356, 20.27118644], [ 7.27118644, 25.72881356]])) 1 vs 4 comp 1.000000 4.000000 lifgrps 0 9 6 1 4 52 comp 1.000000 4.000000 lifgrps 0 0.692308 0.103448 1 0.307692 0.896552 chi-square value, p value, expected counts (18.70646985916383, 1.524642950087753e-05, 1L, array([[ 2.74647887, 12.25352113], [10.25352113, 45.74647887]])) 1 vs 5 comp 1.000000 5.000000 lifgrps 0 9 16 1 4 34 comp 1.000000 5.000000 lifgrps 0 0.692308 0.320000 1 0.307692 0.680000 chi-square value, p value, expected counts (4.520721862348178, 0.03348669970950359, 1L, array([[ 5.15873016, 19.84126984], [ 7.84126984, 30.15873016]])) 2 vs 3 comp 2.000000 3.000000 lifgrps 0 24 17 1 22 29 comp 2.000000 3.000000 lifgrps 0 0.521739 0.369565 1 0.478261 0.630435 chi-square value, p value, expected counts (1.5839311334289814, 0.20819535296566563, 1L, array([[20.5, 20.5], [25.5, 25.5]])) 2 vs 4 comp 2.000000 4.000000 lifgrps 0 24 6 1 22 52 comp 2.000000 4.000000 lifgrps 0 0.521739 0.103448 1 0.478261 0.896552 chi-square value, p value, expected counts (19.878233856044954, 8.253469598112272e-06, 1L, array([[13.26923077, 16.73076923], [32.73076923, 41.26923077]])) 2 vs 5 comp 2.000000 5.000000 lifgrps 0 24 16 1 22 34 comp 2.000000 5.000000 lifgrps 0 0.521739 0.320000 1 0.478261 0.680000 chi-square value, p value, expected counts (3.2246459627329176, 0.07253748022033618, 1L, array([[19.16666667, 20.83333333], [26.83333333, 29.16666667]])) 3 vs 4 comp 3.000000 4.000000 lifgrps 0 17 6 1 29 52 comp 3.000000 4.000000 lifgrps 0 0.369565 0.103448 1 0.630435 0.896552 chi-square value, p value, expected counts (9.059129050611572, 0.0026138645525417546, 1L, array([[10.17307692, 12.82692308], [35.82692308, 45.17307692]])) 3 vs 5 comp 3.000000 5.000000 lifgrps 0 17 16 1 29 34 comp 3.000000 5.000000 lifgrps 0 0.369565 0.320000 1 0.630435 0.680000 chi-square value, p value, expected counts (0.08745341614906833, 0.7674399219661803, 1L, array([[15.8125, 17.1875], [30.1875, 32.8125]])) 4 vs 5 comp 4.000000 5.000000 lifgrps 0 6 16 1 52 34 comp 4.000000 5.000000 lifgrps 0 0.103448 0.320000 1 0.896552 0.680000 chi-square value, p value, expected counts (6.485275205948824, 0.01087716976280061, 1L, array([[11.81481481, 10.18518519], [46.18518519, 39.81481481]]))

Finslly The Code

import pandas import numpy import seaborn import matplotlib.pyplot as plt import statsmodels.formula.api as smf import statsmodels.stats.multicomp as multi import scipy.stats

#importing data data = pandas.read_csv('Dataset_gapminder.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame pandas.set_option('display.max_columns', None) #Set PANDAS to show all rows in DataFrame pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors pandas.set_option('display.float_format', lambda x:'%f'%x)

#printing number of rows and columns print ('Rows') print (len(data)) print ('columns') print (len(data.columns))

#------- Variables under consideration------# # alcconsumption # urbanrate # lifeexpectancy

# Categorizing urbanrate as urbgrps

def urbgrps (row): if row['urbanrate'] <= 20 : return 1 elif row['urbanrate'] <= 40 : return 2 elif row['urbanrate'] <= 60 : return 3 elif row['urbanrate'] <= 80 : return 4 else : return 5 data2['urbgrps'] = data2.apply (lambda row: urbgrps (row),axis=1) data2["urbgrps"] = data2["urbgrps"].astype('category')

# Categorizing lifeexpectancy as lifgrps

def lifgrps (row): if row['lifeexpectancy'] >= 65 : return 1 else : return 0 data2['lifgrps'] = data2.apply (lambda row: lifgrps (row),axis=1) # Setting urbgrps to numeric data2['lifgrps'] = data2['lifgrps'].convert_objects(convert_numeric=True)

# Descriptive Analysis

print ('=====================================') print ('Descriptive Statistical Analysis') print ('=====================================') seaborn.factorplot(x='urbgrps', y='lifgrps', data=data2, kind="bar", ci=None) plt.xlabel('Urban rate') plt.ylabel('Proportion of countries in group with lifeexpectancy less than 65 years')

#Inferencial Statistic

print ('=====================================') print ('Inferencial Statistical Analysis') print ('=====================================')

# contingency table of observed counts ct1=pandas.crosstab(data2['lifgrps'], data2['urbgrps']) print (ct1)

# column percentages colsum=ct1.sum(axis=0) colpct=ct1/colsum print(colpct)

# chi-square value, p value, expected counts print ('chi-square value, p value, expected counts') cs1= scipy.stats.chi2_contingency(ct1) print (cs1)

# Post-hoc Test

print ('=============================') print ('Post-hoc test') print ('New threshold for p-value : ') print (0.05/10) print ('1 vs 2') recode1_2 = {1:1, 2:2} data2['comp1v2'] = data2['urbgrps'].map(recode1_2)

ct1_2 = pandas.crosstab(data2['lifgrps'], data2['comp1v2']) print (ct1_2)

colsum1_2 = ct1_2.sum(axis=0) colpct1_2 = ct1_2/colsum1_2 print (colpct1_2)

print ('chi-square value, p value, expected counts') cs1_2 = scipy.stats.chi2_contingency(ct1_2) print (cs1_2)

# 1 vs 3 print ('1 vs 3') recode = {1:1, 3:3} data2['comp'] = data2['urbgrps'].map(recode)