#0-1037(0) error code | Explore Tumblr posts and blogs

Week 4 - Logistic Regression

To perform logistic regression, I created an independent variable with more than 2 groups (4). I used the 2008 residential electricity consumption per person which I divided into 0 = Very Low, 1 = Low, 2 = Medium and 3 = High. My response variable was internet use rate.

PYTHON CODE

import numpy import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm import seaborn as sb import statsmodels.formula.api as smf

data = pd.read_csv("gapminder.csv", low_memory=False, na_values = " ")

# bug fix for display formats to avoid run time errors pd.set_option('display.float_format', lambda x:'%.2f'%x)

#Setting my variables of interest to numeric

data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors='coerce') data["alcconsumption"] = pd.to_numeric(data["alcconsumption"], errors='coerce') data["femaleemployrate"] = pd.to_numeric(data["femaleemployrate"], errors='coerce') data["hivrate"] = pd.to_numeric(data["hivrate"], errors='coerce') data["suicedeper100th"] = pd.to_numeric(data["suicideper100th"], errors='coerce') data["employrate"] = pd.to_numeric(data["employrate"], errors='coerce') data["urbanrate"] = pd.to_numeric(data["urbanrate"], errors='coerce') data["lifeexpectancy"] = pd.to_numeric(data["lifeexpectancy"], errors='coerce') data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors='coerce') data["relectricperperson"] = pd.to_numeric(data["relectricperperson"], errors='coerce')

############################################################################### #DATA MANAGEMENT ###############################################################################

# Management for urbanrate def urbanrategrp (row): if row["urbanrate"] <= 57.94: return 0 elif row["urbanrate"] > 57.94: return 1

data['urbanrategrp'] = data.apply (lambda row: urbanrategrp(row),axis=1)

print("Counts (Frequencies) for urbanrategrp") c1 = data["urbanrategrp"].value_counts(sort = False, dropna = True) print(c1)

# Management for incomeperperson def incomeperpersongrp (row): if row["incomeperperson"] <=2553.50: return 0 elif row["incomeperperson"] > 2553.50: return 1

data["incomeperpersongrp"] = data.apply (lambda row: incomeperpersongrp(row),axis=1)

print("Counts (Frequencies) for incomeperpersongrp") c2 = data["incomeperpersongrp"].value_counts(sort = False, dropna = True) print(c2)

# Management for employrate def employrategrp (row): if row["employrate"] <= 58.70: return 0 elif row["employrate"] > 58.70: return 1

data["employrategrp"] = data.apply (lambda row: employrategrp(row),axis=1)

print("Counts (Frequencies) for employrategrp") c3 = data["employrategrp"].value_counts(sort = False, dropna = True) print(c3)

# Management for relectricperperson to have more than 2 categories data["relectric"] = pd.cut(data.relectricperperson, bins=[0, 203.65, 597.14, 1491.15, 11154.76], labels=["0", "1", "2", "3"])

print("Counts (Frequencies) for relectric") c4 = data["relectric"].value_counts(sort = False, dropna = True) print(c4)

# Create new dataset sub1 = data.copy()

############################################################################### # MULTIPLE REGRESSION & CONFIDENCE INTERVALS ###############################################################################

# Regression model for the association between urbanisation & internet use rate print("OLS Regression Model for the association between urbanrategrp & internetuserate") reg1 = smf.ols("internetuserate ~ urbanrategrp", data=sub1).fit() print(reg1.summary())

# Adding incomeperperson as an explanatory variable # center quantitative IVs for regression analysis sub1['incomeperpersongrp_c'] = (sub1['incomeperpersongrp'] - sub1['incomeperpersongrp'].mean()) print (sub1['incomeperpersongrp_c'].mean())

sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean()) print (sub1['urbanrate_c'].mean())

# multiple regression analysis with urbanrate & incomeperperson reg2 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp', data=sub1).fit() print (reg2.summary())

#multiple regression analysis with urbanrate & incomeperperson plus employmentrate sub1['employrate_c'] = (sub1['employrate'] - sub1['employrate'].mean()) print (sub1['employrate_c'].mean())

reg3 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c', data=sub1).fit() print (reg3.summary())

############################################################################## # CATEGORICAL VARIABLES WITH 3+ CATEGORIES ##############################################################################

# adding 4 category relectricperpersongrp. Reference group coding is called "Treatment" coding in python # and the default reference catergory is the group with a value = 0 (Very low)

reg4 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c+ C(relectric)', data=sub1).fit() print (reg4.summary())

# can override the default ad specify a different reference group # low (1) as reference group

reg5 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c+ C(relectric, Treatment(reference=1))', data=sub1).fit() print (reg5.summary())

############################################################################## # LOGISTIC REGRESSION ##############################################################################

# logistic regression with income per person # For this analysalyis, i want to determine association between urbanization and income per person # urbanrate will be the response variable with two levels # Income per person will be my Independent variable which is also categorical with 2 levels

from scipy import stats stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

lreg1 = smf.logit(formula = 'urbanrategrp ~ incomeperpersongrp', data = sub1).fit() print (lreg1.summary())

# odds ratios print ("Odds Ratios") print (numpy.exp(lreg1.params))

# odd ratios with 95% confidence intervals params = lreg1.params conf = lreg1.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))

# logistic regression with income per person and employment rate lreg2 = smf.logit(formula = 'urbanrategrp ~ incomeperpersongrp + employrategrp', data = sub1).fit() print (lreg2.summary())

# odd ratios with 95% confidence intervals params = lreg2.params conf = lreg2.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))

OUTPUT

Counts (Frequencies) for urbanrategrp 0.00 102 1.00 101 Name: urbanrategrp, dtype: int64 Counts (Frequencies) for incomeperpersongrp 0.00 95 1.00 95 Name: incomeperpersongrp, dtype: int64 Counts (Frequencies) for employrategrp 0.00 89 1.00 89 Name: employrategrp, dtype: int64 Counts (Frequencies) for relectric 0 29 1 34 2 34 3 34 Name: relectric, dtype: int64 OLS Regression Model for the association between urbanrategrp & internetuserate OLS Regression Results ============================================================================== Dep. Variable: internetuserate R-squared: 0.305 Model: OLS Adj. R-squared: 0.301 Method: Least Squares F-statistic: 82.32 Date: Sun, 27 May 2018 Prob (F-statistic): 1.55e-16 Time: 11:13:37 Log-Likelihood: -866.59 No. Observations: 190 AIC: 1737. Df Residuals: 188 BIC: 1744. Df Model: 1 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- Intercept 20.5337 2.363 8.689 0.000 15.872 25.195 urbanrategrp 30.6460 3.378 9.073 0.000 23.983 37.309 ============================================================================== Omnibus: 6.785 Durbin-Watson: 1.933 Prob(Omnibus): 0.034 Jarque-Bera (JB): 7.040 Skew: 0.460 Prob(JB): 0.0296 Kurtosis: 2.790 Cond. No. 2.59 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 0.0 1.8446109445594718e-14 OLS Regression Results ============================================================================== Dep. Variable: internetuserate R-squared: 0.530 Model: OLS Adj. R-squared: 0.524 Method: Least Squares F-statistic: 100.8 Date: Sun, 27 May 2018 Prob (F-statistic): 4.78e-30 Time: 11:13:38 Log-Likelihood: -795.87 No. Observations: 182 AIC: 1598. Df Residuals: 179 BIC: 1607. Df Model: 2 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------- Intercept 13.1658 2.136 6.163 0.000 8.950 17.381 urbanrategrp 13.0717 3.601 3.630 0.000 5.966 20.177 incomeperpersongrp 31.4597 3.600 8.739 0.000 24.356 38.563 ============================================================================== Omnibus: 0.300 Durbin-Watson: 1.884 Prob(Omnibus): 0.861 Jarque-Bera (JB): 0.457 Skew: -0.034 Prob(JB): 0.796 Kurtosis: 2.764 Cond. No. 4.08 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. -2.674514791905995e-15 OLS Regression Results ============================================================================== Dep. Variable: internetuserate R-squared: 0.532 Model: OLS Adj. R-squared: 0.523 Method: Least Squares F-statistic: 60.56 Date: Sun, 27 May 2018 Prob (F-statistic): 3.25e-26 Time: 11:13:38 Log-Likelihood: -715.81 No. Observations: 164 AIC: 1440. Df Residuals: 160 BIC: 1452. Df Model: 3 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------- Intercept 12.4532 2.299 5.417 0.000 7.913 16.993 urbanrategrp 15.6193 4.071 3.837 0.000 7.579 23.660 incomeperpersongrp 28.3856 4.003 7.091 0.000 20.480 36.292 employrate_c -0.0230 0.152 -0.151 0.880 -0.324 0.278 ============================================================================== Omnibus: 0.084 Durbin-Watson: 1.833 Prob(Omnibus): 0.959 Jarque-Bera (JB): 0.229 Skew: -0.013 Prob(JB): 0.892 Kurtosis: 2.819 Cond. No. 35.5 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: internetuserate R-squared: 0.743 Model: OLS Adj. R-squared: 0.730 Method: Least Squares F-statistic: 57.30 Date: Sun, 27 May 2018 Prob (F-statistic): 8.35e-33 Time: 11:13:38 Log-Likelihood: -511.27 No. Observations: 126 AIC: 1037. Df Residuals: 119 BIC: 1056. Df Model: 6 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------- Intercept 8.2848 2.905 2.852 0.005 2.533 14.037 C(relectric)[T.1] 11.7375 4.290 2.736 0.007 3.243 20.232 C(relectric)[T.2] 33.7211 5.122 6.584 0.000 23.579 43.863 C(relectric)[T.3] 52.0191 5.486 9.483 0.000 41.157 62.881 urbanrategrp 1.0812 3.899 0.277 0.782 -6.639 8.801 incomeperpersongrp 11.1245 4.214 2.640 0.009 2.781 19.468 employrate_c 0.2236 0.156 1.430 0.155 -0.086 0.533 ============================================================================== Omnibus: 1.298 Durbin-Watson: 1.803 Prob(Omnibus): 0.522 Jarque-Bera (JB): 1.052 Skew: -0.222 Prob(JB): 0.591 Kurtosis: 3.050 Cond. No. 58.6 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: internetuserate R-squared: 0.743 Model: OLS Adj. R-squared: 0.730 Method: Least Squares F-statistic: 57.30 Date: Sun, 27 May 2018 Prob (F-statistic): 8.35e-33 Time: 11:13:38 Log-Likelihood: -511.27 No. Observations: 126 AIC: 1037. Df Residuals: 119 BIC: 1056. Df Model: 6 Covariance Type: nonrobust ============================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------------------- Intercept 20.0223 3.178 6.301 0.000 13.730 26.315 C(relectric, Treatment(reference=1))[T.0] -11.7375 4.290 -2.736 0.007 -20.232 -3.243 C(relectric, Treatment(reference=1))[T.2] 21.9836 3.786 5.806 0.000 14.486 29.481 C(relectric, Treatment(reference=1))[T.3] 40.2816 4.073 9.891 0.000 32.217 48.346 urbanrategrp 1.0812 3.899 0.277 0.782 -6.639 8.801 incomeperpersongrp 11.1245 4.214 2.640 0.009 2.781 19.468 employrate_c 0.2236 0.156 1.430 0.155 -0.086 0.533 ============================================================================== Omnibus: 1.298 Durbin-Watson: 1.803 Prob(Omnibus): 0.522 Jarque-Bera (JB): 1.052 Skew: -0.222 Prob(JB): 0.591 Kurtosis: 3.050 Cond. No. 39.1 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. Optimization terminated successfully. Current function value: 0.501584 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: urbanrategrp No. Observations: 189 Model: Logit Df Residuals: 187 Method: MLE Df Model: 1 Date: Sun, 27 May 2018 Pseudo R-squ.: 0.2762 Time: 11:13:40 Log-Likelihood: -94.799 converged: True LL-Null: -130.98 LLR p-value: 1.790e-17 ====================================================================================== coef std err z P>|z| [0.025 0.975] -------------------------------------------------------------------------------------- Intercept -1.4404 0.262 -5.495 0.000 -1.954 -0.927 incomeperpersongrp 2.7621 0.363 7.601 0.000 2.050 3.474 ====================================================================================== Odds Ratios Intercept 0.24 incomeperpersongrp 15.83 dtype: float64 Lower CI Upper CI OR Intercept 0.14 0.40 0.24 incomeperpersongrp 7.77 32.28 15.83 Optimization terminated successfully. Current function value: 0.452287 Iterations 6 Logit Regression Results ============================================================================== Dep. Variable: urbanrategrp No. Observations: 166 Model: Logit Df Residuals: 163 Method: MLE Df Model: 2 Date: Sun, 27 May 2018 Pseudo R-squ.: 0.3475 Time: 11:13:40 Log-Likelihood: -75.080 converged: True LL-Null: -115.06 LLR p-value: 4.322e-18 ====================================================================================== coef std err z P>|z| [0.025 0.975] -------------------------------------------------------------------------------------- Intercept -1.3836 0.368 -3.762 0.000 -2.104 -0.663 incomeperpersongrp 3.1580 0.418 7.559 0.000 2.339 3.977 employrategrp -0.2710 0.417 -0.650 0.516 -1.089 0.547 ====================================================================================== Lower CI Upper CI OR Intercept 0.12 0.52 0.25 incomeperpersongrp 10.37 53.34 23.52 employrategrp 0.34 1.73 0.76

COMMENTS

Internet use rate was significantly different between the 3 different electricity consumption groups (low, medium and high) in comparison to the default reference group (very low consumption) p = 0.007 beta = 11.7, p < 0.0001 beta = 33.7 and p < 0.0001 beta = 52.0 respectively.

After changing the reference group to the low consumption group, the very low, medium and high consumption groups were significant; p = 0.007 beta -11.7, p < 0.0001 beta = 22, p < 0.0001 beta = 40.3 respectively. The -11.7 beta for the very low consumption represents a negative association.

Urban rate was significantly associated with internet use rate p < 0.0001 beta = 15.6 but after introduction of electricity consumption the association was no longer significant p = 0.782 beta = 1. showing evidence of confounding.

Income per person was significantly associated with internet use rate but employment rate wasn’t; p = 0.009 beta = 11.1 and p = 0.155 beta = 0.2 respectively.

ODDS RRATIOS

After adjusting for the potential confounding factor employment rate, the odds of internet use rate were 23.5 times higher for countries with high income than for those with low income (OR = 23.5, 95% CI = 10.37 - 53.34, p < 0001)