#0-1037(0) error code
Explore tagged Tumblr posts
panjinkhoma-blog · 7 years ago
Text
Week 4 - Logistic Regression
To perform logistic regression, I created an independent variable with more than 2 groups (4). I used the 2008 residential electricity consumption per person which I divided into 0 = Very Low, 1 = Low, 2 = Medium and 3 = High. My response variable was internet use rate.
PYTHON CODE
import numpy import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm import seaborn as sb import statsmodels.formula.api as smf
data = pd.read_csv("gapminder.csv", low_memory=False, na_values = " ")
# bug fix for display formats to avoid run time errors pd.set_option('display.float_format', lambda x:'%.2f'%x)
#Setting my variables of interest to numeric
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors='coerce') data["alcconsumption"] = pd.to_numeric(data["alcconsumption"], errors='coerce') data["femaleemployrate"] = pd.to_numeric(data["femaleemployrate"], errors='coerce') data["hivrate"] = pd.to_numeric(data["hivrate"], errors='coerce') data["suicedeper100th"] = pd.to_numeric(data["suicideper100th"], errors='coerce') data["employrate"] = pd.to_numeric(data["employrate"], errors='coerce') data["urbanrate"] = pd.to_numeric(data["urbanrate"], errors='coerce') data["lifeexpectancy"] = pd.to_numeric(data["lifeexpectancy"], errors='coerce') data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors='coerce') data["relectricperperson"] = pd.to_numeric(data["relectricperperson"], errors='coerce')
############################################################################### #DATA MANAGEMENT ###############################################################################
# Management for urbanrate def urbanrategrp (row):    if row["urbanrate"] <= 57.94:        return 0    elif row["urbanrate"] > 57.94:        return 1
data['urbanrategrp'] = data.apply (lambda row: urbanrategrp(row),axis=1)
print("Counts (Frequencies) for urbanrategrp") c1 = data["urbanrategrp"].value_counts(sort = False, dropna = True) print(c1)  
# Management for incomeperperson def incomeperpersongrp (row):    if row["incomeperperson"] <=2553.50:        return 0    elif row["incomeperperson"] > 2553.50:        return 1
data["incomeperpersongrp"] = data.apply (lambda row: incomeperpersongrp(row),axis=1)
print("Counts (Frequencies) for incomeperpersongrp") c2 = data["incomeperpersongrp"].value_counts(sort = False, dropna = True) print(c2)  
# Management for employrate def employrategrp (row):    if row["employrate"] <= 58.70:        return 0    elif row["employrate"] > 58.70:        return 1
data["employrategrp"] = data.apply (lambda row: employrategrp(row),axis=1)
print("Counts (Frequencies) for employrategrp") c3 = data["employrategrp"].value_counts(sort = False, dropna = True) print(c3)
# Management for relectricperperson to have more than 2 categories data["relectric"] = pd.cut(data.relectricperperson, bins=[0, 203.65, 597.14, 1491.15, 11154.76], labels=["0", "1", "2", "3"])
print("Counts (Frequencies) for relectric") c4 = data["relectric"].value_counts(sort = False, dropna = True) print(c4)
# Create new dataset sub1 = data.copy()
############################################################################### # MULTIPLE REGRESSION & CONFIDENCE INTERVALS ###############################################################################
# Regression model for the association between urbanisation & internet use rate print("OLS Regression Model for the association between urbanrategrp & internetuserate") reg1 = smf.ols("internetuserate ~ urbanrategrp", data=sub1).fit() print(reg1.summary())
# Adding incomeperperson as an explanatory variable   # center quantitative IVs for regression analysis sub1['incomeperpersongrp_c'] = (sub1['incomeperpersongrp'] - sub1['incomeperpersongrp'].mean()) print (sub1['incomeperpersongrp_c'].mean())
sub1['urbanrate_c'] = (sub1['urbanrate'] - sub1['urbanrate'].mean()) print (sub1['urbanrate_c'].mean())
# multiple regression analysis with urbanrate & incomeperperson reg2 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp', data=sub1).fit() print (reg2.summary())
#multiple regression analysis with urbanrate & incomeperperson plus employmentrate sub1['employrate_c'] = (sub1['employrate'] - sub1['employrate'].mean()) print (sub1['employrate_c'].mean())
reg3 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c', data=sub1).fit() print (reg3.summary())
############################################################################## # CATEGORICAL VARIABLES WITH 3+ CATEGORIES ##############################################################################
# adding 4 category relectricperpersongrp. Reference group coding is called "Treatment" coding in python # and the default reference catergory is the group with a value = 0 (Very low)
reg4 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c+ C(relectric)', data=sub1).fit() print (reg4.summary())
# can override the default ad specify a different reference group # low (1) as reference group
reg5 = smf.ols('internetuserate ~ urbanrategrp + incomeperpersongrp + employrate_c+ C(relectric, Treatment(reference=1))', data=sub1).fit() print (reg5.summary())
############################################################################## # LOGISTIC REGRESSION ##############################################################################
# logistic regression with income per person # For this analysalyis, i want to determine association between urbanization and income per person # urbanrate will be the response variable with two levels # Income per person will be my Independent variable which is also categorical with 2 levels
from scipy import stats stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
lreg1 = smf.logit(formula = 'urbanrategrp ~ incomeperpersongrp', data = sub1).fit() print (lreg1.summary())
# odds ratios print ("Odds Ratios") print (numpy.exp(lreg1.params))
# odd ratios with 95% confidence intervals params = lreg1.params conf = lreg1.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))
# logistic regression with income per person and  employment rate lreg2 = smf.logit(formula = 'urbanrategrp ~ incomeperpersongrp + employrategrp', data = sub1).fit() print (lreg2.summary())
# odd ratios with 95% confidence intervals params = lreg2.params conf = lreg2.conf_int() conf['OR'] = params conf.columns = ['Lower CI', 'Upper CI', 'OR'] print (numpy.exp(conf))
OUTPUT
Counts (Frequencies) for urbanrategrp 0.00    102 1.00    101 Name: urbanrategrp, dtype: int64 Counts (Frequencies) for incomeperpersongrp 0.00    95 1.00    95 Name: incomeperpersongrp, dtype: int64 Counts (Frequencies) for employrategrp 0.00    89 1.00    89 Name: employrategrp, dtype: int64 Counts (Frequencies) for relectric 0    29 1    34 2    34 3    34 Name: relectric, dtype: int64 OLS Regression Model for the association between urbanrategrp & internetuserate                            OLS Regression Results                             ============================================================================== Dep. Variable:        internetuserate   R-squared:                       0.305 Model:                            OLS   Adj. R-squared:                  0.301 Method:                 Least Squares   F-statistic:                     82.32 Date:                Sun, 27 May 2018   Prob (F-statistic):           1.55e-16 Time:                        11:13:37   Log-Likelihood:                -866.59 No. Observations:                 190   AIC:                             1737. Df Residuals:                     188   BIC:                             1744. Df Model:                           1                                         Covariance Type:            nonrobust                                         ================================================================================                   coef    std err          t      P>|t|      [0.025      0.975] -------------------------------------------------------------------------------- Intercept       20.5337      2.363      8.689      0.000      15.872      25.195 urbanrategrp    30.6460      3.378      9.073      0.000      23.983      37.309 ============================================================================== Omnibus:                        6.785   Durbin-Watson:                   1.933 Prob(Omnibus):                  0.034   Jarque-Bera (JB):                7.040 Skew:                           0.460   Prob(JB):                       0.0296 Kurtosis:                       2.790   Cond. No.                         2.59 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 0.0 1.8446109445594718e-14                            OLS Regression Results                             ============================================================================== Dep. Variable:        internetuserate   R-squared:                       0.530 Model:                            OLS   Adj. R-squared:                  0.524 Method:                 Least Squares   F-statistic:                     100.8 Date:                Sun, 27 May 2018   Prob (F-statistic):           4.78e-30 Time:                        11:13:38   Log-Likelihood:                -795.87 No. Observations:                 182   AIC:                             1598. Df Residuals:                     179   BIC:                             1607. Df Model:                           2                                         Covariance Type:            nonrobust                                         ======================================================================================                         coef    std err          t      P>|t|      [0.025      0.975] -------------------------------------------------------------------------------------- Intercept             13.1658      2.136      6.163      0.000       8.950      17.381 urbanrategrp          13.0717      3.601      3.630      0.000       5.966      20.177 incomeperpersongrp    31.4597      3.600      8.739      0.000      24.356      38.563 ============================================================================== Omnibus:                        0.300   Durbin-Watson:                   1.884 Prob(Omnibus):                  0.861   Jarque-Bera (JB):                0.457 Skew:                          -0.034   Prob(JB):                        0.796 Kurtosis:                       2.764   Cond. No.                         4.08 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. -2.674514791905995e-15                            OLS Regression Results                             ============================================================================== Dep. Variable:        internetuserate   R-squared:                       0.532 Model:                            OLS   Adj. R-squared:                  0.523 Method:                 Least Squares   F-statistic:                     60.56 Date:                Sun, 27 May 2018   Prob (F-statistic):           3.25e-26 Time:                        11:13:38   Log-Likelihood:                -715.81 No. Observations:                 164   AIC:                             1440. Df Residuals:                     160   BIC:                             1452. Df Model:                           3                                         Covariance Type:            nonrobust                                         ======================================================================================                         coef    std err          t      P>|t|      [0.025      0.975] -------------------------------------------------------------------------------------- Intercept             12.4532      2.299      5.417      0.000       7.913      16.993 urbanrategrp          15.6193      4.071      3.837      0.000       7.579      23.660 incomeperpersongrp    28.3856      4.003      7.091      0.000      20.480      36.292 employrate_c          -0.0230      0.152     -0.151      0.880      -0.324       0.278 ============================================================================== Omnibus:                        0.084   Durbin-Watson:                   1.833 Prob(Omnibus):                  0.959   Jarque-Bera (JB):                0.229 Skew:                          -0.013   Prob(JB):                        0.892 Kurtosis:                       2.819   Cond. No.                         35.5 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.                            OLS Regression Results                             ============================================================================== Dep. Variable:        internetuserate   R-squared:                       0.743 Model:                            OLS   Adj. R-squared:                  0.730 Method:                 Least Squares   F-statistic:                     57.30 Date:                Sun, 27 May 2018   Prob (F-statistic):           8.35e-33 Time:                        11:13:38   Log-Likelihood:                -511.27 No. Observations:                 126   AIC:                             1037. Df Residuals:                     119   BIC:                             1056. Df Model:                           6                                         Covariance Type:            nonrobust                                         ======================================================================================                         coef    std err          t      P>|t|      [0.025      0.975] -------------------------------------------------------------------------------------- Intercept              8.2848      2.905      2.852      0.005       2.533      14.037 C(relectric)[T.1]     11.7375      4.290      2.736      0.007       3.243      20.232 C(relectric)[T.2]     33.7211      5.122      6.584      0.000      23.579      43.863 C(relectric)[T.3]     52.0191      5.486      9.483      0.000      41.157      62.881 urbanrategrp           1.0812      3.899      0.277      0.782      -6.639       8.801 incomeperpersongrp    11.1245      4.214      2.640      0.009       2.781      19.468 employrate_c           0.2236      0.156      1.430      0.155      -0.086       0.533 ============================================================================== Omnibus:                        1.298   Durbin-Watson:                   1.803 Prob(Omnibus):                  0.522   Jarque-Bera (JB):                1.052 Skew:                          -0.222   Prob(JB):                        0.591 Kurtosis:                       3.050   Cond. No.                         58.6 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.                            OLS Regression Results                             ============================================================================== Dep. Variable:        internetuserate   R-squared:                       0.743 Model:                            OLS   Adj. R-squared:                  0.730 Method:                 Least Squares   F-statistic:                     57.30 Date:                Sun, 27 May 2018   Prob (F-statistic):           8.35e-33 Time:                        11:13:38   Log-Likelihood:                -511.27 No. Observations:                 126   AIC:                             1037. Df Residuals:                     119   BIC:                             1056. Df Model:                           6                                         Covariance Type:            nonrobust                                         =============================================================================================================                                                coef    std err          t      P>|t|      [0.025      0.975] ------------------------------------------------------------------------------------------------------------- Intercept                                    20.0223      3.178      6.301      0.000      13.730      26.315 C(relectric, Treatment(reference=1))[T.0]   -11.7375      4.290     -2.736      0.007     -20.232      -3.243 C(relectric, Treatment(reference=1))[T.2]    21.9836      3.786      5.806      0.000      14.486      29.481 C(relectric, Treatment(reference=1))[T.3]    40.2816      4.073      9.891      0.000      32.217      48.346 urbanrategrp                                  1.0812      3.899      0.277      0.782      -6.639       8.801 incomeperpersongrp                           11.1245      4.214      2.640      0.009       2.781      19.468 employrate_c                                  0.2236      0.156      1.430      0.155      -0.086       0.533 ============================================================================== Omnibus:                        1.298   Durbin-Watson:                   1.803 Prob(Omnibus):                  0.522   Jarque-Bera (JB):                1.052 Skew:                          -0.222   Prob(JB):                        0.591 Kurtosis:                       3.050   Cond. No.                         39.1 ==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. Optimization terminated successfully.         Current function value: 0.501584         Iterations 5                           Logit Regression Results                           ============================================================================== Dep. Variable:           urbanrategrp   No. Observations:                  189 Model:                          Logit   Df Residuals:                      187 Method:                           MLE   Df Model:                            1 Date:                Sun, 27 May 2018   Pseudo R-squ.:                  0.2762 Time:                        11:13:40   Log-Likelihood:                -94.799 converged:                       True   LL-Null:                       -130.98                                        LLR p-value:                 1.790e-17 ======================================================================================                         coef    std err          z      P>|z|      [0.025      0.975] -------------------------------------------------------------------------------------- Intercept             -1.4404      0.262     -5.495      0.000      -1.954      -0.927 incomeperpersongrp     2.7621      0.363      7.601      0.000       2.050       3.474 ====================================================================================== Odds Ratios Intercept             0.24 incomeperpersongrp   15.83 dtype: float64                    Lower CI  Upper CI    OR Intercept               0.14      0.40  0.24 incomeperpersongrp      7.77     32.28 15.83 Optimization terminated successfully.         Current function value: 0.452287         Iterations 6                           Logit Regression Results                           ============================================================================== Dep. Variable:           urbanrategrp   No. Observations:                  166 Model:                          Logit   Df Residuals:                      163 Method:                           MLE   Df Model:                            2 Date:                Sun, 27 May 2018   Pseudo R-squ.:                  0.3475 Time:                        11:13:40   Log-Likelihood:                -75.080 converged:                       True   LL-Null:                       -115.06                                        LLR p-value:                 4.322e-18 ======================================================================================                         coef    std err          z      P>|z|      [0.025      0.975] -------------------------------------------------------------------------------------- Intercept             -1.3836      0.368     -3.762      0.000      -2.104      -0.663 incomeperpersongrp     3.1580      0.418      7.559      0.000       2.339       3.977 employrategrp         -0.2710      0.417     -0.650      0.516      -1.089       0.547 ======================================================================================                    Lower CI  Upper CI    OR Intercept               0.12      0.52  0.25 incomeperpersongrp     10.37     53.34 23.52 employrategrp           0.34      1.73  0.76
COMMENTS
Internet use rate was significantly different between the 3 different electricity consumption groups (low, medium and high) in comparison to the default reference group (very low consumption) p = 0.007 beta = 11.7, p < 0.0001 beta = 33.7 and p < 0.0001 beta = 52.0 respectively.
After changing the reference group to the low consumption group, the very low, medium and high consumption groups were significant; p = 0.007 beta -11.7, p < 0.0001 beta = 22, p < 0.0001 beta = 40.3 respectively. The -11.7 beta for the very low consumption represents a negative association.
Urban rate was significantly associated with internet use rate p < 0.0001 beta = 15.6 but after introduction of electricity consumption the association was no longer significant p = 0.782 beta = 1. showing evidence of  confounding.
Income per person was significantly associated with internet use rate but employment rate wasn’t; p = 0.009 beta = 11.1 and p = 0.155 beta = 0.2 respectively. 
ODDS RRATIOS
After adjusting for the potential confounding factor employment rate, the odds  of internet use rate were 23.5 times higher for countries with high income than for those with low income (OR = 23.5, 95% CI = 10.37 - 53.34, p < 0001)
0 notes