Journal of my data science skills expansion with a boss kitty in the background
Logistic Regression
To prepare, I created a binary variable, suicidemedian, where 1 indicates a suicide rate greater than the median of 8.26 self-inflicted deaths per 100,000 people in a country (and 0 otherwise); I did this in Excel. I wanted a binary outcome so I could explore what contributes to higher-than-median suicide rates. I originally tested polity score groups (anocracy, democracy, and autocracy) to see whether regime type had an effect, but all of their estimated coefficients had p-values greater than 0.05, implying no statistical association. This did not change even when employment rate was introduced. So I shifted my primary focus to urban rate and internet use rate, testing whether they are statistically associated with higher-than-median suicide rates; eventually I settled on urban rate as my primary explanatory variable of interest.
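The Excel step above could equally be done in pandas. A minimal sketch (the raw-rate column name 'suicideper100th' and the toy values are my assumptions for illustration; only the 8.26 median comes from the text):

```python
import pandas

# Hypothetical raw data; in the real dataset this column holds suicide
# deaths per 100,000 people for each country.
data = pandas.DataFrame({'suicideper100th': [3.1, 8.26, 12.5, 20.0, 5.0]})

median_rate = 8.26
# 1 = strictly above the median rate, 0 = at or below it
data['suicidemedian'] = (data['suicideper100th'] > median_rate).astype(int)
print(data['suicidemedian'].tolist())  # [0, 0, 1, 1, 0]
```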
After adjusting for potentially confounding factors (alcohol consumption, internet use rate, and employment rate), a one percent increase in the share of the population living in cities is associated with lower odds of a country having a higher-than-median suicide rate on average (OR = 0.958, 95% CI: 0.936 to 0.982, p = 0.001). This negative association and its statistical significance did not change as new variables were introduced, which suggests the other variables do not confound this relationship; there is, however, evidence of confounding for internet use rate, whose association weakened once alcohol consumption entered the model. Employment rate and internet use rate were not significantly associated with the odds of a higher-than-median suicide rate. In the final model, alcohol consumption was significantly associated with higher odds of a higher-than-median suicide rate (OR = 1.17, 95% CI: 1.06 to 1.29, p = 0.001). In other words, each additional liter of alcohol consumed per capita (population aged 15 and older) is associated with roughly 17% higher odds of a higher-than-median suicide rate.
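As a sanity check on that interpretation, an odds ratio is just the exponentiated logit coefficient. Using the alcconsumption coefficient (0.1606) reported in the output below:

```python
import numpy

# Convert a log-odds coefficient into an odds ratio and a percent change
coef = 0.1606                     # alcconsumption coefficient from the final model
odds_ratio = numpy.exp(coef)      # multiplicative change in odds per extra liter
pct_change = (odds_ratio - 1) * 100
print(round(odds_ratio, 3), round(pct_change, 1))  # roughly 1.174 and 17.4
```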
One way to make this analysis more rigorous would be to include additional variables, which the dataset would allow.
The output:
All eight models below are logistic regressions (Logit, fit by MLE) of suicidemedian on 152 observations, run Sun, 03 Dec 2017.

Model 1: suicidemedian ~ C(polityscoreg, Treatment(reference=3))
Optimization terminated successfully. Current function value: 0.688623, Iterations: 4
Df Residuals: 149 | Df Model: 2 | Pseudo R-squ.: 0.004537 | Log-Likelihood: -104.67 | LL-Null: -105.15 | LLR p-value: 0.6206

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept               0.1866     0.217    0.861    0.389   -0.238    0.611
polityscoreg [T.1.0]   -0.4743     0.491   -0.965    0.334   -1.437    0.489
polityscoreg [T.2.0]   -0.0531     0.369   -0.144    0.886   -0.776    0.670

Odds ratios (95% CI):
Intercept               1.205128  (0.788241, 1.842500)
polityscoreg [T.1.0]    0.622340  (0.237599, 1.630090)
polityscoreg [T.2.0]    0.948328  (0.460062, 1.954794)

Model 2: suicidemedian ~ C(polityscoreg, Treatment(reference=3)) + employrate
Optimization terminated successfully. Current function value: 0.679398, Iterations: 4
Df Residuals: 148 | Df Model: 3 | Pseudo R-squ.: 0.01787 | Log-Likelihood: -103.27 | LL-Null: -105.15 | LLR p-value: 0.2887

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept              -1.3885     0.976   -1.423    0.155   -3.301    0.524
polityscoreg [T.1.0]   -0.5399     0.498   -1.085    0.278   -1.515    0.435
polityscoreg [T.2.0]   -0.1797     0.381   -0.471    0.637   -0.927    0.568
employrate              0.0274     0.017    1.653    0.098   -0.005    0.060

Odds ratios (95% CI):
Intercept               0.249457  (0.036835, 1.689407)
polityscoreg [T.1.0]    0.582802  (0.219772, 1.545505)
polityscoreg [T.2.0]    0.835525  (0.395775, 1.763889)
employrate              1.027761  (0.994934, 1.061672)

Model 3: suicidemedian ~ urbanrate
Optimization terminated successfully. Current function value: 0.677847, Iterations: 4
Df Residuals: 150 | Df Model: 1 | Pseudo R-squ.: 0.02011 | Log-Likelihood: -103.03 | LL-Null: -105.15 | LLR p-value: 0.03972

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept               0.9310     0.441    2.111    0.035    0.066    1.796
urbanrate              -0.0149     0.007   -2.027    0.043   -0.029   -0.000

Odds ratios (95% CI):
Intercept               2.537141  (1.068734, 6.023096)
urbanrate               0.985201  (0.971097, 0.999509)

Model 4: suicidemedian ~ urbanrate + C(polityscoreg, Treatment(reference=3))
Optimization terminated successfully. Current function value: 0.673595, Iterations: 4
Df Residuals: 148 | Df Model: 3 | Pseudo R-squ.: 0.02626 | Log-Likelihood: -102.39 | LL-Null: -105.15 | LLR p-value: 0.1373

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept               1.1798     0.526    2.241    0.025    0.148    2.212
polityscoreg [T.1.0]   -0.4857     0.500   -0.972    0.331   -1.465    0.494
polityscoreg [T.2.0]   -0.3191     0.397   -0.804    0.422   -1.097    0.459
urbanrate              -0.0165     0.008   -2.095    0.036   -0.032   -0.001

Odds ratios (95% CI):
Intercept               3.253662  (1.159579, 9.129449)
polityscoreg [T.1.0]    0.615269  (0.231042, 1.638478)
polityscoreg [T.2.0]    0.726786  (0.333776, 1.582551)
urbanrate               0.983646  (0.968590, 0.998936)

Model 5: suicidemedian ~ internetuserate
Optimization terminated successfully. Current function value: 0.685060, Iterations: 4
Df Residuals: 150 | Df Model: 1 | Pseudo R-squ.: 0.009688 | Log-Likelihood: -104.13 | LL-Null: -105.15 | LLR p-value: 0.1535

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept              -0.1691     0.252   -0.671    0.503   -0.663    0.325
internetuserate         0.0085     0.006    1.415    0.157   -0.003    0.020

Odds ratios (95% CI):
Intercept               0.844420  (0.515085, 1.384324)
internetuserate         1.008544  (0.996733, 1.020495)

Model 6: suicidemedian ~ urbanrate + internetuserate
Optimization terminated successfully. Current function value: 0.624638, Iterations: 5
Df Residuals: 149 | Df Model: 2 | Pseudo R-squ.: 0.09703 | Log-Likelihood: -94.945 | LL-Null: -105.15 | LLR p-value: 3.707e-05

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept               1.4958     0.493    3.035    0.002    0.530    2.462
urbanrate              -0.0460     0.012   -3.953    0.000   -0.069   -0.023
internetuserate         0.0355     0.009    3.747    0.000    0.017    0.054

Odds ratios (95% CI):
Intercept               4.462685  (1.698807, 11.723264)
urbanrate               0.955025  (0.933483, 0.977063)
internetuserate         1.036187  (1.017099, 1.055633)

Model 7: suicidemedian ~ urbanrate + internetuserate + employrate
Optimization terminated successfully. Current function value: 0.621730, Iterations: 5
Df Residuals: 148 | Df Model: 3 | Pseudo R-squ.: 0.1012 | Log-Likelihood: -94.503 | LL-Null: -105.15 | LLR p-value: 9.166e-05

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept               0.3630     1.299    0.279    0.780   -2.184    2.910
urbanrate              -0.0436     0.012   -3.656    0.000   -0.067   -0.020
internetuserate         0.0356     0.010    3.736    0.000    0.017    0.054
employrate              0.0168     0.018    0.935    0.350   -0.018    0.052

Odds ratios (95% CI):
Intercept               1.437672  (0.112636, 18.350232)
urbanrate               0.957384  (0.935288, 0.980001)
internetuserate         1.036216  (1.017057, 1.055736)
employrate              1.016967  (0.981744, 1.053452)

Model 8 (final): suicidemedian ~ urbanrate + internetuserate + employrate + alcconsumption
Optimization terminated successfully. Current function value: 0.579231, Iterations: 6
Df Residuals: 147 | Df Model: 4 | Pseudo R-squ.: 0.1627 | Log-Likelihood: -88.043 | LL-Null: -105.15 | LLR p-value: 6.751e-07

                          coef   std err        z    P>|z|   [0.025   0.975]
Intercept              -0.5781     1.399   -0.413    0.679   -3.320    2.164
urbanrate              -0.0423     0.012   -3.423    0.001   -0.066   -0.018
internetuserate         0.0194     0.011    1.824    0.068   -0.001    0.040
employrate              0.0226     0.019    1.162    0.245   -0.016    0.061
alcconsumption          0.1606     0.048    3.312    0.001    0.066    0.256

Odds ratios (95% CI):
Intercept               0.560946  (0.036148, 8.704849)
urbanrate               0.958626  (0.935710, 0.982104)
internetuserate         1.019562  (0.998556, 1.041010)
employrate              1.022821  (0.984610, 1.062516)
alcconsumption          1.174176  (1.067738, 1.291226)
The code:
"""
@author: Tofu
"""
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf
data = pandas.read_csv('gapminder3.csv', low_memory=False)
data['suicidemedian'] = pandas.to_numeric(data['suicidemedian'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
#REPLACING MISSING VALUES
data['internetuserate'] = data['internetuserate'].replace(" ", numpy.nan)
data['suicidemedian'] = data['suicidemedian'].replace("", numpy.nan)
data['alcconsumption'] = data['alcconsumption'].replace(99, numpy.nan)
data['armedforcesrate'] = data['armedforcesrate'].replace(" ", numpy.nan)
data['urbanrate'] = data['urbanrate'].replace(" ", numpy.nan)
data['employrate'] = data['employrate'].replace(" ", numpy.nan)
data['polityscoreg'] = data['polityscoreg'].replace(99, numpy.nan)
sub1 = data[['suicidemedian', 'polityscoreg', 'alcconsumption', 'urbanrate', 'employrate', 'internetuserate']].dropna()
#recoding polity score
# logistic regression with polity score
lreg1 = smf.logit(formula='suicidemedian ~ C(polityscoreg, Treatment(reference=3))', data=sub1).fit()
print(lreg1.summary())

# odds ratios
print("Odds Ratios")
print(numpy.exp(lreg1.params))

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with polity score and employment rate
lreg2 = smf.logit(formula='suicidemedian ~ C(polityscoreg, Treatment(reference=3)) + employrate', data=sub1).fit()
print(lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with urban rate
lreg3 = smf.logit(formula='suicidemedian ~ urbanrate', data=sub1).fit()
print(lreg3.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with urban rate and polity score
lreg4 = smf.logit(formula='suicidemedian ~ urbanrate + C(polityscoreg, Treatment(reference=3))', data=sub1).fit()
print(lreg4.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg4.params
conf = lreg4.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with internet use rate
lreg5 = smf.logit(formula='suicidemedian ~ internetuserate', data=sub1).fit()
print(lreg5.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg5.params
conf = lreg5.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with urban rate and internet use rate
lreg6 = smf.logit(formula='suicidemedian ~ urbanrate + internetuserate', data=sub1).fit()
print(lreg6.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg6.params
conf = lreg6.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with urban rate, internet use rate, and employment rate
lreg7 = smf.logit(formula='suicidemedian ~ urbanrate + internetuserate + employrate', data=sub1).fit()
print(lreg7.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg7.params
conf = lreg7.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with urban rate, internet use rate, employment rate, and alcohol consumption
lreg8 = smf.logit(formula='suicidemedian ~ urbanrate + internetuserate + employrate + alcconsumption', data=sub1).fit()
print(lreg8.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg8.params
conf = lreg8.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
Multiple Regression
I ran a multiple regression model for life expectancy that includes polity score, polity score squared, alcohol consumption, and HIV rate. My previous hypothesis was that a higher polity score (i.e., a more democratic regime) would have a positive impact. This did not present itself clearly in previous analysis, where anocracies (-5 to 5) had lower life expectancy than autocracies and democracies. Looking at a scatter plot of life expectancy versus polity score, the relationship appears more curvilinear, though there is a lot of variation around the fitted lines.
I presumed that a higher HIV rate would be associated with lower life expectancy and that alcohol consumption would also have a negative relationship with life expectancy. I centered the explanatory variables, and their post-centering means were very close to zero.
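The centering step is just subtracting the mean, so the centered variable's mean is (numerically) zero. A tiny sketch with made-up values:

```python
import pandas

# Center a variable by subtracting its mean; the centered mean should be
# zero up to floating-point error, which is why the printouts below are
# tiny numbers like 2e-15 rather than exactly 0.
s = pandas.Series([2.0, 4.0, 9.0])
s_c = s - s.mean()
print(abs(s_c.mean()) < 1e-12)  # True
```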
The model estimated:
Life Expectancy = Intercept + b1*polity score + b2* (polity score squared) + b3*HIV rate + b4* alcohol consumption
Predicted:
Life Expectancy = 62.8 + 1.3*polity score + 0.2*polity score squared – 1.1*HIV rate + 0.2* alcohol consumption
All coefficients except alcohol consumption's were statistically significant; in other words, p-values less than 0.05 signal that we can reject the null hypothesis of no association (n=139). For polity score, a one-point increase in a country's polity score is associated with a 1.3-year increase in life expectancy on average; the squared term contributes a further 0.2 years per unit, which produces the curvilinear shape. An increase of one person living with HIV per 100 people aged 15 to 49 is associated with a decrease of 1.1 years in life expectancy on average, holding everything else constant.
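A hypothetical worked example of plugging values into the fitted equation above: predicted life expectancy for a country one point above the mean polity score, with mean (i.e., centered to zero) HIV rate and alcohol consumption.

```python
# Coefficients from the fitted model above (rounded as in the text)
b0, b1, b2, b3, b4 = 62.8, 1.3, 0.2, -1.1, 0.2

# Centered predictor values: +1 polity point, average HIV rate and alcohol use
polity_c, hiv_c, alc_c = 1.0, 0.0, 0.0

pred = b0 + b1 * polity_c + b2 * polity_c**2 + b3 * hiv_c + b4 * alc_c
print(round(pred, 1))  # 64.3
```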
The fact that the estimated coefficient on polity score remained positive and significant as additional variables were introduced does not signal a confounding variable. However, if I were being truly rigorous, I would try to incorporate a country's separate democracy and autocracy scores instead of the combined polity score. Still, this exercise shows a positive association. Yet the large amount of variation in the previously mentioned scatter plot makes me concerned about the normality of the residuals, which needs to be checked.
When I plot a quantile-quantile chart of the regression's residuals, I see deviations from the line, which signals the errors are not normal. When we standardize the residuals and plot them, we also see a few extreme outliers, which again signals this model is a poor fit. Looking at the regression plots for HIV rate, I see inconsistent variation, which can also signal heteroscedasticity (the variance of the residuals changes with the value of the explanatory variable).
A leverage plot shows that there are outliers, though all outliers have influence less than 0.05, and any observations with high leverage are within two standard deviations. Overall, this model may show that life expectancy has a positive curvilinear association with polity score, but the model needs additional factors.
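As a numeric follow-up to eyeballing the leverage plot, statsmodels can report influence measures directly. A sketch on toy data (the real analysis would call get_influence() on reg3 instead):

```python
import numpy
import pandas
import statsmodels.formula.api as smf

# Toy data standing in for the real gapminder subset
rng = numpy.random.RandomState(0)
df = pandas.DataFrame({'x': rng.normal(size=50)})
df['y'] = 2 + 3 * df['x'] + rng.normal(size=50)

fit = smf.ols('y ~ x', data=df).fit()
infl = fit.get_influence()

# Cook's distance per observation; values well under 1 suggest no single
# point is driving the fit.
cooks_d = infl.cooks_distance[0]
print(cooks_d.max() < 1)  # True
```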
The Output:
Post-centering means (polityscore_c, hivrate_c, alcconsumption_c):
2.0447272971513674e-15
-2.444088097376244e-16
1.0031943301648897e-15
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.148
Model: OLS Adj. R-squared: 0.141
Method: Least Squares F-statistic: 23.71
Date: Sun, 26 Nov 2017 Prob (F-statistic): 3.05e-06
Time: 21:19:13 Log-Likelihood: -509.27
No. Observations: 139 AIC: 1023.
Df Residuals: 137 BIC: 1028.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 68.4358 0.806 84.860 0.000 66.841 70.031
polityscore_c 0.6632 0.136 4.869 0.000 0.394 0.932
==============================================================================
Omnibus: 8.855 Durbin-Watson: 1.335
Prob(Omnibus): 0.012 Jarque-Bera (JB): 9.015
Skew: -0.585 Prob(JB): 0.0110
Kurtosis: 2.568 Cond. No. 5.92
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.394
Model: OLS Adj. R-squared: 0.385
Method: Least Squares F-statistic: 44.13
Date: Sun, 26 Nov 2017 Prob (F-statistic): 1.69e-15
Time: 21:19:14 Log-Likelihood: -485.60
No. Observations: 139 AIC: 977.2
Df Residuals: 136 BIC: 986.0
Df Model: 2
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 61.9192 1.112 55.703 0.000 59.721 64.117
polityscore_c 1.6083 0.172 9.367 0.000 1.269 1.948
I(polityscore_c ** 2) 0.1859 0.025 7.428 0.000 0.136 0.235
==============================================================================
Omnibus: 13.030 Durbin-Watson: 1.461
Prob(Omnibus): 0.001 Jarque-Bera (JB): 13.896
Skew: -0.691 Prob(JB): 0.000961
Kurtosis: 3.701 Cond. No. 87.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.647
Model: OLS Adj. R-squared: 0.637
Method: Least Squares F-statistic: 61.49
Date: Sun, 26 Nov 2017 Prob (F-statistic): 2.10e-29
Time: 21:19:14 Log-Likelihood: -447.93
No. Observations: 139 AIC: 905.9
Df Residuals: 134 BIC: 920.5
Df Model: 4
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 62.8458 0.874 71.904 0.000 61.117 64.575
polityscore_c 1.3346 0.147 9.060 0.000 1.043 1.626
I(polityscore_c ** 2) 0.1594 0.020 7.995 0.000 0.120 0.199
hivrate_c -1.1382 0.118 -9.632 0.000 -1.372 -0.904
alcconsumption_c 0.2065 0.114 1.806 0.073 -0.020 0.433
==============================================================================
Omnibus: 8.109 Durbin-Watson: 1.705
Prob(Omnibus): 0.017 Jarque-Bera (JB): 9.125
Skew: -0.409 Prob(JB): 0.0104
Kurtosis: 3.952 Cond. No. 90.1
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The code:
"""
Created on Sun Nov 26 20:06:23 2017

@author: Tofu
"""
import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)
data = pandas.read_csv('gapminder2.csv')
# convert to numeric format
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'],errors='coerce')
data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce')
#converting values into N/A
data['lifeexpectancy']=data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscore']=data['polityscore'].replace(11, numpy.nan)
data['alcconsumption']=data['alcconsumption'].replace(99,numpy.nan)
data['hivrate']=data['hivrate'].replace(99,numpy.nan)
#creating subset
sub1 = data[['lifeexpectancy', 'polityscore', 'alcconsumption', 'hivrate']].dropna()
# first order (linear) scatterplot
scat1 = seaborn.regplot(x="polityscore", y="lifeexpectancy", scatter=True, data=sub1)
plt.xlabel('Polity Score')
plt.ylabel('Life Expectancy')
# fit second order polynomial
# run the 2 scatterplots together to get both linear and second order fit lines
scat1 = seaborn.regplot(x="polityscore", y="lifeexpectancy", scatter=True, order=2, data=sub1)
plt.xlabel('Polity Score')
plt.ylabel('Life Expectancy')
# based on the scatter plots, a curvilinear relationship makes more sense than a linear one
#Center those variables
sub1['polityscore_c'] = (sub1['polityscore'] - sub1['polityscore'].mean())
print (sub1['polityscore_c'].mean())
sub1['hivrate_c'] = (sub1['hivrate'] - sub1['hivrate'].mean())
print (sub1['hivrate_c'].mean())
sub1['alcconsumption_c'] = (sub1['alcconsumption'] - sub1['alcconsumption'].mean())
print (sub1['alcconsumption_c'].mean())
sub1[["polityscore_c", "hivrate_c","alcconsumption_c"]].describe()
# linear regression analysis
reg1 = smf.ols('lifeexpectancy ~ polityscore_c', data=sub1).fit()
print (reg1.summary())
# quadratic (polynomial) regression analysis
# run following line of code if you get PatsyError 'ImaginaryUnit' object is not callable
reg2 = smf.ols('lifeexpectancy ~ polityscore_c + I(polityscore_c**2)', data=sub1).fit()
print (reg2.summary())
####################################################################################
# EVALUATING MODEL FIT
####################################################################################
# adding other variable
reg3 = smf.ols('lifeexpectancy ~ polityscore_c + I(polityscore_c**2) + hivrate_c + alcconsumption_c', data=sub1).fit()
print (reg3.summary())
# Q-Q plot for normality of residuals
fig1=sm.qqplot(reg3.resid, line='r')
plt.show(fig1)
# simple plot of residuals
stdres=pandas.DataFrame(reg3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
# there are some extreme outliers, so evidence that the model fits poorly
# additional regression diagnostic plots
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg3, 'hivrate_c', fig=fig2)
plt.show(fig2)
# leverage plot
fig3=sm.graphics.influence_plot(reg3, size=8)
plt.show(fig3)
Basic Regression Time
For this exercise, I decided to test whether the presence of residential electricity data has an association with higher life expectancy. I created this variable in an earlier exercise: 0 means no residential electricity data and 1 means electricity data is present. I would expect that, if there is enough residential electricity to track, that is an indicator of infrastructure that could improve quality of life.
When I did a basic regression, predicted life expectancy was estimated as:
Predicted life expectancy = 65.5 + 6.3*relectric
This implies that the presence of residential electricity data is associated with a 6.3-year increase, on average, in life expectancy, holding all else constant. The coefficient has a p-value less than 0.001 (t=4.36, n=191). So this does suggest the presence of residential electricity data has a statistically significant, positive association with life expectancy. The model had an F-statistic of 19.01 (n=191, p<0.001); however, the R-squared was only 0.091. In other words, the explanatory variable explains only about 9% of the variation in life expectancy.
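With a single 0/1 predictor like relectric, the OLS slope is exactly the difference between the two group means, and the intercept is the 0-group mean. A sketch on toy data (the values are made up; the real numbers are in the output below):

```python
import pandas
import statsmodels.formula.api as smf

# Toy data: three countries without electricity data (0), three with (1)
df = pandas.DataFrame({'relectric': [0, 0, 0, 1, 1, 1],
                       'lifeexpectancy': [60.0, 65.0, 70.0, 70.0, 72.0, 74.0]})

fit = smf.ols('lifeexpectancy ~ relectric', data=df).fit()
group_means = df.groupby('relectric')['lifeexpectancy'].mean()

# slope == mean(group 1) - mean(group 0)
print(round(fit.params['relectric'], 6), round(group_means[1] - group_means[0], 6))
```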
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.091
Model: OLS Adj. R-squared: 0.087
Method: Least Squares F-statistic: 19.01
Date: Sun, 19 Nov 2017 Prob (F-statistic): 2.13e-05
Time: 20:56:50 Log-Likelihood: -695.51
No. Observations: 191 AIC: 1395.
Df Residuals: 189 BIC: 1402.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 65.4802 1.188 55.117 0.000 63.137 67.824
relectric 6.2786 1.440 4.360 0.000 3.438 9.119
==============================================================================
Omnibus: 16.588 Durbin-Watson: 1.456
Prob(Omnibus): 0.000 Jarque-Bera (JB): 18.275
Skew: -0.729 Prob(JB): 0.000108
Kurtosis: 2.586 Cond. No. 3.30
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Mean by relectric group:
             alcconsumption  co2emissions      hivrate  lifeexpectancy  polityscore  relectricperperson  urbanrate  polityscoreg  urbanratem
relectric 0       26.499836  1.667740e+08    45.900984       65.480164     5.606557           -0.934426  43.392069     41.950820    0.275862
relectric 1        7.986538  7.619242e+09    14.158077       71.758715     4.484615         1213.235473  61.527846      6.892308    0.584615

Standard deviation by relectric group:
             alcconsumption  co2emissions      hivrate  lifeexpectancy  polityscore  relectricperperson  urbanrate  polityscoreg  urbanratem
relectric 0       40.040633  8.069170e+08    47.922677       10.567628     6.159758            0.249590  23.351088     47.937955    0.450851
relectric 1        9.564578  3.162933e+10    33.203256        8.613916     6.478629         1703.060442  21.044561     20.352798    0.494695

Traceback (most recent call last):
File "<ipython-input-5-f58e05b93ef4>", line 1, in <module> runfile('C:/Users/Tofu/Python Project Folder/Basic Regression HW.py', wdir='C:/Users/Tofu/Python Project Folder')
File "C:\Users\Tofu\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile execfile(filename, namespace)
File "C:\Users\Tofu\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Tofu/Python Project Folder/Basic Regression HW.py", line 41, in <module> plt.xlabel('Presence of Residential Electricity Data')
NameError: name 'plt' is not defined
the code:
"""
@author: Tofu
"""
import numpy
import pandas
import matplotlib.pyplot as plt  # was missing in the original run, causing the NameError above
import statsmodels.api
import statsmodels.formula.api as smf
import seaborn
data = pandas.read_csv('gapminder2.csv', low_memory=False)
data['urbanratem'] = pandas.to_numeric(data['urbanratem'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
sub1=data[(data['lifeexpectancy']>0)]
data["lifeexpectancy"] = data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscoreg'] = data['polityscoreg'].replace(99, numpy.nan)
data['urbanratem'] = data['urbanratem'].replace(' ', numpy.nan)
# impact of residential electricity
reg1 = smf.ols('lifeexpectancy ~ relectric', data=sub1).fit()
print(reg1.summary())
# group means & sd
print("Mean")
ds1 = sub1.groupby('relectric').mean()
print(ds1)
print("Standard deviation")
ds2 = sub1.groupby('relectric').std()
print(ds2)
# bivariate bar graph
scat1 = seaborn.factorplot(x="relectric", y="lifeexpectancy", data=sub1, kind="bar", ci=None)
plt.xlabel('Presence of Residential Electricity Data')
plt.ylabel('Mean Life Expectancy')
print(scat1)
Writing about the Data - Gapminder
Sample:
The Gapminder dataset covers multiple variables for 213 countries across the globe. The explanatory variable in my analysis is the polity score, which classifies a country's political regime on a 21-point scale, from -10 to 10, denoting relative levels of democracy and/or autocracy. The countries break down into democracies (polity score over 5, n=90, 42.5%), anocracies (polity score between -5 and 5, n=48, 22.5%), autocracies (polity score less than -5, n=23, 10.8%), and unclassified (no polity score, n=52). Life expectancy is the dependent variable in my analysis; 22 countries lacked life expectancy values, though most countries without a life expectancy also lacked a polity score.
Procedures:
Polity score data is collected by researchers who assess each country's regime in terms of how democratic and autocratic it is. The factors assessed and weighted into the democracy index (an 11-point scale, 0 to 10) are how effectively citizens can express preferences about alternative policies and leaders, the institutional constraints on the executive's exercise of power, and the guarantee of civil liberties to all citizens in everyday life and political participation. The autocracy score (an 11-point scale, -10 to 0) weighs how far competitive political participation is restricted or suppressed, how political elites regularize the selection of the chief executive, and what institutional constraints exist on executive power. Since the 2000s, polity scores have undergone regular annual updates and tests for variance across assessors. The polity score used here reflects the 2009 assessment.
Life expectancy sources include The Human Mortality Database, World Population Prospects, The Human Life-Table Database, and the research of James C. Riley. While life expectancies across countries draw on various sources, they try to collect the following for each country:
- births (annual counts of live births by sex),
- deaths (counts with most details possible; if no data, estimated death counts by completed age),
- population size as of January 1st (if not, then derived from census data, births, and death),
- exposure to risk (estimates of the population exposed to the risk of death during some age-time interval for populations measured on January 1st)
- death rates: ratio of the death count for a given age-time interval divided by an estimate of the exposure-to-risk in the same interval.
The citations at the bottom have more information for those curious.
Measures
The life expectancies within the Gapminder dataset can be considered the expected number of years a newborn child born in 2011 would live, assuming current mortality patterns. While it is not a hundred percent clear from the exact sources of data, the general takeaway is that they try to utilize the available country census data. The polity score combines the democracy and autocracy scores, ranging from -10 (fully autocratic) to 10 (fully democratic), using observed data on a country’s regime.
Life expectancy is a continuous variable while polity score is a categorical variable; both are surveillance data that incorporate multiple data sources into their respective variables. Both variables can be accessed by downloading the Gapminder dataset (www.gapminder.org). For my analysis, I recoded polity score into either two or three groups. In the three-group version, it was broken into autocracies (-10 to -6), anocracies (-5 to 5), and democracies (6 to 10). In the two-group version, it was just democracies (6 to 10) and non-democracies (-10 to 5).
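The two- and three-group recodes described above can be sketched with pandas.cut. This is a sketch on hypothetical scores, not the Gapminder file itself; note the lower bin edge of -11 so that a score of exactly -10 is kept (pandas.cut uses left-open intervals by default).

```python
import pandas

# Hypothetical polity scores for illustration (not the Gapminder data itself)
df = pandas.DataFrame({'polityscore': [-10, -7, -5, 0, 5, 6, 10]})

# Three groups: autocracy (-10 to -6), anocracy (-5 to 5), democracy (6 to 10)
df['politygroup3'] = pandas.cut(df['polityscore'], [-11, -6, 5, 10],
                                labels=['autocracy', 'anocracy', 'democracy'])

# Two groups: non-democracy (-10 to 5) vs democracy (6 to 10)
df['politygroup2'] = pandas.cut(df['polityscore'], [-11, 5, 10],
                                labels=['non-democracy', 'democracy'])

print(df)
```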
Sources:
Center for Systemic Peace. “Polity IV Projects: Political Regime Characteristics and Transitions, 1800 – 2016 Dataset Users’ Manual.” http://www.systemicpeace.org/inscr/p4manualv2016.pdf
The Human Mortality Database. “Overview”. http://www.mortality.org/Public/Overview.php
World Population Prospects. “Data Sources.pdf”. https://esa.un.org/unpd/wpp/Download/Other/Documentation/
The Human Life-Table Database, Shkolnikov, V.M. “Methodology Notes on the Human Life-Table Database (HLD)”. http://www.lifetable.de/methodology.pdf
0 notes
Text
Urban Rate Moderator
For the assignment, I seek to answer whether there is an association between higher-than-median versus lower-than-median urban rate (58%) when it comes to the relationship between life expectancy and polity score group. [I created two variables, urbanratem and polityscoreg, within Excel.]
As we saw in the first week of this course, there is a significant relationship between life expectancy and polity score groupings (autocracy, anocracy, and democracy). However, the difference was driven by anocracies (scores between -5 and 5) compared to the other groupings.
The F-statistic derived for the group with urban rate at or above the median (58%) was 4.012 (df=2, 77 obs, p-value=0.022), which indicates a significant relationship between polity score groups and life expectancy. We find this also in the countries with lower-than-median urban rate, where we yield an F-statistic of 7.995 (df=2, 83 obs, p-value=0.00068).
The average life expectancy follows the same trend regardless of urban rate group (i.e., anocracies have the lowest life expectancy on average), which suggests that urban rate is not a moderator of the relationship between life expectancy and polity score.
The Code:
"""
@author: Tofu
"""
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder2.csv', low_memory=False)
data['urbanratem'] = pandas.to_numeric(data['urbanratem'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscoreg']=data['polityscoreg'].replace(99, numpy.nan)
data['urbanratem']=data['urbanratem'].replace(' ',numpy.nan)
model1 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=data).fit()
print (model1.summary())
sub1 = data[['polityscoreg', 'lifeexpectancy','urbanrate']].dropna()
print ("means for life expectancy by polity score group")
m1= sub1.groupby('polityscoreg').mean()
print (m1)
print ("standard deviation for mean life expectancy by polity score group")
st1= sub1.groupby('polityscoreg').std()
print (st1)
# bivariate bar graph
seaborn.factorplot(x="polityscoreg", y="lifeexpectancy", data=data, kind="bar", ci=None)
plt.xlabel('Polity Score')
plt.ylabel('Mean Life Expectancy')
sub2=data[(data['urbanratem']==1)]
sub3=data[(data['urbanratem']==0)]
print ('association between life expectancy and polity score for those in median or higher urban rates')
model2 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=sub2).fit()
print (model2.summary())
print ('association between life expectancy and polity score for those with less than median urban rates')
model3 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=sub3).fit()
print (model3.summary())
print ("means for life expectancy by polityscore for median or higher urban rates")
m3= sub2.groupby('polityscoreg').mean()
print (m3)
print ("Means for life expectancy by polityscore for less than median urban rates")
m4 = sub3.groupby('polityscoreg').mean()
print (m4)
-------The Results
runfile('C:/Users/Tofu/Python Project Folder/Week 8.py', wdir='C:/Users/Tofu/Python Project Folder')
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.215
Model: OLS Adj. R-squared: 0.205
Method: Least Squares F-statistic: 21.53
Date: Sun, 15 Oct 2017 Prob (F-statistic): 5.45e-09
Time: 18:38:43 Log-Likelihood: -576.38
No. Observations: 160 AIC: 1159.
Df Residuals: 157 BIC: 1168.
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 70.7910 1.869 37.885 0.000 67.100 74.482
C(polityscoreg)[T.2.0] -9.2970 2.273 -4.091 0.000 -13.786 -4.808
C(polityscoreg)[T.3.0] 1.0312 2.096 0.492 0.623 -3.109 5.172
==============================================================================
Omnibus: 14.930 Durbin-Watson: 1.307
Prob(Omnibus): 0.001 Jarque-Bera (JB): 16.562
Skew: -0.781 Prob(JB): 0.000253
Kurtosis: 3.210 Cond. No. 5.70
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for life expectancy by polity score group
lifeexpectancy urbanrate
polityscoreg
1.0 70.791000 59.036522
2.0 61.493958 44.733750
3.0 71.822247 59.447416
standard deviation for mean life expectancy by polity score group
lifeexpectancy urbanrate
polityscoreg
1.0 6.585543 23.141218
2.0 8.988540 21.779372
3.0 9.448791 21.267754
association between life expectancy and polity score for those in median or higher urban rates
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.098
Model: OLS Adj. R-squared: 0.073
Method: Least Squares F-statistic: 4.012
Date: Sun, 15 Oct 2017 Prob (F-statistic): 0.0222
Time: 18:38:43 Log-Likelihood: -250.43
No. Observations: 77 AIC: 506.9
Df Residuals: 74 BIC: 513.9
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 74.3214 1.924 38.631 0.000 70.488 78.155
C(polityscoreg)[T.2.0] -4.1269 2.664 -1.549 0.126 -9.434 1.180
C(polityscoreg)[T.3.0] 1.6142 2.111 0.765 0.447 -2.592 5.820
==============================================================================
Omnibus: 37.388 Durbin-Watson: 1.635
Prob(Omnibus): 0.000 Jarque-Bera (JB): 82.561
Skew: -1.728 Prob(JB): 1.18e-18
Kurtosis: 6.714 Cond. No. 6.15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between life expectancy and polity score for those with less than median urban rates
OLS Regression Results
==============================================================================
Dep. Variable: lifeexpectancy R-squared: 0.167
Model: OLS Adj. R-squared: 0.146
Method: Least Squares F-statistic: 7.995
Date: Sun, 15 Oct 2017 Prob (F-statistic): 0.000683
Time: 18:38:43 Log-Likelihood: -294.00
No. Observations: 83 AIC: 594.0
Df Residuals: 80 BIC: 601.3
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 67.5548 2.458 27.489 0.000 62.664 72.445
C(polityscoreg)[T.2.0] -8.9611 2.838 -3.158 0.002 -14.608 -3.314
C(polityscoreg)[T.3.0] -2.0789 2.848 -0.730 0.468 -7.746 3.588
==============================================================================
Omnibus: 4.296 Durbin-Watson: 1.741
Prob(Omnibus): 0.117 Jarque-Bera (JB): 3.341
Skew: -0.366 Prob(JB): 0.188
Kurtosis: 2.344 Cond. No. 5.56
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for life expectancy by polityscore for median or higher urban rates
alcconsumption co2emissions hivrate lifeexpectancy \
polityscoreg
1.0 11.950909 2.952225e+09 54.069091 74.321364
2.0 5.592500 3.133595e+09 25.821667 70.194500
3.0 11.589444 1.394451e+10 6.609444 75.935611
polityscore relectricperperson urbanrate relectric \
polityscoreg
1.0 -7.909091 3626.599374 79.294545 0.909091
2.0 0.416667 533.889590 75.501667 0.833333
3.0 8.981481 1505.305644 73.417407 0.944444
urbanratem
polityscoreg
1.0 1.0
2.0 1.0
3.0 1.0
Means for life expectancy by polityscore for less than median urban rates
alcconsumption co2emissions hivrate lifeexpectancy \
polityscoreg
1.0 4.959167 9.202947e+09 18.866667 67.554833
2.0 4.091111 4.410138e+08 13.361667 58.593778
3.0 6.211429 1.378032e+09 11.028571 65.475914
polityscore relectricperperson urbanrate relectric \
polityscoreg
1.0 -7.250000 311.065276 40.466667 0.833333
2.0 -0.138889 87.447582 34.477778 0.555556
3.0 7.800000 336.195945 37.893714 0.657143
urbanratem
polityscoreg
1.0 0.0
2.0 0.0
3.0 0.0
0 notes
Text
Pearson Correlation Test
I decided to examine the association between urban rate and life expectancy. My code and results are below. The scatter plot shows a general positive linear relationship between life expectancy and urban rate (when we exclude missing values on each variable). However, there is a lot of variance around the predicted line. The Pearson correlation yields an r value of 0.618 between urban rate and life expectancy with a p-value of less than 0.01. This suggests a statistically significant, positive linear relationship between urban rate and life expectancy. Squaring the r value, we derive an r-squared value of 0.38; that means if we know the urban rate, we can predict 38% of the variability we’ll see in life expectancy. The other 62% of variability is unaccounted for.
The coding and results:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder1.csv', low_memory=False)
# setting variables we will be working with to numeric
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['urbanrate']=data['urbanrate'].replace(' ', numpy.nan)
data['lifeexpectancy']=data['lifeexpectancy'].replace(0, numpy.nan)
scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the Association Between Urban Rate and Life Expectancy')
data_clean=data.dropna()
print ('association between urbanrate and life expectancy')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
------------------
association between urbanrate and life expectancy
(0.61870710462448963, 3.028137554378938e-21)
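As a quick sanity check on the interpretation above: scipy.stats.pearsonr returns the r value and p-value as a pair, and squaring r gives the share of variance explained. The paired values below are hypothetical stand-ins for the cleaned urbanrate and lifeexpectancy columns.

```python
import scipy.stats

# Hypothetical paired values standing in for the cleaned
# urbanrate / lifeexpectancy columns
x = [10.0, 25.0, 40.0, 55.0, 70.0, 85.0]
y = [52.0, 58.0, 63.0, 61.0, 72.0, 78.0]

r, p_value = scipy.stats.pearsonr(x, y)
r_squared = r ** 2  # share of variability in y accounted for by x
print(r, p_value, r_squared)

# The post's own numbers: r = 0.6187 squared gives about 0.38
print(round(0.6187 ** 2, 2))
```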
0 notes
Text
Chi Squared Test of Independence
When examining the relationship between polity score (categorical explanatory) and the presence of residential electricity consumption data (categorical response), a chi-square test of independence revealed that among countries with polity scores (my sample), those considered democratic were NOT significantly more likely to have residential electricity data (83%) than those not categorized as democratic (70%) (X2=3.48, 1 df, p-value=0.06).
I did not conduct post hoc comparisons. Though it should be noted I attempted to look at the groupings autocracy (-6 or less), anocracy (-5 to 5), and democracy (6 to 10) in round 2 and found a statistically significant association between polity score group and presence of residential electricity data [X2=9.17, 2 df, p-value=0.01]. However, my code did not work to allow post hoc comparison. I tried a workaround by creating “polityscoreg” in Excel with the following logic:
If polity score is less than or equal to -6, code 1 (autocracy); if not,
If polity score is less than or equal to 5, code 2 (anocracy); if not,
If polity score is less than or equal to 10, code 3 (democracy).
However, something went wrong with my coding and I didn’t try too hard to fix it because I was more focused on democracy-or-not for this exercise. The error was
Using what I know, I would expect the anocracies group to be the source of the significant difference; I found in my previous post that anocracies have statistically significantly different expected life expectancies from both autocracies and democracies. I will try again another day.
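One workaround for the missing post hoc step is to run pairwise 2x2 chi-square tests between each pair of groups and compare the p-values against a Bonferroni-adjusted threshold (0.05/3 for three comparisons). The counts below are hypothetical; the real analysis would use the crosstab built from gapminder1.csv.

```python
import itertools
import pandas
import scipy.stats

# Hypothetical counts per polity group (rows: no relectric data / has data);
# the real analysis would substitute the crosstab from gapminder1.csv
ct = pandas.DataFrame({'autocracy': [8, 13],
                       'anocracy': [13, 35],
                       'democracy': [15, 75]},
                      index=['no data', 'has data'])

# Post hoc: pairwise chi-square tests with a Bonferroni-adjusted threshold
pairs = list(itertools.combinations(ct.columns, 2))
alpha = 0.05 / len(pairs)  # 0.05 / 3, roughly 0.017
for g1, g2 in pairs:
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct[[g1, g2]])
    print(g1, 'vs', g2, 'chi2=%.3f p=%.3f' % (chi2, p),
          'significant' if p < alpha else 'not significant')
```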
the code (week 6, NOT week6 alt view):
""" Created on Sun Oct 1 17:19:04 2017
@author: Tofu """
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder1.csv', low_memory=False)

# setting variables we will be working with to numeric
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')

#exploring life expectancy
data["lifeexpectancy"] = data["lifeexpectancy"].replace(0, numpy.nan)

#making subset of complete data
sub7 = data[(data["polityscore"]<11)]
sub8 = sub7.copy()

#dividing data set into groups
sub8["polityscore"] = pandas.cut(sub8.polityscore, [-10, 5, 10])

#contingency table of observed counts
ct1 = pandas.crosstab(sub8['relectric'], sub8['polityscore'])
print (ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print (colpct)

#chi-square
print ('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print (cs1)

# set variable types
sub8['polityscore'] = sub8['polityscore'].astype('category')
# converting to numeric
sub8['relectric'] = pandas.to_numeric(sub8['relectric'], errors='coerce')

# graph percent with residential electricity data
seaborn.factorplot(x='polityscore', y='relectric', data=sub8, kind='bar', ci=None)
plt.xlabel('polity score group')
plt.ylabel('proportion with residential electricity')
---------------------------the results--------------------------------
runfile('C:/Users/Tofu/Python Project Folder/Week 6.py', wdir='C:/Users/Tofu/Python Project Folder')
polityscore  (-10, 5]  (5, 10]
relectric
0                  21       15
1                  48       75
polityscore  (-10, 5]   (5, 10]
relectric
0            0.304348  0.166667
1            0.695652  0.833333
chi-square value, p value, expected counts
(3.4774549801657448, 0.062210347893879261, 1, array([[ 15.62264151,  20.37735849],
       [ 53.37735849,  69.62264151]]))
0 notes
Text
Testing Life Expectancy and Government Type
Finally testing out significance! Woot!
The code:
# -*- coding: utf-8 -*-
"""
Created on Sun Sep 24 17:12:04 2017
@author: Tofu
"""
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('gapminder1.csv', low_memory=False)
data = pandas.read_csv('gapminder1.csv', low_memory=False)
#converting variables to numeric values
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)

#exploring life expectancy
data["lifeexpectancy"] = data["lifeexpectancy"].replace(0, numpy.nan)

#making subset of complete data
sub5 = data[(data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]
sub6 = sub5.copy()

#dividing data set into groups
sub6["polityscore"] = pandas.cut(sub6.polityscore, [-10, -6, 5, 10])
sub6["polityscore"] = sub6["polityscore"].astype("category")

#setting up the model if it was a binary category
#sub2 = sub6[["polityscore", "lifeexpectancy"]].dropna()
#model1 = smf.ols(formula='lifeexpectancy ~ C(polityscore)', data=sub5)
#results1 = model1.fit()
#print (results1.summary())

#checking the means if it was a binary category
#print ("means of lifeexpectancy by polity score group")
#m1 = sub2.groupby("polityscore").mean()
#print (m1)

#checking the standard deviation if it was a binary category
#print ("standard dev of lifeexpectancy by polity score group")
#s1 = sub2.groupby("polityscore").std()
#print (s1)

sub3 = sub6[["polityscore", "lifeexpectancy"]].dropna()
model2 = smf.ols(formula='lifeexpectancy ~ C(polityscore)', data=sub3).fit()
print (model2.summary())

#running means by polity score
print ("means of lifeexpectancy by polity score groups")
m2 = sub3.groupby('polityscore').mean()
print (m2)

print ("standard devs of lifeexpectancy by polity score groups")
s2 = sub3.groupby('polityscore').std()
print (s2)

#running post hoc test
mc1 = multi.MultiComparison(sub3['lifeexpectancy'], sub3['polityscore'])
res1 = mc1.tukeyhsd()
print (res1.summary())
The results:
OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.214
Model:                            OLS   Adj. R-squared:                  0.203
Method:                 Least Squares   F-statistic:                     21.05
Date:                Sun, 24 Sep 2017   Prob (F-statistic):           8.17e-09
Time:                        19:35:48   Log-Likelihood:                -569.72
No. Observations:                 158   AIC:                             1145.
Df Residuals:                     155   BIC:                             1155.
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================================================
                                                       coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
Intercept                                           70.2815      1.962     35.814      0.000      66.405      74.158
C(polityscore)[T.Interval(-6, 5, closed='right')]   -8.7875      2.353     -3.735      0.000     -13.435      -4.140
C(polityscore)[T.Interval(5, 10, closed='right')]    1.5408      2.182      0.706      0.481      -2.769       5.850
==============================================================================
Omnibus:                       14.489   Durbin-Watson:                   0.989
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.036
Skew:                          -0.775   Prob(JB):                     0.000330
Kurtosis:                       3.183   Cond. No.                         5.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means of lifeexpectancy by polity score groups
             lifeexpectancy
polityscore
(-10, -6]         70.281476
(-6, 5]           61.493958
(5, 10]           71.822247

standard devs of lifeexpectancy by polity score groups
             lifeexpectancy
polityscore
(-10, -6]          6.638839
(-6, 5]            8.988540
(5, 10]            9.448791

Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
  group1    group2   meandiff   lower    upper  reject
----------------------------------------------------
(-10, -6]  (-6, 5]    -8.7875 -14.3559  -3.2192  True
(-10, -6]  (5, 10]     1.5408  -3.6225   6.704   False
 (-6, 5]   (5, 10]    10.3283   6.517   14.1396  True
----------------------------------------------------
Model Interpretation for ANOVA:
When examining the association between life expectancy (the quantitative response) and polity score group (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among the different types of governments in the Gapminder dataset, average life expectancy differs significantly by government type. For example, autocracies have a mean life expectancy of 70.3 years (s.d. ± 6.6 years), while anocracies have a mean of 61.5 years (s.d. ± 9.0 years) and democracies a mean of 71.8 years (s.d. ± 9.4 years). The ANOVA yields an F-statistic of 21.05 with 2 degrees of freedom and 158 observations, which has a p-value of 0.000000000817.
Note that the degrees of freedom I report in parentheses following ‘F’ can be found in the OLS table as Df Model and Df Residuals.
I also did an alternative view to just compare democracies versus non-democracies, and likewise found evidence to reject the null hypothesis that there is no association between democracy and life expectancy. Non-democracies have a mean life expectancy of 64.2 years (s.d. ± 9.2 years) while democracies average 71.8 years (s.d. ± 9.4 years). The resulting F-statistic was 26 (1, 156), which has a p-value of 0.000000098. [These results aren’t shown, but I can provide code and results.]
Model Interpretation for post hoc ANOVA results:
The ANOVA revealed that among countries within the Gapminder dataset, government type (grouped as autocracy, anocracy, or democracy; the categorical explanatory variable) and life expectancy (the quantitative response variable) were significantly associated, F(2, 155) = 21.05, p=0.000000000817. Post hoc comparison of mean life expectancy by polity score group reveals that anocracies’ average life expectancy is significantly different from both autocracies’ and democracies’. However, there is insufficient evidence to suggest that life expectancy differs significantly between autocracies and democracies.
0 notes
Text
Visualize This
This week’s homework went well overall, apart from not being able to do a histogram for life expectancy. From my previous review, I know that the distribution had no pronounced modes, so not visualizing it does not affect me.
The code:
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder1.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

#bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' %x)

#number of observations and variables
print(len(data))
print(len(data.columns))

#converting variables to numeric values
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)

print ("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print (c1)

print ("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p1)

#exploring life expectancy
data["lifeexpectancy"] = data["lifeexpectancy"].replace(0, numpy.nan)

desc2 = data["lifeexpectancy"].describe()
print (desc2)

#making a subset of lifeexpectancy to exclude missing values
sub3 = data[(data["lifeexpectancy"]!=0)]
sub4 = sub3.copy()

#univariate histogram for quantitative variable
sub4["polityscore"] = sub4["polityscore"].convert_objects(convert_numeric=True)
seaborn.distplot(sub4["lifeexpectancy"].dropna(), kde=False)
plt.xlabel("Life Expectancy, years")
plt.title("Estimated Life Expectancy in Countries Within Gapminder Dataset")

#make subset of all polity scores, adjusting out for null values in polity scores and lifeexpectancy
sub5 = data[(data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]
sub6 = sub5.copy()

#recoding them into groups
sub6["polityscore"] = pandas.cut(sub6.polityscore, [-10, -6, 5, 10])

print ("count of polity scores, without missing values")
c7 = sub6["polityscore"].value_counts(sort=False)
print (c7)

#univariate bar graph for categorical variables
#first, change format from numeric to categorical
sub6["polityscore"] = sub6["polityscore"].astype("category")
seaborn.countplot(x="polityscore", data=sub6)
plt.xlabel("polity score")
plt.title("Polity Score in Gapminder Dataset, excluding missing values")

print ("Describe polityscore, group by govt type")
desc1 = sub6["polityscore"].describe()
print (desc1)

print ("counts of polity score, group by govt type")
c8 = sub6["polityscore"].value_counts(sort=False)
print (c8)

#second, rename the group labels with the government type
sub6["polityscore"] = sub6["polityscore"].cat.rename_categories(["Autocracy","Anocracy","Democracy"])
seaborn.countplot(x="polityscore", data=sub6)
plt.xlabel("polity score")
plt.title("Polity Score in Gapminder Dataset, excluding missing values")

# bivariate bar graph C->Q
seaborn.factorplot(x="polityscore", y="lifeexpectancy", data=sub6, kind="bar", ci=None)
plt.xlabel("Polity Score")
plt.ylabel("Life Expectancy, years")
output and summary:
count   191.000000
mean     69.753524
std       9.708621
min      47.794000
25%      64.447000
50%      73.131000
75%      76.593000
max      83.394000
Life expectancy, without missing values, ranges from 47.8 years to 83.4 years. The mean average is 69.8 years, plus or minus 9.7 years. The median is 73.1 years.
A histogram generated in Excel shows a relatively symmetric distribution if we don’t group life expectancy values. Two modes appear at approximately 72.9 and 74 years.
Polity scores are categorical in nature. Countries with observed polity scores are given an integer score from -10 to 10. Autocracies are given a score of -6 or lower, anocracies run from -5 to 5, while democracies have a score of 6 or greater.
Name: lifeexpectancy, dtype: float64

count of polity scores, without missing values
(-10, -6]    21
(-6, 5]      48
(5, 10]      90
Name: polityscore, dtype: int64

Describe polityscore, group by govt type
count         159
unique          3
top       (5, 10]
freq           90
Democracies are the most common type of government in the Gapminder dataset, observed 90 of 159 times among countries with values.
When incorporating life expectancy with polity score groups, we see that there is a difference between them:
The average life expectancy for autocracies is close to 70 years while the average life expectancy in democracies is 72 years. More surprising, life expectancy is lowest in anocracies at approximately 62 years.
My original hypothesis was that democracies would have higher life expectancy, which is supported by their having the highest average life expectancy, but autocracies also have a comparable average. Further testing would be needed to control for other factors and really see if this hypothesis holds for this dataset.
0 notes
Text
Week 3 - Munging Around
Initial thoughts:
The Gapminder dataset didn’t use numeric values to denote missing values, so the first lesson’s approach of using numpy to convert the “missing value” indicator to “nan” wasn’t an option as demonstrated. Additionally, this dataset is compiled from multiple other datasets rather than one survey, so there aren’t any traditional skip-pattern questions. Lesson 3 thoughts: I also opted out of creating secondary variables that would be arithmetic combinations of other variables. I could have looked at carbon dioxide emissions per unit of residential electricity use, but I don’t think that would make sense because one is a cumulative value over time while the other is electricity use in one particular year, and a portion of cumulative emissions was created outside of electricity use. However, the wide range of values for life expectancy and other variables provides an opportunity to recode them into groups.
--Well, my initial plans didn’t work as expected, so I cleaned the data in Excel to deal with missing values before any further management in Python. This was only an option because of the smaller size of the dataset, so I welcome any feedback on how to do it within Python.
Polityscore: replaced missing values with 11. I then had to adjust my democracy subgroup to scores of 6 or greater but less than 11.
Lifeexpectancy: replaced missing or null values with 0.
Relectricperperson: replaced missing value with -1.
Hivrate: replaced missing values with 99.
Co2emissions: replaced missing values with 0.
Alcconsumption: replaced missing values with 99.
Then saved as gapminder1.csv -------------------------------------------------------
The code
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 27 12:53:43 2017
@author: Tofu
"""
import pandas
import numpy
data = pandas.read_csv('gapminder1.csv', low_memory=False)
#bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' %x)
#number of observations and variables
print(len(data))
print(len(data.columns))
#converting variables to numeric values
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
print ("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print (c1)
print ("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p1)
#making a subset of democracies, adjusting for null value in polityscore and lifeexpectancy variables
sub1= data[(data["polityscore"]>= 6) & (data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]
#make a copy of my democracy subset
sub2= sub1.copy()
# frequency distributions of democracy
print ("counts for democracy - polity score 6 or greater, positive life expectancy")
c7 = sub2["polityscore"].value_counts(sort=False)
print (c7)
print ("percentages of democratic govs - polity score 6 or greater, positive life expectancy")
p7=sub2["polityscore"].value_counts(sort=False, normalize=True)
print (p7)
#making a subset of lifeexpectancy to exclude missing values
sub3=data[(data["lifeexpectancy"]!=0)]
sub4=sub3.copy()
#checking my subset of lifeexpectancy
print("counts for lifeexpectancy - without missing values")
c8=sub4["lifeexpectancy"].value_counts(sort=False)
print (c8)
print("counts for lifeexpectancy - with missing values")
c2=data["lifeexpectancy"].value_counts(sort=False)
print (c2)
#quartiles of life in adjusted lifeexpectancy
print ("life expectancy, 4 categories or quartiles, all polity scores")
sub4["lifeexpectancy4"]=pandas.cut(sub4.lifeexpectancy, 4, labels=["1=25%tile", "2=50%tile","3=75%tile", "4=%100tile"])
c4=sub4["lifeexpectancy4"].value_counts(sort=False, dropna=True)
print (c4)
#cross tabs of life expectancy quartiles
print (pandas.crosstab(sub4["lifeexpectancy4"], sub4["lifeexpectancy"]))
#create own lifeexpectancy groups in adjusted subgroup lifeexpectancy for 10 year grouping
sub4["lifeexpectancy5"]=pandas.cut(sub4.lifeexpectancy, [40,50,60,70,80,90])
c5=sub4["lifeexpectancy5"].value_counts(sort=False)
print(c5)
#crosstab of life expectancy quartile with 10 year groupings
print (pandas.crosstab(sub4["lifeexpectancy4"], sub4["lifeexpectancy5"]))
#crosstab of life expectancy 10 year groupings with basic lifeexpectancy
print (pandas.crosstab(sub4["lifeexpectancy5"], sub4["lifeexpectancy"]))
#counts and distribution of democracy in lifeexpectancy
print ("counts of polityscore in lifeexpectancy")
c10 = sub4["polityscore"].value_counts(sort=False)
print (c10)
print ("percentages of polityscore in life expectancy")
p10=sub4["polityscore"].value_counts(sort=False, normalize=True)
print (p10)
#16% of positive life expectancy observations are missing polity scores, so adjust polityscore in lifeexpectancy subset to exclude missing value 11
sub4["polityscore"]=sub4["polityscore"].replace(11, numpy.nan)
print ("counts of polityscore in positive life expectancy,post coding 11 to be nan")
c11=sub4["polityscore"].value_counts(sort=False, dropna=False)
print (c11)
print ("percentages of polity score in positive life expectancy, post coding 11 to be nan")
p11=sub4["polityscore"].value_counts(sort=False, normalize=True, dropna=False)
print (p11)
print ("counts of polityscore in positive lifeexpectancy, no nan")
c12=sub4["polityscore"].value_counts(sort=False, dropna=True)
print (c12)
print ("percentages of polity sore in positive lifeexpectancy, no nan")
p12=sub4["polityscore"].value_counts(sort=False, normalize=True, dropna=True)
print (p12)
#crosstabs of lifeexpectancy, 10 year groups and polity scores
print ("crosstab of positive lifeexpectancy, 10 yr grouping, and polity score, lifeexpectancy as columns")
print (pandas.crosstab(sub4["polityscore"], sub4["lifeexpectancy5"]))
print ("crosstab of polity score and postive lifeexpectancy, 10 yr grouping, polity score as columns")
print (pandas.crosstab(sub4["lifeexpectancy5"],sub4["polityscore"]))
#crosstabs of lifeexpectacncy, 10 year grouping, within democracies
print ("crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as columns")
print (pandas.crosstab(sub2["polityscore"], sub4["lifeexpectancy5"]))
print ("crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as rows")
print (pandas.crosstab(sub4["lifeexpectancy5"], sub2["polityscore"]))
The output (truncated after some initial code):
-9 4
-10 2
Name: polityscore, dtype: int64
percentages of polityscore - democracy score in 2009, numeric
0 0.028169
-2 0.023474
-3 0.028169
-4 0.028169
4 0.018779
5 0.032864
6 0.046948
7 0.061033
8 0.089202
9 0.070423
10 0.154930
11 0.244131
-1 0.018779
2 0.014085
3 0.009390
1 0.014085
-6 0.014085
-7 0.056338
-8 0.009390
-5 0.009390
-9 0.018779
-10 0.009390
Name: polityscore, dtype: float64
counts for democracy - polity score 6 or greater, positive life expectancy
6 10
7 13
8 19
9 15
10 32
Name: polityscore, dtype: int64
percentages of democratic govs - polity score 6 or greater, positive life expectancy
6 0.112360
7 0.146067
8 0.213483
9 0.168539
10 0.359551
Name: polityscore, dtype: float64
counts for lifeexpectancy - without missing values
63.125000 1
79.341000 1
49.553000 1
68.795000 1
58.582000 1
79.977000 1
58.199000 1
80.170000 1
81.012000 1
74.573000 1
70.124000 1
70.563000 1
76.954000 1
48.398000 1
68.944000 1
75.181000 1
81.126000 1
75.956000 1
69.317000 1
65.193000 1
80.557000 1
67.185000 1
73.990000 1
75.901000 1
82.759000 1
79.499000 1
61.597000 1
79.158000 1
71.017000 1
76.546000 1
..
73.703000 1
67.714000 1
51.093000 1
71.172000 1
73.456000 1
74.044000 1
78.005000 1
78.371000 1
76.640000 1
74.788000 1
76.652000 1
82.338000 1
80.499000 1
74.414000 1
75.670000 1
67.017000 1
61.452000 1
68.498000 1
73.127000 1
74.156000 1
75.620000 1
62.791000 1
72.832000 1
62.703000 1
68.749000 1
76.126000 1
81.539000 1
54.210000 1
57.379000 1
73.373000 1
Name: lifeexpectancy, Length: 189, dtype: int64
counts for lifeexpectancy - with missing values
63.125000 1
79.341000 1
0.000000 22
49.553000 1
68.795000 1
58.582000 1
79.977000 1
58.199000 1
80.170000 1
81.012000 1
74.573000 1
70.124000 1
70.563000 1
76.954000 1
48.398000 1
68.944000 1
75.181000 1
81.126000 1
75.956000 1
69.317000 1
65.193000 1
80.557000 1
67.185000 1
73.990000 1
75.901000 1
82.759000 1
79.499000 1
61.597000 1
79.158000 1
71.017000 1
..
73.703000 1
67.714000 1
51.093000 1
71.172000 1
73.456000 1
74.044000 1
78.005000 1
78.371000 1
76.640000 1
74.788000 1
76.652000 1
82.338000 1
80.499000 1
74.414000 1
75.670000 1
67.017000 1
61.452000 1
68.498000 1
73.127000 1
74.156000 1
75.620000 1
62.791000 1
72.832000 1
62.703000 1
68.749000 1
76.126000 1
81.539000 1
54.210000 1
57.379000 1
73.373000 1
Name: lifeexpectancy, Length: 190, dtype: int64
life expectancy, 4 categories or quartiles, all polity scores
1=25%tile 28
2=50%tile 26
3=75%tile 63
4=%100tile 74
Name: lifeexpectancy4, dtype: int64
lifeexpectancy 47.794000 48.132000 48.196000 48.397000 48.398000 \
lifeexpectancy4
1=25%tile 1 1 1 1 1
2=50%tile 0 0 0 0 0
3=75%tile 0 0 0 0 0
4=%100tile 0 0 0 0 0
lifeexpectancy 48.673000 48.718000 49.025000 49.553000 50.239000 \
lifeexpectancy4
1=25%tile 1 1 1 1 1
2=50%tile 0 0 0 0 0
3=75%tile 0 0 0 0 0
4=%100tile 0 0 0 0 0
lifeexpectancy ... 81.404000 81.439000 81.539000 81.618000 \
lifeexpectancy4 ...
1=25%tile ... 0 0 0 0
2=50%tile ... 0 0 0 0
3=75%tile ... 0 0 0 0
4=%100tile ... 1 1 1 1
lifeexpectancy 81.804000 81.855000 81.907000 82.338000 82.759000 \
lifeexpectancy4
1=25%tile 0 0 0 0 0
2=50%tile 0 0 0 0 0
3=75%tile 0 0 0 0 0
4=%100tile 1 1 1 1 1
lifeexpectancy 83.394000
lifeexpectancy4
1=25%tile 0
2=50%tile 0
3=75%tile 0
4=%100tile 1
[4 rows x 189 columns]
(40, 50] 9
(50, 60] 29
(60, 70] 38
(70, 80] 92
(80, 90] 23
Name: lifeexpectancy5, dtype: int64
lifeexpectancy5 (40, 50] (50, 60] (60, 70] (70, 80] (80, 90]
lifeexpectancy4
1=25%tile 9 19 0 0 0
2=50%tile 0 10 16 0 0
3=75%tile 0 0 22 41 0
4=%100tile 0 0 0 51 23
lifeexpectancy 47.794000 48.132000 48.196000 48.397000 48.398000 \
lifeexpectancy5
(40, 50] 1 1 1 1 1
(50, 60] 0 0 0 0 0
(60, 70] 0 0 0 0 0
(70, 80] 0 0 0 0 0
(80, 90] 0 0 0 0 0
lifeexpectancy 48.673000 48.718000 49.025000 49.553000 50.239000 \
lifeexpectancy5
(40, 50] 1 1 1 1 0
(50, 60] 0 0 0 0 1
(60, 70] 0 0 0 0 0
(70, 80] 0 0 0 0 0
(80, 90] 0 0 0 0 0
lifeexpectancy ... 81.404000 81.439000 81.539000 81.618000 \
lifeexpectancy5 ...
(40, 50] ... 0 0 0 0
(50, 60] ... 0 0 0 0
(60, 70] ... 0 0 0 0
(70, 80] ... 0 0 0 0
(80, 90] ... 1 1 1 1
lifeexpectancy 81.804000 81.855000 81.907000 82.338000 82.759000 \
lifeexpectancy5
(40, 50] 0 0 0 0 0
(50, 60] 0 0 0 0 0
(60, 70] 0 0 0 0 0
(70, 80] 0 0 0 0 0
(80, 90] 1 1 1 1 1
lifeexpectancy 83.394000
lifeexpectancy5
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 1
[5 rows x 189 columns]
counts of polityscore in lifeexpectancy
0 6
-2 5
-3 6
-4 6
4 4
5 7
6 10
7 13
8 19
9 15
10 32
11 31
-1 4
2 3
3 2
1 3
-6 3
-7 12
-8 2
-5 2
-9 4
-10 2
Name: polityscore, dtype: int64
percentages of polityscore in life expectancy
0 0.031414
-2 0.026178
-3 0.031414
-4 0.031414
4 0.020942
5 0.036649
6 0.052356
7 0.068063
8 0.099476
9 0.078534
10 0.167539
11 0.162304
-1 0.020942
2 0.015707
3 0.010471
1 0.015707
-6 0.015707
-7 0.062827
-8 0.010471
-5 0.010471
-9 0.020942
-10 0.010471
Name: polityscore, dtype: float64
counts of polityscore in positive life expectancy,post coding 11 to be nan
10.000000 32
8.000000 19
5.000000 7
-3.000000 6
7.000000 13
-4.000000 6
6.000000 10
9.000000 15
nan 31
-10.000000 2
-6.000000 3
-9.000000 4
-7.000000 12
-8.000000 2
2.000000 3
-2.000000 5
0.000000 6
3.000000 2
1.000000 3
4.000000 4
-1.000000 4
-5.000000 2
Name: polityscore, dtype: int64
percentages of polity score in positive life expectancy, post coding 11 to be nan
10.000000 0.167539
8.000000 0.099476
5.000000 0.036649
-3.000000 0.031414
7.000000 0.068063
-4.000000 0.031414
6.000000 0.052356
9.000000 0.078534
nan 0.162304
-10.000000 0.010471
-6.000000 0.015707
-9.000000 0.020942
-7.000000 0.062827
-8.000000 0.010471
2.000000 0.015707
-2.000000 0.026178
0.000000 0.031414
3.000000 0.010471
1.000000 0.015707
4.000000 0.020942
-1.000000 0.020942
-5.000000 0.010471
Name: polityscore, dtype: float64
counts of polityscore in positive lifeexpectancy, no nan
10.000000 32
8.000000 19
5.000000 7
-3.000000 6
7.000000 13
-4.000000 6
6.000000 10
9.000000 15
-10.000000 2
-6.000000 3
-9.000000 4
-7.000000 12
-8.000000 2
2.000000 3
-2.000000 5
0.000000 6
3.000000 2
1.000000 3
4.000000 4
-1.000000 4
-5.000000 2
Name: polityscore, dtype: int64
percentages of polity sore in positive lifeexpectancy, no nan
10.000000 0.200000
8.000000 0.118750
5.000000 0.043750
-3.000000 0.037500
7.000000 0.081250
-4.000000 0.037500
6.000000 0.062500
9.000000 0.093750
-10.000000 0.012500
-6.000000 0.018750
-9.000000 0.025000
-7.000000 0.075000
-8.000000 0.012500
2.000000 0.018750
-2.000000 0.031250
0.000000 0.037500
3.000000 0.012500
1.000000 0.018750
4.000000 0.025000
-1.000000 0.025000
-5.000000 0.012500
Name: polityscore, dtype: float64
crosstab of positive lifeexpectancy, 10 yr grouping, and polity score, lifeexpectancy as columns
lifeexpectancy5 (40, 50] (50, 60] (60, 70] (70, 80] (80, 90]
polityscore
-10.000000 0 0 0 2 0
-9.000000 1 0 3 0 0
-8.000000 0 0 0 2 0
-7.000000 0 0 2 10 0
-6.000000 0 0 2 1 0
-5.000000 0 2 0 0 0
-4.000000 0 3 2 1 0
-3.000000 0 2 1 3 0
-2.000000 1 2 1 0 1
-1.000000 1 3 0 0 0
0.000000 1 3 2 0 0
1.000000 0 2 1 0 0
2.000000 0 1 1 1 0
3.000000 0 0 2 0 0
4.000000 0 1 2 1 0
5.000000 1 1 3 2 0
6.000000 1 3 3 3 0
7.000000 2 4 3 4 0
8.000000 1 1 5 10 2
9.000000 0 1 2 11 1
10.000000 0 0 1 16 15
crosstab of polity score and postive lifeexpectancy, 10 yr grouping, polity score as columns
polityscore -10.000000 -9.000000 -8.000000 -7.000000 -6.000000 \
lifeexpectancy5
(40, 50] 0 1 0 0 0
(50, 60] 0 0 0 0 0
(60, 70] 0 3 0 2 2
(70, 80] 2 0 2 10 1
(80, 90] 0 0 0 0 0
polityscore -5.000000 -4.000000 -3.000000 -2.000000 -1.000000 \
lifeexpectancy5
(40, 50] 0 0 0 1 1
(50, 60] 2 3 2 2 3
(60, 70] 0 2 1 1 0
(70, 80] 0 1 3 0 0
(80, 90] 0 0 0 1 0
polityscore ... 1.000000 2.000000 3.000000 4.000000 \
lifeexpectancy5 ...
(40, 50] ... 0 0 0 0
(50, 60] ... 2 1 0 1
(60, 70] ... 1 1 2 2
(70, 80] ... 0 1 0 1
(80, 90] ... 0 0 0 0
polityscore 5.000000 6.000000 7.000000 8.000000 9.000000 \
lifeexpectancy5
(40, 50] 1 1 2 1 0
(50, 60] 1 3 4 1 1
(60, 70] 3 3 3 5 2
(70, 80] 2 3 4 10 11
(80, 90] 0 0 0 2 1
polityscore 10.000000
lifeexpectancy5
(40, 50] 0
(50, 60] 0
(60, 70] 1
(70, 80] 16
(80, 90] 15
[5 rows x 21 columns]
crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as columns
lifeexpectancy5 (40, 50] (50, 60] (60, 70] (70, 80] (80, 90]
polityscore
6.000000 1 3 3 3 0
7.000000 2 4 3 4 0
8.000000 1 1 5 10 2
9.000000 0 1 2 11 1
10.000000 0 0 1 16 15
crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as rows
polityscore 6.000000 7.000000 8.000000 9.000000 10.000000
lifeexpectancy5
(40, 50] 1 2 1 0 0
(50, 60] 3 4 1 1 0
(60, 70] 3 3 5 2 1
(70, 80] 3 4 10 11 16
(80, 90] 0 0 2 1 15
So I ended up filtering out the null values in my subsets. My first subset sections out democracies from the Gapminder data. I also added an additional filter within democracies to exclude null values of life expectancy (that is, any life expectancy value not equal to zero). Before filtering missing lifeexpectancy values, there were 90 observations for democratic governments; after, 89 observations remained. Of those 89 observations, 36% were coded with a polity score of 10, which I confirmed by poking around in my Excel version (just to explore).
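The filtering itself boils down to boolean masks. A small sketch with made-up numbers (the key detail is that each condition gets its own parentheses before combining with `&`):

```python
import pandas as pd

# made-up rows standing in for gapminder1 (11 = missing polity, 0 = missing life expectancy)
df = pd.DataFrame({'polityscore': [10, 11, 6, -7, 8],
                   'lifeexpectancy': [79.3, 0.0, 48.4, 62.5, 0.0]})

# democracies (6 <= score < 11) with a valid life expectancy
democ = df[(df['polityscore'] >= 6) & (df['polityscore'] < 11)
           & (df['lifeexpectancy'] != 0)]
print(len(democ))  # prints 2
```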
I initially created a subset of lifeexpectancy restricted to valid values (that is, not equal to zero). However, that still leaves a huge range of values, so I first tried cutting the dataset into four groups, which put 28 of the 191 observations in the lowest group, and so on. Even with a crosstab, this was not easy for me to interpret, so I decided to group the values myself.
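One thing worth noting here: `pandas.cut(x, 4)` makes four equal-width bins over the value range, not quartiles, which may be why the four groups came out so unbalanced (28/26/63/74). True equal-sized quartiles come from `pandas.qcut`. A toy comparison (made-up values):

```python
import pandas as pd

life = pd.Series([45.0, 52.0, 58.0, 63.0, 68.0, 72.0, 75.0, 80.0])  # toy values

# cut(..., 4): four equal-WIDTH bins over the 45-80 range -> uneven counts
width_bins = pd.cut(life, 4)
# qcut(..., 4): four equal-SIZED groups, i.e. true quartiles
quartiles = pd.qcut(life, 4)

print(width_bins.value_counts(sort=False))  # counts 2, 1, 2, 3
print(quartiles.value_counts(sort=False))   # counts 2, 2, 2, 2
```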
Using Excel, I found the life expectancy values start in the early 40s and end in the mid 80s. Initially, I tried 5-year groups starting at 40, but I received an error that I could use at most 9 distinct groupings. So I broke the data into 10-year groupings instead. 92 observations were between 70 and 80 years, making up 48% of the dataset regardless of polity score.
When I crosstabbed the life expectancy 10-year groupings against the quartiles, I found the 50-to-70 range falls in the 50th percentile range. In other words, the median life expectancy among observed countries is between 50 and 70 years.
However, the life expectancy subset had not removed missing polity scores. Looking at positive life expectancy across all polity scores, there are 31 observations, or 16%, with null values (coded as “11”). So I converted the “11” polity scores in the life expectancy subset to “nan”. When I exclude the “nan” values, the observation count drops. This is illustrated by the percentages of countries scoring 10 on the polity score: excluding null values, 20% of observations score a 10 (32 observations), versus 17% when null polity scores are included.
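The recode-then-exclude step in miniature, with toy scores and 11 as the missing-value sentinel:

```python
import pandas as pd
import numpy as np

scores = pd.Series([10, 11, 10, 6, 11, -7, 10, 8])  # toy polity scores, 11 = missing

scores = scores.replace(11, np.nan)

# dropna=False keeps NaN as its own category; dropna=True excludes it,
# so each real score's share goes up once the NaNs are out
share_with_nan = scores.value_counts(normalize=True, dropna=False)[10.0]
share_no_nan = scores.value_counts(normalize=True, dropna=True)[10.0]
print(share_with_nan, share_no_nan)  # 0.375 vs 0.5
```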
I ran some additional crosstabs on polity scores and life expectancy (in 10-year groups) to better visualize the data, but I found them not as insightful as a chart visualization would be.
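A crosstab can go straight into a bar chart. A sketch, assuming matplotlib is installed, using two rows lifted from the democracy crosstab above (polity scores 6 and 10):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # draw off-screen, no display needed
import matplotlib.pyplot as plt

# two rows of the polity score x life expectancy grouping crosstab
ct = pd.DataFrame({'(60, 70]': [3, 1], '(70, 80]': [3, 16]}, index=[6, 10])

ax = ct.plot(kind='bar', stacked=True)
ax.set_xlabel('polity score')
ax.set_ylabel('number of countries')
plt.savefig('polity_by_lifeexp.png')
```

`DataFrame.plot(kind='bar', stacked=True)` turns each row into one stacked bar, which makes the distribution across life expectancy groups much easier to compare than the raw table.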
0 notes
Text
Week 2 - Let’s Get Programming
Below is my code. There was some code I removed, such as the percentages for alcohol consumption. I also had code to make additional subsets for anocracies (-6 < x < 6) and autocracies (x < -6) that was not working either. I will reach out to my comrades who know Python better to see what syntax fixes I need on those.
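My best guess at what breaks those subsets: a chained comparison like `-6 < x < 6` doesn't work on a pandas Series; each bound needs its own parenthesized condition joined with `&`. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'polityscore': [10, 3, -2, -8, -10, 7]})  # toy scores

# -6 < x < 6 must be split into two conditions joined with &
anocracies = df[(df['polityscore'] > -6) & (df['polityscore'] < 6)]
# x < -6
autocracies = df[df['polityscore'] < -6]

print(len(anocracies), len(autocracies))  # prints 2 2
```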
———————–
The code:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20 19:02:33 2017
@author: Tofu
"""
import pandas
import numpy
data = pandas.read_csv('gapminder.csv', low_memory=False)
#bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' %x)
#number of observations and variables
print(len(data))
print(len(data.columns))
#converting variables to numeric values
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')
#counts and percentages of variables
print ("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print (c1)
print ("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p1)
print ("counts of lifeexpectancy - 2011 average life expectancy, years")
c2 = data["lifeexpectancy"].value_counts(sort=False)
print (c2)
print ("percentages of lifeexpectancy - 2011 average life expectancy, years")
p2 = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
print (p2)
print ("counts of alcconsumption - 2008 alcohol consumption per adult, litres")
c3 = data["alcconsumption"].value_counts(sort=False)
print (c3)
print ("counts of co2emissions - cumulative emissions in 2006, metric tons")
c4 = data["co2emissions"].value_counts(sort=False)
print (c4)
print ("percentages of co2emissions - cumulative emissions in 2006, metric tons")
p4 = data["co2emissions"].value_counts(sort=False, normalize=True)
print (p4)
print ("counts of relectricperperson - residential electricity used per person in 2008, kWh")
c5 = data["relectricperperson"].value_counts(sort=False)
print (c5)
print ("percentages of relectricperperson - residential electricity used per person in 2008, kWh")
p5 = data["relectricperperson"].value_counts(sort=False, normalize=True)
print (p5)
print ("counts of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV")
c6 = data["hivrate"].value_counts(sort=False)
print (c6)
print ("percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV")
p6 = data["hivrate"].value_counts(sort=False, normalize=True)
print (p6)
#frequency distribution using groupby
ct1 = data.groupby("polityscore").size()
print (ct1)
#making a subset of democracies
sub1 = data[(data["polityscore"] >= 6)]
#make a copy of my democracy subset
sub2 = sub1.copy()
#frequency distributions of different governments
print ("counts for democracy - polity score 6 or greater")
c7 = sub2["polityscore"].value_counts(sort=False)
print (c7)
print ("percentages of democracy - polity score 6 or greater")
p7 = sub2["polityscore"].value_counts(sort=False, normalize=True)
print (p7)
print ("counts of life expectancy in democracies")
c8 = sub2["lifeexpectancy"].value_counts(sort=False, dropna=False)
print (c8)
print ("percentages of life expectancy in democracies")
p8 = sub2["lifeexpectancy"].value_counts(sort=False, normalize=True)
print (p8)
——————–
the output (something weird happened to my output, so I had to rerun an initial subset of my code to get it):
counts of polityscore - democracy score in 2009, numeric 0.000000 6 9.000000 15 2.000000 3 -2.000000 5 8.000000 19 5.000000 7 10.000000 33 -7.000000 12 7.000000 13 3.000000 2 6.000000 10 -4.000000 6 -1.000000 4 -3.000000 6 -5.000000 2 1.000000 3 -6.000000 3 -9.000000 4 4.000000 4 -8.000000 2 -10.000000 2 Name: polityscore, dtype: int64 percentages of polityscore - democracy score in 2009, numeric 0.000000 0.037267 9.000000 0.093168 2.000000 0.018634 -2.000000 0.031056 8.000000 0.118012 5.000000 0.043478 10.000000 0.204969 -7.000000 0.074534 7.000000 0.080745 3.000000 0.012422 6.000000 0.062112 -4.000000 0.037267 -1.000000 0.024845 -3.000000 0.037267 -5.000000 0.012422 1.000000 0.018634 -6.000000 0.018634 -9.000000 0.024845 4.000000 0.024845 -8.000000 0.012422 -10.000000 0.012422 Name: polityscore, dtype: float64 counts of lifeexpectancy - 2011 average life expectancy, years 63.125000 1 79.591000 1 74.576000 1 62.475000 1 74.414000 1 79.977000 1 58.199000 1 75.670000 1 81.012000 1 72.283000 1 55.442000 1 81.855000 1 48.398000 1 68.944000 1 75.133000 1 76.126000 1 69.317000 1 65.193000 1 75.057000 1 77.685000 1 68.498000 1 62.465000 1 79.634000 1 73.911000 1 80.499000 1 61.597000 1 79.341000 1 71.017000 1 82.759000 1 68.978000 1 .. 
76.954000 1 73.703000 1 79.839000 1 48.718000 1 71.172000 1 73.456000 1 48.397000 1 81.439000 1 75.246000 1 55.377000 1 74.788000 1 74.402000 1 82.338000 1 79.499000 1 81.539000 1 54.210000 1 67.017000 1 61.452000 1 73.373000 1 73.127000 1 69.245000 1 68.795000 1 72.832000 1 76.918000 1 57.937000 1 73.126000 1 64.666000 1 75.956000 1 57.379000 1 50.239000 1 Name: lifeexpectancy, Length: 189, dtype: int64 percentages of lifeexpectancy - 2011 average life expectancy, years 63.125000 0.005236 79.591000 0.005236 74.576000 0.005236 62.475000 0.005236 74.414000 0.005236 79.977000 0.005236 58.199000 0.005236 75.670000 0.005236 81.012000 0.005236 72.283000 0.005236 55.442000 0.005236 81.855000 0.005236 48.398000 0.005236 68.944000 0.005236 75.133000 0.005236 76.126000 0.005236 69.317000 0.005236 65.193000 0.005236 75.057000 0.005236 77.685000 0.005236 68.498000 0.005236 62.465000 0.005236 79.634000 0.005236 73.911000 0.005236 80.499000 0.005236 61.597000 0.005236 79.341000 0.005236 71.017000 0.005236 82.759000 0.005236 68.978000 0.005236
76.954000 0.005236 73.703000 0.005236 79.839000 0.005236 48.718000 0.005236 71.172000 0.005236 73.456000 0.005236 48.397000 0.005236 81.439000 0.005236 75.246000 0.005236 55.377000 0.005236 74.788000 0.005236 74.402000 0.005236 82.338000 0.005236 79.499000 0.005236 81.539000 0.005236 54.210000 0.005236 67.017000 0.005236 61.452000 0.005236 73.373000 0.005236 73.127000 0.005236 69.245000 0.005236 68.795000 0.005236 72.832000 0.005236 76.918000 0.005236 57.937000 0.005236 73.126000 0.005236 64.666000 0.005236 75.956000 0.005236 57.379000 0.005236 50.239000 0.005236 Name: lifeexpectancy, Length: 189, dtype: float64 counts of alcconsumption - 2008 alcohol consumption per adult, litres 15.000000 1 5.250000 1 3.990000 1 9.750000 1 0.500000 1 9.500000 1 6.560000 1 5.000000 1 4.990000 1 4.430000 1 11.010000 1 5.120000 1 7.790000 1 1.870000 1 5.920000 2 0.920000 1 3.020000 1 6.990000 1 12.050000 1 12.020000 1 3.610000 1 12.480000 1 0.280000 1 8.680000 1 0.520000 1 13.310000 1 11.410000 1 0.340000 2 9.720000 1 4.390000 1 .. 0.560000 1 7.300000 1 1.320000 1 6.420000 1 3.880000 1 10.620000 1 9.860000 1 8.550000 1 0.650000 1 10.710000 1 12.840000 1 1.290000 1 3.390000 2 10.080000 1 2.270000 1 9.460000 1 8.170000 1 1.030000 1 5.050000 1 6.660000 1 3.110000 1 7.320000 1 2.760000 1 1.640000 1 0.050000 1 16.300000 1 5.210000 1 0.320000 1 9.480000 1 8.690000 1
1718339333.333330 1 2251333.333333 1 72524250333.333298 1 248358000.000000 1 2329308666.666670 1 2401666.666667 1 .. 21351000.000000 1 2335666.666667 1 13304503666.666700 1 1414031666.666670 1 95256333.333333 1 81191000.000000 1 511107666.666667 1 30800000.000000 1 7315000.000000 1 28490000.000000 1 1839471333.333330 1 127108666.666667 1 3157700333.333330 1 78943333.333333 1 236419333.333333 1 132025666.666667 1 1146277000.000000 1 1436893333.333330 1 5214000.000000 1 3503877666.666670 1 7813666.666667 1 33341634333.333302 1 4814333.333333 1 8231666.666667 1 7601000.000000 1 20152000.000000 1 149904333.333333 1 7861553333.333330 1 322960000.000000 1 35717000.000000 1 Name: co2emissions, Length: 200, dtype: int64 percentages of co2emissions - cumulative emissions in 2006, metric tons 4286590000.000000 0.005000 8092333.333333 0.005000 1045000.000000 0.005000 23404568000.000000 0.005000 5872119000.000000 0.005000 1548044666.666670 0.005000 9155666.666667 0.005000 277170666.666667 0.005000 29758666.666667 0.005000 119958666.666667 0.005000 850666.666667 0.005000 148470666.666667 0.005000 590219666.666666 0.005000 4200940333.333330 0.005000 340090666.666667 0.005000 1286670000.000000 0.005000 14058000.000000 0.005000 41229554666.666702 0.005000 598774000.000000 0.005000 377303666.666667 0.005000 7355333.333333 0.005000 26209333.333333 0.005000 2008116000.000000 0.005000 446365333.333333 0.005000 1718339333.333330 0.005000 2251333.333333 0.005000 72524250333.333298 0.005000 248358000.000000 0.005000 2329308666.666670 0.005000 2401666.666667 0.005000
21351000.000000 0.005000 2335666.666667 0.005000 13304503666.666700 0.005000 1414031666.666670 0.005000 95256333.333333 0.005000 81191000.000000 0.005000 511107666.666667 0.005000 30800000.000000 0.005000 7315000.000000 0.005000 28490000.000000 0.005000 1839471333.333330 0.005000 127108666.666667 0.005000 3157700333.333330 0.005000 78943333.333333 0.005000 236419333.333333 0.005000 132025666.666667 0.005000 1146277000.000000 0.005000 1436893333.333330 0.005000 5214000.000000 0.005000 3503877666.666670 0.005000 7813666.666667 0.005000 33341634333.333302 0.005000 4814333.333333 0.005000 8231666.666667 0.005000 7601000.000000 0.005000 20152000.000000 0.005000 149904333.333333 0.005000 7861553333.333330 0.005000 322960000.000000 0.005000 35717000.000000 0.005000 Name: co2emissions, Length: 200, dtype: float64 counts of relectricperperson - residential electricity used per person in 2008, kWh 0.000000 5 1920.962215 1 2826.044873 1 55.794744 1 2124.608816 1 528.648051 1 2993.092660 1 187.324882 1 1494.410268 1 15.056236 1 528.787350 1 825.941111 1 314.826200 1 1585.174739 1 1490.056909 1 186.925515 1 368.434606 1 1884.299342 1 815.031091 1 70.387444 1 59.551245 1 1933.945615 1 767.970324 1 913.845660 1 31.544564 1 2123.762863 1 51.581320 1 753.209802 1 921.562111 1 4036.953993 1 .. 
304.940115 1 209.094517 1 41.180003 1 920.137600 1 1831.731848 1 1690.718434 1 168.623031 1 768.428300 1 614.907287 1 4759.453844 1 38.634503 1 1411.230532 1 532.515177 1 1142.309009 1 2261.316713 1 20.288131 1 256.099151 1 404.591365 1 590.509814 1 325.839561 1 3433.932449 1 636.341383 1 38.005637 1 31.386838 1 537.104738 1 7432.130852 1 351.166594 1 97.246492 1 9.192395 1 1259.392457 1 Name: relectricperperson, Length: 132, dtype: int64 percentages of relectricperperson - residential electricity used per person in 2008, kWh 0.000000 0.036765 1920.962215 0.007353 2826.044873 0.007353 55.794744 0.007353 2124.608816 0.007353 528.648051 0.007353 2993.092660 0.007353 187.324882 0.007353 1494.410268 0.007353 15.056236 0.007353 528.787350 0.007353 825.941111 0.007353 314.826200 0.007353 1585.174739 0.007353 1490.056909 0.007353 186.925515 0.007353 368.434606 0.007353 1884.299342 0.007353 815.031091 0.007353 70.387444 0.007353 59.551245 0.007353 1933.945615 0.007353 767.970324 0.007353 913.845660 0.007353 31.544564 0.007353 2123.762863 0.007353 51.581320 0.007353 753.209802 0.007353 921.562111 0.007353 4036.953993 0.007353
304.940115 0.007353 209.094517 0.007353 41.180003 0.007353 920.137600 0.007353 1831.731848 0.007353 1690.718434 0.007353 168.623031 0.007353 768.428300 0.007353 614.907287 0.007353 4759.453844 0.007353 38.634503 0.007353 1411.230532 0.007353 532.515177 0.007353 1142.309009 0.007353 2261.316713 0.007353 20.288131 0.007353 256.099151 0.007353 404.591365 0.007353 590.509814 0.007353 325.839561 0.007353 3433.932449 0.007353 636.341383 0.007353 38.005637 0.007353 31.386838 0.007353 537.104738 0.007353 7432.130852 0.007353 351.166594 0.007353 97.246492 0.007353 9.192395 0.007353 1259.392457 0.007353 Name: relectricperperson, Length: 132, dtype: float64 counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV 2.000000 2 0.500000 5 2.500000 2 5.000000 1 1.500000 2 11.000000 1 1.300000 2 1.000000 4 11.500000 1 6.500000 1 13.500000 1 3.600000 1 17.800000 1 3.200000 1 14.300000 1 1.400000 1 0.100000 28 2.300000 1 3.300000 1 1.900000 1 0.700000 3 25.900000 1 5.600000 1 0.200000 15 0.400000 9 0.060000 16 0.800000 5 0.300000 10 3.100000 1 1.200000 4 5.300000 1 24.800000 1 3.400000 3 1.700000 1 23.600000 1 6.300000 1 0.450000 1 0.600000 3 13.100000 1 4.700000 1 2.900000 1 1.600000 1 0.900000 4 5.200000 1 1.100000 2 1.800000 1 Name: hivrate, dtype: int64 percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV 2.000000 0.013605 0.500000 0.034014 2.500000 0.013605 5.000000 0.006803 1.500000 0.013605 11.000000 0.006803 1.300000 0.013605 1.000000 0.027211 11.500000 0.006803 6.500000 0.006803 13.500000 0.006803 3.600000 0.006803 17.800000 0.006803 3.200000 0.006803 14.300000 0.006803 1.400000 0.006803 0.100000 0.190476 2.300000 0.006803 3.300000 0.006803 1.900000 0.006803 0.700000 0.020408 25.900000 0.006803 5.600000 0.006803 0.200000 0.102041 0.400000 0.061224 0.060000 0.108844 0.800000 0.034014 0.300000 0.068027 3.100000 0.006803 1.200000 0.027211 5.300000 0.006803 24.800000 0.006803 3.400000 0.020408 1.700000 0.006803 23.600000 0.006803 
6.300000 0.006803 0.450000 0.006803 0.600000 0.020408 13.100000 0.006803 4.700000 0.006803 2.900000 0.006803 1.600000 0.006803 0.900000 0.027211 5.200000 0.006803 1.100000 0.013605 1.800000 0.006803 Name: hivrate, dtype: float64 polityscore -10.000000 2 -9.000000 4 -8.000000 2 -7.000000 12 -6.000000 3 -5.000000 2 -4.000000 6 -3.000000 6 -2.000000 5 -1.000000 4 0.000000 6 1.000000 3 2.000000 3 3.000000 2 4.000000 4 5.000000 7 6.000000 10 7.000000 13 8.000000 19 9.000000 15 10.000000 33 dtype: int64 counts for democracy - polity score 6 or greater 9.000000 15 8.000000 19 10.000000 33 7.000000 13 6.000000 10 Name: polityscore, dtype: int64 percentages of democracy - polity score 6 or greater 9.000000 0.166667 8.000000 0.211111 10.000000 0.366667 7.000000 0.144444 6.000000 0.111111 Name: polityscore, dtype: float64 counts of life expectancy in democracies nan 1 79.591000 1 77.005000 1 79.915000 1 48.196000 1 80.642000 1 74.414000 1 79.977000 1 74.847000 1 81.012000 1 79.499000 1 81.855000 1 80.654000 1 80.734000 1 69.317000 1 80.557000 1 79.120000 1 68.498000 1 66.618000 1 75.446000 1 81.097000 1 81.618000 1 79.341000 1 74.825000 1 67.852000 1 62.465000 1 73.737000 1 73.488000 1 74.221000 1 81.907000 1 .. 
73.703000 1 73.339000 1 71.172000 1 81.439000 1 74.044000 1 76.126000 1 68.494000 1 73.371000 1 72.640000 1 80.170000 1 78.531000 1 57.134000 1 51.444000 1 81.539000 1 68.795000 1 74.522000 1 81.404000 1 73.373000 1 73.127000 1 83.394000 1 72.231000 1 54.210000 1 64.228000 1 76.918000 1 68.749000 1 73.126000 1 80.414000 1 65.438000 1 80.854000 1 73.396000 1 Name: lifeexpectancy, Length: 89, dtype: int64 percentages of life expectancy in democracies 79.591000 0.011236 77.005000 0.011236 79.915000 0.011236 48.196000 0.011236 80.642000 0.011236 74.414000 0.011236 79.977000 0.011236 74.847000 0.011236 81.012000 0.011236 79.499000 0.011236 81.855000 0.011236 80.654000 0.011236 80.734000 0.011236 69.317000 0.011236 80.557000 0.011236 79.120000 0.011236 68.498000 0.011236 66.618000 0.011236 75.446000 0.011236 81.097000 0.011236 81.618000 0.011236 79.341000 0.011236 74.825000 0.011236 67.852000 0.011236 62.465000 0.011236 73.737000 0.011236 73.488000 0.011236 74.221000 0.011236 81.907000 0.011236 69.366000 0.011236
73.703000 0.011236 73.339000 0.011236 71.172000 0.011236 81.439000 0.011236 74.044000 0.011236 76.126000 0.011236 68.494000 0.011236 73.371000 0.011236 72.640000 0.011236 80.170000 0.011236 78.531000 0.011236 57.134000 0.011236 51.444000 0.011236 81.539000 0.011236 68.795000 0.011236 74.522000 0.011236 81.404000 0.011236 73.373000 0.011236 73.127000 0.011236 83.394000 0.011236 72.231000 0.011236 54.210000 0.011236 64.228000 0.011236 76.918000 0.011236 68.749000 0.011236 73.126000 0.011236 80.414000 0.011236 65.438000 0.011236 80.854000 0.011236 73.396000 0.011236 Name: lifeexpectancy, Length: 88, dtype: float64 C:/Users/Tofu/Python Project Folder/Week 2_amended.py:20: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['polityscore’] = data['polityscore’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:21: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['lifeexpectancy’] = data['lifeexpectancy’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:22: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['alcconsumption’] = data['alcconsumption’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:23: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['co2emissions’] = data['co2emissions’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:24: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. 
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:25: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
------------------------ written univariate analysis--------------
Polityscore is one of my main variables of interest. The Polity IV Project (http://www.systemicpeace.org/polityproject.html) identifies democracies with a score of 6 or greater, anocracies with scores between -5 and 5, and autocracies with scores of -6 or below. I used Excel to help refine the results above
The GapMinder dataset has 213 observations, 52 of which lack a polity score. Democracies account for 90 observations, or 42.3% of the dataset. There are 48 observations for anocracies, which make up 22.5% of the data. The remainder are autocracies: 23 observations, or 10.8% of the dataset.
The other variable of main interest was life expectancy, which is a continuous, positive variable. I summarized it with a large pivot table in Excel; values fall between 47.94 and 83.94 years. Of the 213 observations, 22 have no life expectancy value. Since only a few values occur more than once, each non-missing value accounts for less than 1% of observations.
Combining both, as seen in the pivot below (it should also appear in the output above), the average life expectancy by polity score ranges from 53.7 years (polity score -1) to 78.8 years (polity score 10). Considering this range of values, it is not yet clear whether democracies have longer life expectancies.
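The Excel pivot of average life expectancy by polity score can also be produced in pandas with a groupby. A minimal sketch, using a toy frame (the numbers here are invented so the group means reproduce the two averages quoted above; they are not the real GapMinder data):

```python
import pandas as pd

# Toy stand-in for the GapMinder data; values chosen so the group
# means match the 53.7 and 78.8 figures quoted above
df = pd.DataFrame({
    "polityscore":    [10,   10,   -1,   -1],
    "lifeexpectancy": [80.0, 77.6, 55.0, 52.4],
})

# Equivalent of the Excel pivot: mean life expectancy per polity score
print(df.groupby("polityscore")["lifeexpectancy"].mean())
```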
Another variable we are looking at is HIV rate, which is defined as the percentage of the population living with HIV among adults aged 15 to 49. 66 of the 213 observations have no value. The remaining 69% of observations range from 0.06% to 25.9% of a country's population living with HIV.
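The three regime categories described above could also be recoded directly in pandas instead of Excel. A sketch, assuming the cutoffs used in this post (autocracy at -6 or below, anocracy -5 to 5, democracy 6 or above) and a small made-up set of scores:

```python
import pandas as pd

# Made-up polity scores; the real values come from gapminder.csv
scores = pd.Series([-10, -7, -6, -5, 0, 5, 6, 8, 10], name="polityscore")

# Bin edges follow the cutoffs above: (-11, -6] autocracy,
# (-6, 5] anocracy, (5, 10] democracy
regime = pd.cut(scores, bins=[-11, -6, 5, 10],
                labels=["autocracy", "anocracy", "democracy"])

print(regime.value_counts())
print(regime.value_counts(normalize=True))
```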
0 notes
Text
Week 2 - Let’s Get Programming
Below is my code. There was some code I removed, such as the percentages for alcohol consumption. I also had some code to make subsets for anocracies (-6<x<6) and autocracies (x<-6) that was not working either. I will reach out to my comrades who know Python better to see what syntax fixes I need on those.
-----------------------
The code:
# -*- coding: utf-8 -*- """ Created on Sun Aug 20 19:02:33 2017
@author: Tofu """ import pandas import numpy
data = pandas.read_csv('gapminder.csv', low_memory=False)
#bug fix for display formats to avoid run time errors - put after loading data above pandas.set_option('display.float_format', lambda x: '%f' %x)
#number of observations and variables print(len(data)) print(len(data.columns))
#converting variables to numeric values data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True) data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True) data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True) data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True) data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True) data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
#counts and percentages of variables print ("counts of polityscore - democracy score in 2009, numeric") c1 = data["polityscore"].value_counts(sort=False) print (c1)
print ("percentages of polityscore - democracy score in 2009, numeric") p1 = data["polityscore"].value_counts(sort=False, normalize=True) print (p1)
print ("counts of lifeexpectancy - 2011 average life expectancy, years") c2 = data["lifeexpectancy"].value_counts(sort=False) print (c2)
print ("percentages of lifeexpectancy - 2011 average life expectancy, years") p2 = data["lifeexpectancy"].value_counts(sort=False, normalize=True) print (p2)
print ("counts of alcconsumption - 2008 alcohol consumption per adult, litres") c3 = data["alcconsumption"].value_counts(sort=False) print (c3)
print ("counts of co2emissions - cumulative emissions in 2006, metric tons") c4 = data["co2emissions"].value_counts(sort=False) print (c4)
print ("percentages of co2emissions - cumulative emissions in 2006, metric tons") p4 = data["co2emissions"].value_counts(sort=False, normalize=True) print (p4)
print ("counts of relectricperperson - residential electricity used per person in 2008, kWh") c5 = data["relectricperperson"].value_counts(sort=False) print (c5)
print ("percentages of relectricperperson - residential electricity used per person in 2008, kWh") p5 = data["relectricperperson"].value_counts(sort=False, normalize=True) print (p5)
print ("counts of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV") c6 = data["hivrate"].value_counts(sort=False) print (c6)
print ("percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV") p6= data["hivrate"].value_counts(sort=False, normalize=True) print (p6)
# Frequency distribution by group ct1=data.groupby("polityscore").size() print (ct1)
#making a subset of democracies sub1= data[(data["polityscore"]>= 6)]
#make a copy of my democracy subset sub2= sub1.copy()
# frequency distributions of different governments print ("counts for democracy - polity score 6 or greater") c7 = sub2["polityscore"].value_counts(sort=False) print (c7)
print ("percentages of democracy - polity score 6 or greater") p7 = sub2["polityscore"].value_counts(sort=False, normalize=True) print (p7)
print ("counts of life expectancy in democracies") c8 = sub2["lifeexpectancy"].value_counts(sort=False, dropna=False) print (c8)
print ("percentages of life expectancy in democracies") p8 = sub2["lifeexpectancy"].value_counts(sort=False, normalize=True) print (p8)
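For the anocracy and autocracy subsets that were not working: a chained comparison like -6<x<6 is not valid on a pandas Series, so the two conditions have to be combined with & (each side parenthesized). A sketch on a made-up frame (I use <= -6 for autocracies here, which matches the 23 observations counted in the earlier post):

```python
import pandas as pd

# Made-up stand-in for the gapminder data
data = pd.DataFrame({"polityscore": [-10, -7, -6, -5, 0, 5, 6, 10]})

# Chained comparisons (-6 < x < 6) raise an error on a Series;
# combine two boolean masks with & instead
sub_anocracy = data[(data["polityscore"] > -6) & (data["polityscore"] < 6)].copy()
sub_autocracy = data[data["polityscore"] <= -6].copy()

print(len(sub_anocracy), len(sub_autocracy))
```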
--------------------
the output (something weird happened to my output, so I had to rerun an initial subset of my code to get it):
counts of polityscore - democracy score in 2009, numeric 0.000000 6 9.000000 15 2.000000 3 -2.000000 5 8.000000 19 5.000000 7 10.000000 33 -7.000000 12 7.000000 13 3.000000 2 6.000000 10 -4.000000 6 -1.000000 4 -3.000000 6 -5.000000 2 1.000000 3 -6.000000 3 -9.000000 4 4.000000 4 -8.000000 2 -10.000000 2 Name: polityscore, dtype: int64 percentages of polityscore - democracy score in 2009, numeric 0.000000 0.037267 9.000000 0.093168 2.000000 0.018634 -2.000000 0.031056 8.000000 0.118012 5.000000 0.043478 10.000000 0.204969 -7.000000 0.074534 7.000000 0.080745 3.000000 0.012422 6.000000 0.062112 -4.000000 0.037267 -1.000000 0.024845 -3.000000 0.037267 -5.000000 0.012422 1.000000 0.018634 -6.000000 0.018634 -9.000000 0.024845 4.000000 0.024845 -8.000000 0.012422 -10.000000 0.012422 Name: polityscore, dtype: float64 counts of lifeexpectancy - 2011 average life expectancy, years 63.125000 1 79.591000 1 74.576000 1 62.475000 1 74.414000 1 79.977000 1 58.199000 1 75.670000 1 81.012000 1 72.283000 1 55.442000 1 81.855000 1 48.398000 1 68.944000 1 75.133000 1 76.126000 1 69.317000 1 65.193000 1 75.057000 1 77.685000 1 68.498000 1 62.465000 1 79.634000 1 73.911000 1 80.499000 1 61.597000 1 79.341000 1 71.017000 1 82.759000 1 68.978000 1 .. 
76.954000 1 73.703000 1 79.839000 1 48.718000 1 71.172000 1 73.456000 1 48.397000 1 81.439000 1 75.246000 1 55.377000 1 74.788000 1 74.402000 1 82.338000 1 79.499000 1 81.539000 1 54.210000 1 67.017000 1 61.452000 1 73.373000 1 73.127000 1 69.245000 1 68.795000 1 72.832000 1 76.918000 1 57.937000 1 73.126000 1 64.666000 1 75.956000 1 57.379000 1 50.239000 1 Name: lifeexpectancy, Length: 189, dtype: int64 percentages of lifeexpectancy - 2011 average life expectancy, years 63.125000 0.005236 79.591000 0.005236 74.576000 0.005236 62.475000 0.005236 74.414000 0.005236 79.977000 0.005236 58.199000 0.005236 75.670000 0.005236 81.012000 0.005236 72.283000 0.005236 55.442000 0.005236 81.855000 0.005236 48.398000 0.005236 68.944000 0.005236 75.133000 0.005236 76.126000 0.005236 69.317000 0.005236 65.193000 0.005236 75.057000 0.005236 77.685000 0.005236 68.498000 0.005236 62.465000 0.005236 79.634000 0.005236 73.911000 0.005236 80.499000 0.005236 61.597000 0.005236 79.341000 0.005236 71.017000 0.005236 82.759000 0.005236 68.978000 0.005236
76.954000 0.005236 73.703000 0.005236 79.839000 0.005236 48.718000 0.005236 71.172000 0.005236 73.456000 0.005236 48.397000 0.005236 81.439000 0.005236 75.246000 0.005236 55.377000 0.005236 74.788000 0.005236 74.402000 0.005236 82.338000 0.005236 79.499000 0.005236 81.539000 0.005236 54.210000 0.005236 67.017000 0.005236 61.452000 0.005236 73.373000 0.005236 73.127000 0.005236 69.245000 0.005236 68.795000 0.005236 72.832000 0.005236 76.918000 0.005236 57.937000 0.005236 73.126000 0.005236 64.666000 0.005236 75.956000 0.005236 57.379000 0.005236 50.239000 0.005236 Name: lifeexpectancy, Length: 189, dtype: float64 counts of alcconsumption - 2008 alcohol consumption per adult, litres 15.000000 1 5.250000 1 3.990000 1 9.750000 1 0.500000 1 9.500000 1 6.560000 1 5.000000 1 4.990000 1 4.430000 1 11.010000 1 5.120000 1 7.790000 1 1.870000 1 5.920000 2 0.920000 1 3.020000 1 6.990000 1 12.050000 1 12.020000 1 3.610000 1 12.480000 1 0.280000 1 8.680000 1 0.520000 1 13.310000 1 11.410000 1 0.340000 2 9.720000 1 4.390000 1 .. 0.560000 1 7.300000 1 1.320000 1 6.420000 1 3.880000 1 10.620000 1 9.860000 1 8.550000 1 0.650000 1 10.710000 1 12.840000 1 1.290000 1 3.390000 2 10.080000 1 2.270000 1 9.460000 1 8.170000 1 1.030000 1 5.050000 1 6.660000 1 3.110000 1 7.320000 1 2.760000 1 1.640000 1 0.050000 1 16.300000 1 5.210000 1 0.320000 1 9.480000 1 8.690000 1
1718339333.333330 1 2251333.333333 1 72524250333.333298 1 248358000.000000 1 2329308666.666670 1 2401666.666667 1 .. 21351000.000000 1 2335666.666667 1 13304503666.666700 1 1414031666.666670 1 95256333.333333 1 81191000.000000 1 511107666.666667 1 30800000.000000 1 7315000.000000 1 28490000.000000 1 1839471333.333330 1 127108666.666667 1 3157700333.333330 1 78943333.333333 1 236419333.333333 1 132025666.666667 1 1146277000.000000 1 1436893333.333330 1 5214000.000000 1 3503877666.666670 1 7813666.666667 1 33341634333.333302 1 4814333.333333 1 8231666.666667 1 7601000.000000 1 20152000.000000 1 149904333.333333 1 7861553333.333330 1 322960000.000000 1 35717000.000000 1 Name: co2emissions, Length: 200, dtype: int64 percentages of co2emissions - cumulative emissions in 2006, metric tons 4286590000.000000 0.005000 8092333.333333 0.005000 1045000.000000 0.005000 23404568000.000000 0.005000 5872119000.000000 0.005000 1548044666.666670 0.005000 9155666.666667 0.005000 277170666.666667 0.005000 29758666.666667 0.005000 119958666.666667 0.005000 850666.666667 0.005000 148470666.666667 0.005000 590219666.666666 0.005000 4200940333.333330 0.005000 340090666.666667 0.005000 1286670000.000000 0.005000 14058000.000000 0.005000 41229554666.666702 0.005000 598774000.000000 0.005000 377303666.666667 0.005000 7355333.333333 0.005000 26209333.333333 0.005000 2008116000.000000 0.005000 446365333.333333 0.005000 1718339333.333330 0.005000 2251333.333333 0.005000 72524250333.333298 0.005000 248358000.000000 0.005000 2329308666.666670 0.005000 2401666.666667 0.005000
21351000.000000 0.005000 2335666.666667 0.005000 13304503666.666700 0.005000 1414031666.666670 0.005000 95256333.333333 0.005000 81191000.000000 0.005000 511107666.666667 0.005000 30800000.000000 0.005000 7315000.000000 0.005000 28490000.000000 0.005000 1839471333.333330 0.005000 127108666.666667 0.005000 3157700333.333330 0.005000 78943333.333333 0.005000 236419333.333333 0.005000 132025666.666667 0.005000 1146277000.000000 0.005000 1436893333.333330 0.005000 5214000.000000 0.005000 3503877666.666670 0.005000 7813666.666667 0.005000 33341634333.333302 0.005000 4814333.333333 0.005000 8231666.666667 0.005000 7601000.000000 0.005000 20152000.000000 0.005000 149904333.333333 0.005000 7861553333.333330 0.005000 322960000.000000 0.005000 35717000.000000 0.005000 Name: co2emissions, Length: 200, dtype: float64 counts of relectricperperson - residential electricity used per person in 2008, kWh 0.000000 5 1920.962215 1 2826.044873 1 55.794744 1 2124.608816 1 528.648051 1 2993.092660 1 187.324882 1 1494.410268 1 15.056236 1 528.787350 1 825.941111 1 314.826200 1 1585.174739 1 1490.056909 1 186.925515 1 368.434606 1 1884.299342 1 815.031091 1 70.387444 1 59.551245 1 1933.945615 1 767.970324 1 913.845660 1 31.544564 1 2123.762863 1 51.581320 1 753.209802 1 921.562111 1 4036.953993 1 .. 
304.940115 1 209.094517 1 41.180003 1 920.137600 1 1831.731848 1 1690.718434 1 168.623031 1 768.428300 1 614.907287 1 4759.453844 1 38.634503 1 1411.230532 1 532.515177 1 1142.309009 1 2261.316713 1 20.288131 1 256.099151 1 404.591365 1 590.509814 1 325.839561 1 3433.932449 1 636.341383 1 38.005637 1 31.386838 1 537.104738 1 7432.130852 1 351.166594 1 97.246492 1 9.192395 1 1259.392457 1 Name: relectricperperson, Length: 132, dtype: int64 percentages of relectricperperson - residential electricity used per person in 2008, kWh 0.000000 0.036765 1920.962215 0.007353 2826.044873 0.007353 55.794744 0.007353 2124.608816 0.007353 528.648051 0.007353 2993.092660 0.007353 187.324882 0.007353 1494.410268 0.007353 15.056236 0.007353 528.787350 0.007353 825.941111 0.007353 314.826200 0.007353 1585.174739 0.007353 1490.056909 0.007353 186.925515 0.007353 368.434606 0.007353 1884.299342 0.007353 815.031091 0.007353 70.387444 0.007353 59.551245 0.007353 1933.945615 0.007353 767.970324 0.007353 913.845660 0.007353 31.544564 0.007353 2123.762863 0.007353 51.581320 0.007353 753.209802 0.007353 921.562111 0.007353 4036.953993 0.007353
304.940115 0.007353 209.094517 0.007353 41.180003 0.007353 920.137600 0.007353 1831.731848 0.007353 1690.718434 0.007353 168.623031 0.007353 768.428300 0.007353 614.907287 0.007353 4759.453844 0.007353 38.634503 0.007353 1411.230532 0.007353 532.515177 0.007353 1142.309009 0.007353 2261.316713 0.007353 20.288131 0.007353 256.099151 0.007353 404.591365 0.007353 590.509814 0.007353 325.839561 0.007353 3433.932449 0.007353 636.341383 0.007353 38.005637 0.007353 31.386838 0.007353 537.104738 0.007353 7432.130852 0.007353 351.166594 0.007353 97.246492 0.007353 9.192395 0.007353 1259.392457 0.007353 Name: relectricperperson, Length: 132, dtype: float64 counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV 2.000000 2 0.500000 5 2.500000 2 5.000000 1 1.500000 2 11.000000 1 1.300000 2 1.000000 4 11.500000 1 6.500000 1 13.500000 1 3.600000 1 17.800000 1 3.200000 1 14.300000 1 1.400000 1 0.100000 28 2.300000 1 3.300000 1 1.900000 1 0.700000 3 25.900000 1 5.600000 1 0.200000 15 0.400000 9 0.060000 16 0.800000 5 0.300000 10 3.100000 1 1.200000 4 5.300000 1 24.800000 1 3.400000 3 1.700000 1 23.600000 1 6.300000 1 0.450000 1 0.600000 3 13.100000 1 4.700000 1 2.900000 1 1.600000 1 0.900000 4 5.200000 1 1.100000 2 1.800000 1 Name: hivrate, dtype: int64 percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV 2.000000 0.013605 0.500000 0.034014 2.500000 0.013605 5.000000 0.006803 1.500000 0.013605 11.000000 0.006803 1.300000 0.013605 1.000000 0.027211 11.500000 0.006803 6.500000 0.006803 13.500000 0.006803 3.600000 0.006803 17.800000 0.006803 3.200000 0.006803 14.300000 0.006803 1.400000 0.006803 0.100000 0.190476 2.300000 0.006803 3.300000 0.006803 1.900000 0.006803 0.700000 0.020408 25.900000 0.006803 5.600000 0.006803 0.200000 0.102041 0.400000 0.061224 0.060000 0.108844 0.800000 0.034014 0.300000 0.068027 3.100000 0.006803 1.200000 0.027211 5.300000 0.006803 24.800000 0.006803 3.400000 0.020408 1.700000 0.006803 23.600000 0.006803 
6.300000 0.006803 0.450000 0.006803 0.600000 0.020408 13.100000 0.006803 4.700000 0.006803 2.900000 0.006803 1.600000 0.006803 0.900000 0.027211 5.200000 0.006803 1.100000 0.013605 1.800000 0.006803 Name: hivrate, dtype: float64 polityscore -10.000000 2 -9.000000 4 -8.000000 2 -7.000000 12 -6.000000 3 -5.000000 2 -4.000000 6 -3.000000 6 -2.000000 5 -1.000000 4 0.000000 6 1.000000 3 2.000000 3 3.000000 2 4.000000 4 5.000000 7 6.000000 10 7.000000 13 8.000000 19 9.000000 15 10.000000 33 dtype: int64 counts for democracy - polity score 6 or greater 9.000000 15 8.000000 19 10.000000 33 7.000000 13 6.000000 10 Name: polityscore, dtype: int64 percentages of democracy - polity score 6 or greater 9.000000 0.166667 8.000000 0.211111 10.000000 0.366667 7.000000 0.144444 6.000000 0.111111 Name: polityscore, dtype: float64 counts of life expectancy in democracies nan 1 79.591000 1 77.005000 1 79.915000 1 48.196000 1 80.642000 1 74.414000 1 79.977000 1 74.847000 1 81.012000 1 79.499000 1 81.855000 1 80.654000 1 80.734000 1 69.317000 1 80.557000 1 79.120000 1 68.498000 1 66.618000 1 75.446000 1 81.097000 1 81.618000 1 79.341000 1 74.825000 1 67.852000 1 62.465000 1 73.737000 1 73.488000 1 74.221000 1 81.907000 1 .. 
73.703000 1 73.339000 1 71.172000 1 81.439000 1 74.044000 1 76.126000 1 68.494000 1 73.371000 1 72.640000 1 80.170000 1 78.531000 1 57.134000 1 51.444000 1 81.539000 1 68.795000 1 74.522000 1 81.404000 1 73.373000 1 73.127000 1 83.394000 1 72.231000 1 54.210000 1 64.228000 1 76.918000 1 68.749000 1 73.126000 1 80.414000 1 65.438000 1 80.854000 1 73.396000 1 Name: lifeexpectancy, Length: 89, dtype: int64 percentages of life expectancy in democracies 79.591000 0.011236 77.005000 0.011236 79.915000 0.011236 48.196000 0.011236 80.642000 0.011236 74.414000 0.011236 79.977000 0.011236 74.847000 0.011236 81.012000 0.011236 79.499000 0.011236 81.855000 0.011236 80.654000 0.011236 80.734000 0.011236 69.317000 0.011236 80.557000 0.011236 79.120000 0.011236 68.498000 0.011236 66.618000 0.011236 75.446000 0.011236 81.097000 0.011236 81.618000 0.011236 79.341000 0.011236 74.825000 0.011236 67.852000 0.011236 62.465000 0.011236 73.737000 0.011236 73.488000 0.011236 74.221000 0.011236 81.907000 0.011236 69.366000 0.011236
73.703000 0.011236 73.339000 0.011236 71.172000 0.011236 81.439000 0.011236 74.044000 0.011236 76.126000 0.011236 68.494000 0.011236 73.371000 0.011236 72.640000 0.011236 80.170000 0.011236 78.531000 0.011236 57.134000 0.011236 51.444000 0.011236 81.539000 0.011236 68.795000 0.011236 74.522000 0.011236 81.404000 0.011236 73.373000 0.011236 73.127000 0.011236 83.394000 0.011236 72.231000 0.011236 54.210000 0.011236 64.228000 0.011236 76.918000 0.011236 68.749000 0.011236 73.126000 0.011236 80.414000 0.011236 65.438000 0.011236 80.854000 0.011236 73.396000 0.011236 Name: lifeexpectancy, Length: 88, dtype: float64 C:/Users/Tofu/Python Project Folder/Week 2_amended.py:20: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:21: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:22: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:23: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:24: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. 
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:25: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric. data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
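The FutureWarnings above come from convert_objects, which pandas has deprecated. The replacement the warning points to is pd.to_numeric; a minimal sketch of the swap on a tiny made-up column (errors='coerce' turns unparseable entries into NaN, like convert_numeric=True did):

```python
import pandas as pd

# Tiny stand-in column with one unparseable entry
data = pd.DataFrame({"polityscore": ["10", "-7", ".."]})

# Modern replacement for convert_objects(convert_numeric=True)
data["polityscore"] = pd.to_numeric(data["polityscore"], errors="coerce")

print(data["polityscore"].tolist())  # the ".." becomes NaN
```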
0 notes
Text
Data Management and Visualization Week 1 -
0 notes