trsmit-blog
CatLadyCave
Journal of my data science skills expansion with a boss kitty in the background
trsmit-blog · 8 years ago
Logistic Regression
To prepare, I created a binary variable, suicidemedian, coded 1 when a country's suicide rate (self-inflicted deaths per 100,000 people) is greater than the median of 8.26 and 0 otherwise. I did this in Excel. The goal was to see what contributes to higher-than-median suicide rates. I first tested polity score groups (autocracy, anocracy, and democracy) to see whether regime type affects this, but all of their estimated coefficients had p-values greater than 0.05, implying no statistically significant association. This didn't change even when employment rate was introduced. So I shifted my focus to urban rate and internet use rate, testing whether either is statistically associated with higher-than-median suicide rates; eventually I settled on urban rate as my primary explanatory variable of interest.
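The median split was done in Excel, but the same recoding can be sketched in a few lines of Python. This is only an illustration: the function name `median_split` and the sample rates are made up for the example, and values equal to the median are coded 0 to match the "greater than median" rule.

```python
from statistics import median

def median_split(rates):
    """Return (cutoff, flags); a flag is 1 when the rate is strictly above the median."""
    cutoff = median(rates)
    flags = [1 if rate > cutoff else 0 for rate in rates]
    return cutoff, flags

# Made-up suicide rates per 100,000 for five hypothetical countries
sample_rates = [4.2, 8.26, 12.9, 6.1, 15.0]
cutoff, flags = median_split(sample_rates)
print(cutoff, flags)
```

With a pandas column the same idea would be something like `(s > s.median()).astype(int)`, with the caveat that missing values need to be handled first, since a missing rate would otherwise be coded 0.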
After adjusting for potentially confounding factors (alcohol consumption, internet use rate, and employment rate), a one percentage point increase in the share of the population living in cities is associated with lower odds of a country having a higher-than-median suicide rate (OR = 0.958, 95% CI = 0.936 to 0.982, p = 0.001). This negative association and its statistical significance hold as new variables are introduced, which suggests that the other variables do not confound this relationship. Internet use rate is a different story: it is significant when modeled with urban rate and employment rate, but loses significance once alcohol consumption is added (p = 0.068), which is evidence of potential confounding. In the final model, neither employment rate nor internet use rate is significantly associated with the odds of a higher-than-median suicide rate. Alcohol consumption, however, is significantly associated with higher odds (OR = 1.17, 95% CI = 1.07 to 1.29, p = 0.001). In other words, each additional liter of alcohol consumed per capita (population aged 15 and older) is associated with 17% higher odds of a country having a higher-than-median suicide rate.
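To make the link between the logit coefficients and the odds ratios explicit, here is a minimal sketch that exponentiates a coefficient and its approximate 95% interval. The helper name `odds_ratio_summary` is hypothetical, and the inputs are the rounded coefficients and standard errors printed in the final model's summary, so the intervals only match the reported ones to about two decimals.

```python
import math

def odds_ratio_summary(coef, std_err, z_crit=1.96):
    """Turn a logit coefficient and its standard error into an odds ratio with a 95% CI."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * std_err)
    upper = math.exp(coef + z_crit * std_err)
    pct_change = (odds_ratio - 1) * 100  # percent change in odds per one-unit increase
    return odds_ratio, lower, upper, pct_change

# Rounded values copied from the final model's summary table
final_model = [("urbanrate", -0.0423, 0.012), ("alcconsumption", 0.1606, 0.048)]
for name, coef, std_err in final_model:
    odds_ratio, lower, upper, pct = odds_ratio_summary(coef, std_err)
    print(f"{name}: OR={odds_ratio:.3f}, 95% CI=({lower:.3f}, {upper:.3f}), odds change={pct:+.1f}%")
```

The percent change is where the "17% higher odds" reading for alcohol consumption comes from: an odds ratio of about 1.17 means the odds are multiplied by 1.17 for each additional liter.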
A next step to make this analysis more rigorous would be to include more explanatory variables.
The output:
Optimization terminated successfully.
         Current function value: 0.688623
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      149
Method:                           MLE   Df Model:                            2
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                0.004537
Time:                        19:47:25   Log-Likelihood:                -104.67
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                    0.6206
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
Intercept                                          0.1866      0.217      0.861      0.389      -0.238       0.611
C(polityscoreg, Treatment(reference=3))[T.1.0]    -0.4743      0.491     -0.965      0.334      -1.437       0.489
C(polityscoreg, Treatment(reference=3))[T.2.0]    -0.0531      0.369     -0.144      0.886      -0.776       0.670
==================================================================================================================
Odds Ratios
Intercept                                         1.205128
C(polityscoreg, Treatment(reference=3))[T.1.0]    0.622340
C(polityscoreg, Treatment(reference=3))[T.2.0]    0.948328
dtype: float64
                                                Lower CI  Upper CI        OR
Intercept                                       0.788241  1.842500  1.205128
C(polityscoreg, Treatment(reference=3))[T.1.0]  0.237599  1.630090  0.622340
C(polityscoreg, Treatment(reference=3))[T.2.0]  0.460062  1.954794  0.948328
Optimization terminated successfully.
         Current function value: 0.679398
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      148
Method:                           MLE   Df Model:                            3
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                 0.01787
Time:                        19:47:25   Log-Likelihood:                -103.27
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                    0.2887
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
Intercept                                         -1.3885      0.976     -1.423      0.155      -3.301       0.524
C(polityscoreg, Treatment(reference=3))[T.1.0]    -0.5399      0.498     -1.085      0.278      -1.515       0.435
C(polityscoreg, Treatment(reference=3))[T.2.0]    -0.1797      0.381     -0.471      0.637      -0.927       0.568
employrate                                         0.0274      0.017      1.653      0.098      -0.005       0.060
==================================================================================================================
                                                Lower CI  Upper CI        OR
Intercept                                       0.036835  1.689407  0.249457
C(polityscoreg, Treatment(reference=3))[T.1.0]  0.219772  1.545505  0.582802
C(polityscoreg, Treatment(reference=3))[T.2.0]  0.395775  1.763889  0.835525
employrate                                      0.994934  1.061672  1.027761
Optimization terminated successfully.
         Current function value: 0.677847
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      150
Method:                           MLE   Df Model:                            1
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                 0.02011
Time:                        19:47:25   Log-Likelihood:                -103.03
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                   0.03972
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9310      0.441      2.111      0.035       0.066       1.796
urbanrate     -0.0149      0.007     -2.027      0.043      -0.029      -0.000
==============================================================================
Odds Ratios
           Lower CI  Upper CI        OR
Intercept  1.068734  6.023096  2.537141
urbanrate  0.971097  0.999509  0.985201
Optimization terminated successfully.
         Current function value: 0.673595
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      148
Method:                           MLE   Df Model:                            3
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                 0.02626
Time:                        19:47:25   Log-Likelihood:                -102.39
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                    0.1373
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
Intercept                                          1.1798      0.526      2.241      0.025       0.148       2.212
C(polityscoreg, Treatment(reference=3))[T.1.0]    -0.4857      0.500     -0.972      0.331      -1.465       0.494
C(polityscoreg, Treatment(reference=3))[T.2.0]    -0.3191      0.397     -0.804      0.422      -1.097       0.459
urbanrate                                         -0.0165      0.008     -2.095      0.036      -0.032      -0.001
==================================================================================================================
Odds Ratios
                                                Lower CI  Upper CI        OR
Intercept                                       1.159579  9.129449  3.253662
C(polityscoreg, Treatment(reference=3))[T.1.0]  0.231042  1.638478  0.615269
C(polityscoreg, Treatment(reference=3))[T.2.0]  0.333776  1.582551  0.726786
urbanrate                                       0.968590  0.998936  0.983646
Optimization terminated successfully.
         Current function value: 0.685060
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      150
Method:                           MLE   Df Model:                            1
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                0.009688
Time:                        19:47:25   Log-Likelihood:                -104.13
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                    0.1535
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -0.1691      0.252     -0.671      0.503      -0.663       0.325
internetuserate     0.0085      0.006      1.415      0.157      -0.003       0.020
===================================================================================
Odds Ratios
                 Lower CI  Upper CI        OR
Intercept        0.515085  1.384324  0.844420
internetuserate  0.996733  1.020495  1.008544
Optimization terminated successfully.
         Current function value: 0.624638
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      149
Method:                           MLE   Df Model:                            2
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                 0.09703
Time:                        19:47:25   Log-Likelihood:                -94.945
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                 3.707e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.4958      0.493      3.035      0.002       0.530       2.462
urbanrate          -0.0460      0.012     -3.953      0.000      -0.069      -0.023
internetuserate     0.0355      0.009      3.747      0.000       0.017       0.054
===================================================================================
Odds Ratios
                 Lower CI   Upper CI        OR
Intercept        1.698807  11.723264  4.462685
urbanrate        0.933483   0.977063  0.955025
internetuserate  1.017099   1.055633  1.036187
Optimization terminated successfully.
         Current function value: 0.621730
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      148
Method:                           MLE   Df Model:                            3
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                  0.1012
Time:                        19:47:25   Log-Likelihood:                -94.503
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                 9.166e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           0.3630      1.299      0.279      0.780      -2.184       2.910
urbanrate          -0.0436      0.012     -3.656      0.000      -0.067      -0.020
internetuserate     0.0356      0.010      3.736      0.000       0.017       0.054
employrate          0.0168      0.018      0.935      0.350      -0.018       0.052
===================================================================================
Odds Ratios
                 Lower CI   Upper CI        OR
Intercept        0.112636  18.350232  1.437672
urbanrate        0.935288   0.980001  0.957384
internetuserate  1.017057   1.055736  1.036216
employrate       0.981744   1.053452  1.016967
Optimization terminated successfully.
         Current function value: 0.579231
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:          suicidemedian   No. Observations:                  152
Model:                          Logit   Df Residuals:                      147
Method:                           MLE   Df Model:                            4
Date:                Sun, 03 Dec 2017   Pseudo R-squ.:                  0.1627
Time:                        19:47:26   Log-Likelihood:                -88.043
converged:                       True   LL-Null:                       -105.15
                                        LLR p-value:                 6.751e-07
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -0.5781      1.399     -0.413      0.679      -3.320       2.164
urbanrate          -0.0423      0.012     -3.423      0.001      -0.066      -0.018
internetuserate     0.0194      0.011      1.824      0.068      -0.001       0.040
employrate          0.0226      0.019      1.162      0.245      -0.016       0.061
alcconsumption      0.1606      0.048      3.312      0.001       0.066       0.256
===================================================================================
Odds Ratios
                 Lower CI  Upper CI        OR
Intercept        0.036148  8.704849  0.560946
urbanrate        0.935710  0.982104  0.958626
internetuserate  0.998556  1.041010  1.019562
employrate       0.984610  1.062516  1.022821
alcconsumption   1.067738  1.291226  1.174176
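The "Pseudo R-squ." and "LLR p-value" entries in these summaries can be reproduced by hand from the two log-likelihoods. The sketch below does this for the final model (Log-Likelihood -88.043, LL-Null -105.15, Df Model 4). The function names are hypothetical, and the chi-square tail uses a closed form that only holds for an even number of degrees of freedom, which is an assumption made to keep the example free of extra libraries.

```python
import math

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null

def llr_statistic(ll_model, ll_null):
    """Likelihood-ratio statistic comparing the fitted model to the intercept-only model."""
    return 2 * (ll_model - ll_null)

def chi2_upper_tail_even_df(x, df):
    """Chi-square upper-tail probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series

# Values copied (rounded) from the final model's summary
ll_model, ll_null, df_model = -88.043, -105.15, 4
pseudo_r2 = mcfadden_r2(ll_model, ll_null)
llr = llr_statistic(ll_model, ll_null)
llr_p_value = chi2_upper_tail_even_df(llr, df_model)
print(f"pseudo R2={pseudo_r2:.4f}, LLR={llr:.3f}, p={llr_p_value:.3e}")
```

Because the log-likelihoods are rounded in the printed summary, the recomputed p-value agrees with the reported one only to within a fraction of a percent.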
The code:
"""
@author: Tofu
"""
import numpy
import pandas
import statsmodels.api as sm
import seaborn
import statsmodels.formula.api as smf

data = pandas.read_csv('gapminder3.csv', low_memory=False)

# convert variables to numeric format
data['suicidemedian'] = pandas.to_numeric(data['suicidemedian'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')

# REPLACING MISSING VALUES (blanks and the 99 code become NaN)
data['internetuserate'] = data['internetuserate'].replace(" ", numpy.nan)
data['suicidemedian'] = data['suicidemedian'].replace("", numpy.nan)
data['alcconsumption'] = data['alcconsumption'].replace(99, numpy.nan)
data['armedforcesrate'] = data['armedforcesrate'].replace(" ", numpy.nan)
data['urbanrate'] = data['urbanrate'].replace(" ", numpy.nan)
data['employrate'] = data['employrate'].replace(" ", numpy.nan)
data['polityscoreg'] = data['polityscoreg'].replace(99, numpy.nan)

# keep only complete cases for the variables used in the models
sub1 = data[['suicidemedian', 'polityscoreg', 'alcconsumption', 'urbanrate', 'employrate', 'internetuserate']].dropna()

#recoding polity score

# logistic regression with polity score
lreg1 = smf.logit(formula = 'suicidemedian ~ C(polityscoreg, Treatment(reference=3))', data = sub1).fit()
print(lreg1.summary())
# odds ratios
print("Odds Ratios")
print(numpy.exp(lreg1.params))

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with polity score and employment rate
lreg2 = smf.logit(formula = 'suicidemedian ~ C(polityscoreg, Treatment(reference=3)) + employrate', data = sub1).fit()
print(lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with urban rate
lreg3 = smf.logit(formula = 'suicidemedian ~ urbanrate', data = sub1).fit()
print(lreg3.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with urban rate and polity score
lreg4 = smf.logit(formula = 'suicidemedian ~ urbanrate + C(polityscoreg, Treatment(reference=3))', data = sub1).fit()
print(lreg4.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg4.params
conf = lreg4.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with internet use rate
lreg5 = smf.logit(formula = 'suicidemedian ~ internetuserate', data = sub1).fit()
print(lreg5.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg5.params
conf = lreg5.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with urban rate and internet use rate
lreg6 = smf.logit(formula = 'suicidemedian ~ urbanrate + internetuserate', data = sub1).fit()
print(lreg6.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg6.params
conf = lreg6.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with urban rate, internet use rate, and employment rate
lreg7 = smf.logit(formula = 'suicidemedian ~ urbanrate + internetuserate + employrate', data = sub1).fit()
print(lreg7.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg7.params
conf = lreg7.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with urban rate, internet use rate, employment rate, and alcohol consumption
lreg8 = smf.logit(formula = 'suicidemedian ~ urbanrate + internetuserate + employrate + alcconsumption', data = sub1).fit()
print(lreg8.summary())

# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg8.params
conf = lreg8.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
trsmit-blog · 8 years ago
Multiple Regression
I ran a multiple regression model for life expectancy that includes polity score, polity score squared, alcohol consumption, and HIV rate. My previous hypothesis was that regimes with a higher polity score (i.e., more democratic) would have higher life expectancy. This did not show up clearly in my previous analysis, where anocracies (-5 to 5) had lower life expectancy than both autocracies and democracies. Looking at a scatter plot of life expectancy versus polity score, the relationship appears more curvilinear than linear, though there is a lot of variation around the fitted lines.
[Figure: scatter plot of life expectancy versus polity score with fitted regression lines]
I presume that a higher HIV rate will be associated with lower life expectancy and that alcohol consumption will also have a negative relationship with life expectancy. I centered the explanatory variables by subtracting their means; their post-centering means were very close to zero.
The model estimated:
Life Expectancy = Intercept + b1*(polity score) + b2*(polity score squared) + b3*(HIV rate) + b4*(alcohol consumption)
Predicted:
Life Expectancy = 62.8 + 1.3*(polity score) + 0.2*(polity score squared) - 1.1*(HIV rate) + 0.2*(alcohol consumption)
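As a worked example of using the fitted equation, here is a minimal sketch that plugs centered values into it. The function name `predict_life_expectancy` and the dictionary keys are hypothetical; the coefficients are the unrounded ones printed in the final OLS summary, and all inputs are assumed to be mean-centered the same way as in the model.

```python
# Coefficients as printed in the final OLS summary (all predictors are mean-centered)
COEFS = {
    "intercept": 62.8458,
    "polity": 1.3346,
    "polity_squared": 0.1594,
    "hiv": -1.1382,
    "alcohol": 0.2065,
}

def predict_life_expectancy(polity_c, hiv_c=0.0, alc_c=0.0):
    """Predicted life expectancy for centered polity score, HIV rate, and alcohol consumption."""
    return (
        COEFS["intercept"]
        + COEFS["polity"] * polity_c
        + COEFS["polity_squared"] * polity_c ** 2
        + COEFS["hiv"] * hiv_c
        + COEFS["alcohol"] * alc_c
    )

# A country at the sample mean on every predictor gets the intercept as its prediction
print(predict_life_expectancy(0.0))
# Five polity points above the mean, average HIV rate and alcohol consumption
print(predict_life_expectancy(5.0))
```

Because of the centering, the intercept is the predicted life expectancy for a country that is average on all three explanatory variables.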
All coefficients except alcohol consumption's were statistically significant; in other words, their p-values of less than 0.05 signal that we can reject the null hypothesis of no association (n=139). A one-point increase in a country's polity score is associated with a 1.3-year increase in life expectancy on average (evaluated at the mean polity score, since the variables are centered); the squared term adds a further 0.2 years per unit of squared polity score, which is what bends the fitted line into a curve. An increase of one person living with HIV per 100 people aged 15 to 49 is associated with a decrease of 1.1 years in life expectancy on average, holding everything else constant.
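With a squared term in the model, the effect of polity score is not a single number: differentiating the fitted equation gives a slope of b1 + 2*b2*x at centered polity score x. The sketch below works this out using the two polity coefficients from the final OLS summary; the function names are hypothetical and the calculation holds the other variables fixed.

```python
B1 = 1.3346  # coefficient on polityscore_c
B2 = 0.1594  # coefficient on I(polityscore_c ** 2)

def marginal_effect(polity_c):
    """Change in predicted life expectancy per one-point change in polity score at polity_c."""
    return B1 + 2 * B2 * polity_c

def turning_point():
    """Centered polity score where the fitted curve is lowest (slope equals zero)."""
    return -B1 / (2 * B2)

for x in (-5.0, 0.0, 5.0):
    print(f"slope at centered polity score {x:+.0f}: {marginal_effect(x):+.2f} years per point")
print(f"fitted curve bottoms out at centered polity score {turning_point():.2f}")
```

Since the squared coefficient is positive, the curve is U-shaped, with its low point a few points below the mean polity score, which fits the earlier observation that anocracies have lower life expectancy than both autocracies and democracies.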
The estimated coefficient on polity score remains positive and significant as additional variables are added, which does not signal a confounding variable. If I were being really rigorous, though, I would try to incorporate a country's democracy and anocracy scores instead of polity score. Still, this exercise shows that there is a positive association. The large amount of variation in the previously mentioned scatter plot does make me concerned about the normality of the residuals, which we need to check.
[Figure: Q-Q plot of the regression residuals]
When I plot a quantile-quantile chart of the above regression's residuals, I see deviations from the line, which signals that the errors are not normally distributed. When we standardize the residuals and plot them, we also see a few extreme outliers, which again signals that this model is a poor fit. Looking at the regression plots for HIV rate, an additional variable, I see inconsistent variation, which can signal heteroscedasticity (the variance of the residuals depends on the value of an explanatory variable).
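The Q-Q plot is a visual check, but the regression summary also prints a numeric one: the Jarque-Bera test, which is built from the skewness and kurtosis of the residuals. As a rough sketch, the statistic and its p-value can be recomputed from the Skew and Kurtosis values printed for the final model. The function name is hypothetical, and the inputs are the rounded figures from the summary, so the result only matches the reported value to about two decimals.

```python
import math

def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic and p-value from sample skewness and (non-excess) kurtosis."""
    statistic = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)
    # Under normality the statistic is approximately chi-square with 2 degrees of freedom,
    # whose upper-tail probability has the closed form exp(-x / 2)
    p_value = math.exp(-statistic / 2)
    return statistic, p_value

# Values from the final OLS summary: No. Observations, Skew, Kurtosis
jb_stat, jb_p = jarque_bera(n=139, skew=-0.409, kurtosis=3.952)
print(f"JB={jb_stat:.3f}, p={jb_p:.4f}")
```

A p-value below 0.05 here points the same way as the Q-Q plot: the residuals depart from a normal distribution.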
[Figures: standardized residuals by observation number; regression diagnostic plots for HIV rate]
A leverage plot shows that there are outliers, though all of those outliers have influence of less than 0.05. The observations that do have high leverage have residuals within 2 standard deviations. Overall, this model suggests that life expectancy has a positive curvilinear association with polity score, but the model needs additional explanatory variables.
The Output:
2.0447272971513674e-15
-2.444088097376244e-16
1.0031943301648897e-15
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.148
Model:                            OLS   Adj. R-squared:                  0.141
Method:                 Least Squares   F-statistic:                     23.71
Date:                Sun, 26 Nov 2017   Prob (F-statistic):           3.05e-06
Time:                       21:19:13   Log-Likelihood:                -509.27
No. Observations:                 139   AIC:                             1023.
Df Residuals:                     137   BIC:                             1028.
Df Model:                           1                                        
Covariance Type:           nonrobust                                        
=================================================================================
                  coef    std err          t     P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       68.4358      0.806     84.860     0.000      66.841      70.031
polityscore_c     0.6632      0.136      4.869     0.000       0.394       0.932
==============================================================================
Omnibus:                        8.855   Durbin-Watson:                   1.335
Prob(Omnibus):                  0.012   Jarque-Bera (JB):                9.015
Skew:                          -0.585   Prob(JB):                       0.0110
Kurtosis:                       2.568   Cond. No.                         5.92
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.394
Model:                            OLS   Adj. R-squared:                  0.385
Method:                 Least Squares   F-statistic:                     44.13
Date:               Sun, 26 Nov 2017   Prob (F-statistic):           1.69e-15
Time:                       21:19:14   Log-Likelihood:                -485.60
No. Observations:                 139   AIC:                             977.2
Df Residuals:                     136   BIC:                             986.0
Df Model:                           2                                        
Covariance Type:           nonrobust                                        
=========================================================================================
                          coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept               61.9192      1.112     55.703     0.000      59.721      64.117
polityscore_c             1.6083      0.172      9.367     0.000       1.269       1.948
I(polityscore_c ** 2)     0.1859      0.025      7.428     0.000       0.136       0.235
==============================================================================
Omnibus:                       13.030   Durbin-Watson:                   1.461
Prob(Omnibus):                 0.001   Jarque-Bera (JB):               13.896
Skew:                          -0.691   Prob(JB):                     0.000961
Kurtosis:                       3.701   Cond. No.                         87.9
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     61.49
Date:               Sun, 26 Nov 2017   Prob (F-statistic):           2.10e-29
Time:                       21:19:14   Log-Likelihood:                -447.93
No. Observations:                 139   AIC:                             905.9
Df Residuals:                     134   BIC:                             920.5
Df Model:                           4                                        
Covariance Type:           nonrobust                                        
=========================================================================================
                          coef    std err          t     P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept               62.8458      0.874     71.904     0.000      61.117      64.575
polityscore_c             1.3346      0.147      9.060     0.000       1.043       1.626
I(polityscore_c ** 2)     0.1594      0.020      7.995     0.000       0.120       0.199
hivrate_c               -1.1382      0.118     -9.632     0.000      -1.372      -0.904
alcconsumption_c         0.2065      0.114      1.806     0.073      -0.020       0.433
==============================================================================
Omnibus:                        8.109   Durbin-Watson:                   1.705
Prob(Omnibus):                  0.017   Jarque-Bera (JB):                9.125
Skew:                          -0.409   Prob(JB):                       0.0104
Kurtosis:                       3.952   Cond. No.                         90.1
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[Figure: leverage (influence) plot]
 The code:
"""
Created on Sun Nov 26 20:06:23 2017

@author: Tofu
"""

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn
 # bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%.2f'%x)

data = pandas.read_csv('gapminder2.csv')
 # convert to numeric format
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'],errors='coerce')
data['alcconsumption']=pandas.to_numeric(data['alcconsumption'], errors='coerce')
 #converting values into N/A
data['lifeexpectancy']=data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscore']=data['polityscore'].replace(11, numpy.nan)
data['alcconsumption']=data['alcconsumption'].replace(99,numpy.nan)
data['hivrate']=data['hivrate'].replace(99,numpy.nan)
 #creating subset
sub1 = data[['lifeexpectancy', 'polityscore', 'alcconsumption', 'hivrate']].dropna()
 # first order (linear) scatterplot
scat1 = seaborn.regplot(x="polityscore", y="lifeexpectancy", scatter=True, data=sub1)
plt.xlabel('Polity Score')
plt.ylabel('Life Expectancy')
 # fit second order polynomial
# run the 2 scatterplots together to get both linear and second order fit lines
scat1 = seaborn.regplot(x="polityscore", y="lifeexpectancy", scatter=True, order=2, data=sub1)
plt.xlabel('Polity Score')
plt.ylabel('Life Expectancy')

# a straight line does not seem to fit this scatter plot; a curvilinear relationship makes more sense
# center the explanatory variables by subtracting their means
sub1['polityscore_c'] = (sub1['polityscore'] - sub1['polityscore'].mean())
print (sub1['polityscore_c'].mean())

sub1['hivrate_c'] = (sub1['hivrate'] - sub1['hivrate'].mean())
print (sub1['hivrate_c'].mean())

sub1['alcconsumption_c'] = (sub1['alcconsumption'] - sub1['alcconsumption'].mean())
print (sub1['alcconsumption_c'].mean())

sub1[["polityscore_c", "hivrate_c","alcconsumption_c"]].describe()
 # linear regression analysis
reg1 = smf.ols('lifeexpectancy ~ polityscore_c', data=sub1).fit()
print (reg1.summary())
 # quadratic (polynomial) regression analysis
 # run following line of code if you get PatsyError 'ImaginaryUnit' object is not callable
reg2 = smf.ols('lifeexpectancy ~ polityscore_c + I(polityscore_c**2)', data=sub1).fit()
print (reg2.summary())

####################################################################################
# EVALUATING MODEL FIT
####################################################################################

# adding other variables
reg3 = smf.ols('lifeexpectancy ~ polityscore_c + I(polityscore_c**2) + hivrate_c + alcconsumption_c', data=sub1).fit()
print (reg3.summary())

# Q-Q plot for normality; plt.show() is needed to display it
fig1=sm.qqplot(reg3.resid, line='r')
plt.show()

# simple plot of residuals
stdres=pandas.DataFrame(reg3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
# there are some extreme outliers, which is evidence that the model fits poorly

# additional regression diagnostic plots
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(reg3, 'hivrate_c', fig=fig2)
plt.show()

# leverage plot
fig3=sm.graphics.influence_plot(reg3, size=8)
plt.show()
trsmit-blog · 8 years ago
Basic Regression Time
For this exercise, I decided to test whether the presence of residential electricity data is associated with higher life expectancy. In an earlier exercise, I created this variable, coded 0 for no residential electricity data and 1 for the presence of such data. I would expect that if there is enough residential electricity to track, that is an indicator of infrastructure that could improve quality of life.
When I did a basic regression, predicted life expectancy was estimated as:
Predicted life expectancy = 65.5 + 6.3*relectric
This implies that the presence of residential electricity data is associated with a 6.3-year increase, on average, in life expectancy. The coefficient has a p-value less than 0.001 (t=4.36, n=191), which suggests that the presence of residential electricity data has a statistically significant, positive association with life expectancy. The model has an F-statistic of 19.01 (n=191, p-value<0.001); however, the R-squared was only 0.091. In other words, the explanatory variable explains only about 9% of the variation in life expectancy.
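As a quick sanity check, the fitted equation can be evaluated by hand. This is only a sketch that uses the rounded coefficients printed in the regression output; the variable names are my own. The two predictions should line up with the group means of life expectancy printed further down, and with a single explanatory variable the F-statistic should equal the square of the t-statistic.

```python
# Coefficients copied from the OLS summary (rounded as printed)
intercept = 65.4802
slope = 6.2786
t_stat = 4.360

# Predicted life expectancy for each value of the binary predictor
pred_no_data = intercept + slope * 0
pred_with_data = intercept + slope * 1

# With one predictor, the model F-statistic equals t squared
f_from_t = t_stat ** 2

print(round(pred_no_data, 2), round(pred_with_data, 2), round(f_from_t, 2))
```

Because the only predictor is binary, the regression is just comparing two group means, which is why the predictions reproduce them.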
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.091
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                     19.01
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           2.13e-05
Time:                        20:56:50   Log-Likelihood:                -695.51
No. Observations:                 191   AIC:                             1395.
Df Residuals:                     189   BIC:                             1402.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     65.4802      1.188     55.117      0.000      63.137      67.824
relectric      6.2786      1.440      4.360      0.000       3.438       9.119
==============================================================================
Omnibus:                       16.588   Durbin-Watson:                   1.456
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.275
Skew:                          -0.729   Prob(JB):                     0.000108
Kurtosis:                       2.586   Cond. No.                         3.30
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Mean
           alcconsumption  co2emissions    hivrate  lifeexpectancy  \
relectric                                                             
0               26.499836  1.667740e+08  45.900984       65.480164   
1                7.986538  7.619242e+09  14.158077       71.758715   

           polityscore  relectricperperson  urbanrate  polityscoreg  \
relectric                                                             
0             5.606557           -0.934426  43.392069     41.950820   
1             4.484615         1213.235473  61.527846      6.892308   

           urbanratem  
relectric              
0            0.275862  
1            0.584615  
Standard deviation
           alcconsumption  co2emissions    hivrate  lifeexpectancy  \
relectric                                                             
0               40.040633  8.069170e+08  47.922677       10.567628   
1                9.564578  3.162933e+10  33.203256        8.613916   

           polityscore  relectricperperson  urbanrate  polityscoreg  \
relectric                                                             
0             6.159758            0.249590  23.351088     47.937955   
1             6.478629         1703.060442  21.044561     20.352798   

           urbanratem  
relectric              
0            0.450851  
1            0.494695  
Traceback (most recent call last):
  File "<ipython-input-5-f58e05b93ef4>", line 1, in <module>
    runfile('C:/Users/Tofu/Python Project Folder/Basic Regression HW.py', wdir='C:/Users/Tofu/Python Project Folder')
  File "C:\Users\Tofu\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Users\Tofu\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Tofu/Python Project Folder/Basic Regression HW.py", line 41, in <module>
    plt.xlabel('Presence of Residential Electricity Data')
NameError: name 'plt' is not defined
the code:
"""
@author: Tofu
"""
import numpy
import pandas
import matplotlib.pyplot as plt  # missing in the original run, which caused the NameError above
import statsmodels.api
import statsmodels.formula.api as smf
import seaborn

data = pandas.read_csv('gapminder2.csv', low_memory=False)

# convert to numeric format
data['urbanratem'] = pandas.to_numeric(data['urbanratem'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')

# keep countries with a recorded life expectancy
sub1=data[(data['lifeexpectancy']>0)]

# recode missing-value placeholders (applied to data after sub1 was created, so sub1 is unaffected)
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscoreg']=data['polityscoreg'].replace(99, numpy.nan)
data['urbanratem']=data['urbanratem'].replace(' ',numpy.nan)

# impact of residential electricity
reg1 = smf.ols('lifeexpectancy ~ relectric', data=sub1).fit()
print (reg1.summary())

# group means & sd
print ("Mean")
ds1 = sub1.groupby('relectric').mean()
print (ds1)
print ("Standard deviation")
ds2 = sub1.groupby('relectric').std()
print (ds2)

# bivariate bar graph
scat1 = seaborn.factorplot(x="relectric", y="lifeexpectancy", data=sub1, kind="bar", ci=None)
plt.xlabel('Presence of Residential Electricity Data')
plt.ylabel('Mean Life Expectancy')
print (scat1)
trsmit-blog · 8 years ago
Writing about the Data - Gapminder
Sample:
The Gapminder dataset covers multiple variables for 213 countries across the globe. The explanatory variable in my analysis is the polity score, which helps classify a country’s political regime on a 21-point scale, from -10 to 10, denoting relative levels of democracy and autocracy. The countries were broken down into democracies (polity score over 5, n=90, 42.3%), anocracies (polity score from -5 to 5, n=48, 22.5%), autocracies (polity score less than -5, n=23, 10.8%), and unclassified countries (no polity score, n=52). Life expectancy is the dependent variable in my analysis; 22 countries had no life expectancy value, though most of those were also without a polity score.
Procedures:
Polity score data are collected by researchers who assess how democratic and how autocratic each country’s regime is. The factors assessed and weighted into the democracy index are how effectively citizens can express preferences about alternative policies and leaders, the existing institutional constraints on the executive’s exercise of power, and the guarantee of civil liberties to all citizens in everyday life and political participation, on an 11-point scale (0 to 10). The autocracy score weighs how far competitive political participation is restricted or suppressed, how political elites regularize the selection of the chief executive, and what institutional constraints exist on the exercise of executive power, on an 11-point scale (-10 to 0). Since the 2000s, polity scores have undergone regular annual updates and tests for variance between assessors. The polity score used here reflects the 2009 assessment.
Life expectancy sources include the Human Mortality Database, World Population Prospects, the Human Life-Table Database, and the research of James C. Riley. While life expectancy estimates draw on various sources across countries, these sources try to collect the following for each country:
- births (annual counts of live births by sex),
- deaths (counts with most details possible; if no data, estimated death counts by completed age),
- population size as of January 1st (if not, then derived from census data, births, and death),
- exposure to risk (estimates of the population exposed to the risk of death during some age-time interval for populations measured on January 1st),
- death rates: ratio of the death count for a given age-time interval divided by an estimate of the exposure-to-risk in the same interval.
The citations at the bottom have more information for those curious.
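The last item in the list is simple arithmetic. Here is a minimal sketch with made-up numbers (not taken from any of the sources) and a hypothetical function name, assuming exposure is measured in person-years:

```python
def death_rate(death_count, exposure_to_risk):
    """Deaths in an age-time interval divided by the exposure-to-risk in that interval."""
    if exposure_to_risk <= 0:
        raise ValueError("exposure_to_risk must be positive")
    return death_count / exposure_to_risk

# Illustrative values only: 120 deaths over 10,000 person-years of exposure
rate = death_rate(120, 10_000)
print(rate)
```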
Measures:
The life expectancy values in the Gapminder dataset can be read as the expected number of years a child born in 2011 would live, assuming current mortality patterns. The exact source of the data is not one hundred percent clear, but the general takeaway is that they try to use the available country census data. Polity score combines the democracy and autocracy scores, ranging from -10 (fully autocratic) to 10 (fully democratic), using observed data on a country’s regime.
Life expectancy is a continuous variable, while polity score is treated as a categorical variable once grouped; both are surveillance data that incorporate multiple data sources into their respective variables. Both variables can be accessed by downloading the Gapminder dataset (www.gapminder.org). For my analysis, I recoded polity score into either two or three groups. With three groups, it was broken into autocracies (-10 to -6), anocracies (-5 to 5), and democracies (6 to 10). With two groups, it was just democracies (6 to 10) and non-democracies (-10 to 5).
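The three-group and two-group recodes could also be done in pandas rather than Excel. This is a sketch only: the toy scores and the label names are my own choices, not the real data. One thing worth knowing is that pandas.cut bins are right-closed by default, so without include_lowest=True a score of exactly -10 falls outside the first bin.

```python
import pandas as pd

# Toy polity scores covering every bin edge (not real country data)
scores = pd.Series([-10, -6, -5, 0, 5, 6, 10])

# Bins: [-10, -6], (-6, 5], (5, 10]
groups = pd.cut(
    scores,
    bins=[-10, -6, 5, 10],
    labels=["autocracy", "anocracy", "democracy"],
    include_lowest=True,  # keep scores of exactly -10
)

# Two-group version: democracy (6 to 10) versus everything else
is_democracy = scores >= 6

print(groups.tolist())
print(is_democracy.tolist())
```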
 Sources:
Center for Systemic Peace. “Polity IV Projects: Political Regime Characteristics and Transitions, 1800 – 2016 Dataset Users’ Manual.” http://www.systemicpeace.org/inscr/p4manualv2016.pdf
The Human Mortality Database. “Overview”. http://www.mortality.org/Public/Overview.php
World Population Prospects. “Data Sources.pdf”. https://esa.un.org/unpd/wpp/Download/Other/Documentation/
The Human Life-Table Database, Shkolnikov, V.M. “Methodology Notes on the Human Life-Table Database (HLD)”. http://www.lifetable.de/methodology.pdf
trsmit-blog · 8 years ago
Urban Rate Moderator
For this assignment, I test whether urban rate (at or above versus below the median of 58%) moderates the relationship between life expectancy and polity score group. [I created two variables, urbanratem and polityscoreg, in Excel.]
As we saw in the first week of this course, there is a significant relationship between life expectancy and polity score grouping (autocracy, anocracy, and democracy). However, the difference was driven by anocracies (scores between -5 and 5) compared with the other groupings.
The F-statistic for the group with median or higher urban rate (58% or above) was 4.012 (df=2, 77 obs, p-value=0.022), which indicates a significant relationship between polity score group and life expectancy. We also find this in the countries with lower than median urban rate, where the F-statistic is 7.995 (df=2, 83 obs, p-value=0.00068).
Average life expectancy follows the same trend regardless of urban rate group (i.e. anocracies have the lowest life expectancy on average), which suggests that urban rate is not a moderator of the relationship between life expectancy and polity score.
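The "same trend in both groups" claim can be checked directly from the group means printed in the results. A small sketch, with the life expectancy means copied from the output and dictionary names of my own choosing (codes 1, 2, 3 are autocracy, anocracy, democracy):

```python
# Mean life expectancy by polity score group, copied from the printed results
means = {
    "median or higher urban rate": {1: 74.321364, 2: 70.194500, 3: 75.935611},
    "below median urban rate": {1: 67.554833, 2: 58.593778, 3: 65.475914},
}
labels = {1: "autocracy", 2: "anocracy", 3: "democracy"}

# Find which polity group has the lowest mean within each urban rate group
lowest = {
    urban_group: labels[min(group_means, key=group_means.get)]
    for urban_group, group_means in means.items()
}

for urban_group, polity_group in lowest.items():
    print(urban_group, "->", polity_group)
```

This only compares means; it is not a formal test of moderation, which would need the subgroup regressions above or an interaction term.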
The Code:
"""
@author: Tofu
"""

import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder2.csv', low_memory=False)

# convert to numeric format
data['urbanratem'] = pandas.to_numeric(data['urbanratem'], errors='coerce')
data['polityscoreg'] = pandas.to_numeric(data['polityscoreg'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')

# recode missing-value placeholders as NaN
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)
data['polityscoreg']=data['polityscoreg'].replace(99, numpy.nan)
data['urbanratem']=data['urbanratem'].replace(' ',numpy.nan)

# overall model, all countries
model1 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=data).fit()
print (model1.summary())

sub1 = data[['polityscoreg', 'lifeexpectancy','urbanrate']].dropna()

print ("means for life expectancy by polity score group")
m1= sub1.groupby('polityscoreg').mean()
print (m1)

print ("standard deviation for mean life expectancy by polity score group")
st1= sub1.groupby('polityscoreg').std()
print (st1)

# bivariate bar graph
seaborn.factorplot(x="polityscoreg", y="lifeexpectancy", data=data, kind="bar", ci=None)
plt.xlabel('Polity Score')
plt.ylabel('Mean Life Expectancy')

# split by the moderator: 1 = median or higher urban rate, 0 = below median
sub2=data[(data['urbanratem']==1)]
sub3=data[(data['urbanratem']==0)]

print ('association between life expectancy and polity score for those in median or higher urban rates')
model2 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=sub2).fit()
print (model2.summary())

print ('association between life expectancy and polity score for those with less than median urban rates')
model3 = smf.ols(formula='lifeexpectancy ~ C(polityscoreg)', data=sub3).fit()
print (model3.summary())

print ("means for life expectancy by polityscore for median or higher urban rates")
m3= sub2.groupby('polityscoreg').mean()
print (m3)

print ("Means for life expectancy by polityscore for less than median urban rates")
m4 = sub3.groupby('polityscoreg').mean()
print (m4)
 -------The Results
runfile('C:/Users/Tofu/Python Project Folder/Week 8.py', wdir='C:/Users/Tofu/Python Project Folder')
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.215
Model:                            OLS   Adj. R-squared:                  0.205
Method:                 Least Squares   F-statistic:                     21.53
Date:                Sun, 15 Oct 2017   Prob (F-statistic):           5.45e-09
Time:                       18:38:43   Log-Likelihood:                -576.38
No. Observations:                 160   AIC:                             1159.
Df Residuals:                     157   BIC:                             1168.
Df Model:                           2                                        
Covariance Type:           nonrobust                                        
==========================================================================================
                            coef    std err          t     P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 70.7910      1.869     37.885     0.000      67.100      74.482
C(polityscoreg)[T.2.0]   -9.2970      2.273     -4.091     0.000     -13.786      -4.808
C(polityscoreg)[T.3.0]     1.0312      2.096      0.492     0.623      -3.109       5.172
==============================================================================
Omnibus:                       14.930   Durbin-Watson:                   1.307
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.562
Skew:                          -0.781   Prob(JB):                     0.000253
Kurtosis:                       3.210   Cond. No.                         5.70
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for life expectancy by polity score group
            lifeexpectancy  urbanrate
polityscoreg                          
1.0               70.791000  59.036522
2.0               61.493958  44.733750
3.0               71.822247  59.447416
standard deviation for mean life expectancy by polity score group
            lifeexpectancy  urbanrate
polityscoreg                          
1.0                 6.585543  23.141218
2.0                 8.988540  21.779372
3.0                 9.448791  21.267754
association between life expectancy and polity score for those in median or higher urban rates
                          OLS Regression Results                            
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.098
Model:                            OLS   Adj. R-squared:                  0.073
Method:                 Least Squares   F-statistic:                     4.012
Date:               Sun, 15 Oct 2017   Prob (F-statistic):             0.0222
Time:                       18:38:43   Log-Likelihood:                -250.43
No. Observations:                  77   AIC:                             506.9
Df Residuals:                      74   BIC:                             513.9
Df Model:                           2                                        
Covariance Type:           nonrobust                                        
==========================================================================================
                            coef    std err          t     P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 74.3214      1.924     38.631     0.000      70.488      78.155
C(polityscoreg)[T.2.0]   -4.1269      2.664     -1.549     0.126      -9.434       1.180
C(polityscoreg)[T.3.0]     1.6142      2.111      0.765     0.447      -2.592       5.820
==============================================================================
Omnibus:                       37.388   Durbin-Watson:                   1.635
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               82.561
Skew:                          -1.728   Prob(JB):                     1.18e-18
Kurtosis:                       6.714   Cond. No.                         6.15
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between life expectancy and polity score for those with less than median urban rates
                          OLS Regression Results                          
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.167
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     7.995
Date:               Sun, 15 Oct 2017   Prob (F-statistic):           0.000683
Time:                       18:38:43   Log-Likelihood:                -294.00
No. Observations:                 83   AIC:                             594.0
Df Residuals:                      80   BIC:                             601.3
Df Model:                           2                                        
Covariance Type:           nonrobust                                        
==========================================================================================
                            coef    std err          t     P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 67.5548      2.458     27.489     0.000      62.664      72.445
C(polityscoreg)[T.2.0]   -8.9611      2.838     -3.158     0.002     -14.608      -3.314
C(polityscoreg)[T.3.0]    -2.0789     2.848     -0.730      0.468     -7.746       3.588
==============================================================================
Omnibus:                        4.296   Durbin-Watson:                   1.741
Prob(Omnibus):                0.117   Jarque-Bera (JB):                3.341
Skew:                          -0.366   Prob(JB):                        0.188
Kurtosis:                       2.344   Cond. No.                         5.56
==============================================================================
 Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for life expectancy by polityscore for median or higher urban rates
            alcconsumption  co2emissions    hivrate lifeexpectancy  \
polityscoreg                                                          
1.0               11.950909  2.952225e+09  54.069091       74.321364  
2.0                 5.592500  3.133595e+09  25.821667       70.194500  
3.0               11.589444  1.394451e+10   6.609444       75.935611  
              polityscore relectricperperson  urbanrate  relectric \
polityscoreg                                                        
1.0             -7.909091         3626.599374  79.294545   0.909091  
2.0             0.416667          533.889590  75.501667   0.833333  
3.0             8.981481         1505.305644  73.417407   0.944444  
              urbanratem  
polityscoreg              
1.0                 1.0  
2.0                 1.0  
3.0                 1.0  
Means for life expectancy by polityscore for less than median urban rates
            alcconsumption  co2emissions    hivrate lifeexpectancy  \
polityscoreg                                                            
1.0                 4.959167  9.202947e+09  18.866667       67.554833  
2.0                 4.091111  4.410138e+08  13.361667       58.593778  
3.0                 6.211429  1.378032e+09  11.028571     65.475914  
              polityscore relectricperperson  urbanrate  relectric \
polityscoreg                                                        
1.0             -7.250000          311.065276  40.466667   0.833333  
2.0             -0.138889           87.447582  34.477778   0.555556  
3.0             7.800000          336.195945  37.893714   0.657143  
              urbanratem  
polityscoreg            
1.0                 0.0  
2.0                 0.0  
3.0                  0.0 
trsmit-blog · 8 years ago
Pearson Correlation Test
[Scatterplot for the Association Between Urban Rate and Life Expectancy]
I decided to look at the association between urban rate and life expectancy. My code and results are below, and the scatter plot is above. The scatter plot shows a generally positive linear relationship between life expectancy and urban rate (excluding missing values on each variable). However, there is a lot of variance around the fitted line. The Pearson correlation yields an r value of 0.619 between urban rate and life expectancy, with a p-value of less than 0.001. This suggests a statistically significant, positive linear relationship between urban rate and life expectancy. Squaring the r value gives an r-squared value of 0.38; that means that if we know the urban rate, we can predict 38% of the variability we’ll see in life expectancy. The other 62% of variability is unaccounted for.
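The r-squared arithmetic is easy to reproduce. A tiny sketch, using the r value exactly as printed in the results (the variable names are mine):

```python
# Pearson r as printed by scipy.stats.pearsonr in the results
r = 0.61870710462448963

# Share of variability in life expectancy accounted for by urban rate
r_squared = r ** 2
unexplained = 1 - r_squared

print(round(r_squared, 2), round(unexplained, 2))
```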
 The coding and results:
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder1.csv', low_memory=False)

# new code setting variables you will be working with to numeric
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectricperperson']= pandas.to_numeric(data['relectricperperson'],errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')
data['urbanrate'] = pandas.to_numeric(data['urbanrate'], errors='coerce')

data['urbanrate']=data['urbanrate'].replace(' ', numpy.nan)
data['lifeexpectancy']=data['lifeexpectancy'].replace(0, numpy.nan)

scat1 = seaborn.regplot(x="urbanrate", y="lifeexpectancy", fit_reg=True, data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Scatterplot for the Association Between Urban Rate and Life Expectancy')

# note: dropna() with no arguments drops every row with a missing value in any column, not just these two
data_clean=data.dropna()

print ('association between urbanrate and life expectancy')
print (scipy.stats.pearsonr(data_clean['urbanrate'], data_clean['lifeexpectancy']))
------------------
association between urbanrate and life expectancy
(0.61870710462448963, 3.028137554378938e-21)
trsmit-blog · 8 years ago
Chi Squared Test of Independence
When examining the relationship between polity score group (categorical explanatory variable) and the presence of residential electricity consumption data (categorical response variable), a chi-square test of independence revealed that, among countries with polity scores (my sample), those considered democratic were NOT significantly more likely to have residential electricity data (83%) than countries not categorized as democratic (70%), X2=3.48, 1 df, p-value=0.06.
I did not conduct post hoc comparisons. It should be noted, though, that in round 2 I attempted to look at groupings of autocracy (-6 or less), anocracy (-5 to 5), and democracy (6 to 10) and found a statistically significant association between polity score group and the presence of residential electricity data [X2=9.17, 2 df, p-value=0.01]. However, my code did not work well enough to allow post hoc comparisons. I tried a workaround by creating “polityscoreg” in Excel with the following logic:
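The two-group test can be reproduced by hand from the observed counts in the results below. This is a sketch using only the standard library, with variable names of my own; it applies Yates' continuity correction, which scipy.stats.chi2_contingency uses by default for a 2x2 table, and the closed-form tail probability that holds for 1 degree of freedom.

```python
import math

# Observed counts from the contingency table in the results
# rows: relectric 0 / 1; columns: polity score (-10, 5] / (5, 10]
observed = [[21, 15], [48, 75]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        # Yates' continuity correction for a 2x2 table
        chi2 += (abs(count - expected) - 0.5) ** 2 / expected

# Chi-square tail probability for 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

# Column percentages: share of each polity group with electricity data
share_with_data = [observed[1][j] / col_totals[j] for j in range(2)]

print(round(chi2, 2), round(p_value, 2), [round(s, 2) for s in share_with_data])
```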
If polity score is equal to or less than -6, code 1 (autocracy); if not,
If polity score is equal to or less than 5, code 2 (anocracy); if not,
If polity score is equal to or less than 10, code 3 (democracy)
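The same nested logic could be written in Python instead of Excel. A sketch only: the function name is hypothetical, the first threshold is read as -6, and treating anything above 10 as missing is my assumption.

```python
def polity_group(score):
    """Return 1 (autocracy), 2 (anocracy) or 3 (democracy) for a polity score."""
    if score <= -6:
        return 1
    if score <= 5:
        return 2
    if score <= 10:
        return 3
    return None  # assumption: values above 10 are missing-data codes

# Check the boundaries of each group
codes = [polity_group(s) for s in (-10, -6, -5, 5, 6, 10, 11)]
print(codes)
```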
However, something went wrong with my coding, and I didn’t try too hard to fix it because I was more focused on democracy versus non-democracy for this exercise. The error was:
[screenshot of the error message]
Using what I know, I would expect the anocracies group to be the source of the significant difference, since I found in my previous post that anocracies have a statistically significantly different life expectancy from both autocracies and democracies. I will try again another day.
the code (week 6, NOT week6 alt view):
"""
Created on Sun Oct  1 17:19:04 2017

@author: Tofu
"""
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder1.csv', low_memory=False)

# new code setting variables you will be working with to numeric
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['relectric'] = pandas.to_numeric(data['relectric'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['relectricperperson']= pandas.to_numeric(data['relectricperperson'],errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')

#exploring life expectancy
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)

#making subset of complete data
sub7= data[(data["polityscore"]<11)]

sub8= sub7.copy()

#dividing data set into groups
# note: bins are right-closed, so a score of exactly -10 falls outside (-10, 5]; include_lowest=True would keep it
sub8["polityscore"] = pandas.cut(sub8.polityscore, [-10, 5, 10])

#contingency table of observed counts
ct1=pandas.crosstab(sub8['relectric'], sub8['polityscore'])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

#chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

# set variable types
sub8['polityscore']=sub8['polityscore'].astype('category')
# converting to numeric
sub8['relectric'] = pandas.to_numeric(sub8['relectric'], errors='coerce')

# graph percent with residential electricity data by polity score group
seaborn.factorplot(x='polityscore', y='relectric', data=sub8, kind='bar', ci=None)
plt.xlabel('polity score group')
plt.ylabel('proportion with residential electricity')
---------------------------the results--------------------------------
runfile('C:/Users/Tofu/Python Project Folder/Week 6.py', wdir='C:/Users/Tofu/Python Project Folder')
polityscore  (-10, 5]  (5, 10]
relectric                     
0                  21       15
1                  48       75
polityscore  (-10, 5]   (5, 10]
relectric                      
0            0.304348  0.166667
1            0.695652  0.833333
chi-square value, p value, expected counts
(3.4774549801657448, 0.062210347893879261, 1, array([[ 15.62264151,  20.37735849],
       [ 53.37735849,  69.62264151]]))
trsmit-blog · 8 years ago
Testing Life Expectancy and Government Type
Finally testing out significance! Woot!
The code:
# -*- coding: utf-8 -*-
"""
Created on Sun Sep 24 17:12:04 2017

@author: Tofu
"""
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('gapminder1.csv', low_memory=False)

#converting variables to numeric values
# (pandas.to_numeric replaces the deprecated convert_objects used originally)
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['co2emissions'] = pandas.to_numeric(data['co2emissions'], errors='coerce')
data['relectricperperson'] = pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')

#exploring life expectancy
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)

#making subset of complete data
sub5= data[(data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]

sub6= sub5.copy()

#dividing data set into groups
# note: bins are right-closed, so a score of exactly -10 falls outside (-10, -6]; include_lowest=True would keep it
sub6["polityscore"] = pandas.cut(sub6.polityscore, [-10, -6, 5, 10])

sub6["polityscore"]=sub6["polityscore"].astype("category")

#setting up the model if it was binary category
#sub2 = sub6[["polityscore", "lifeexpectancy"]].dropna()
#model1 = smf.ols(formula='lifeexpectancy ~ C(polityscore)',data=sub5)
#results1= model1.fit()
#print (results1.summary())

#checking the means if it was binary category
#print ("means of lifeexpectancy by polity score group")
#m1=sub2.groupby("polityscore").mean()
#print (m1)

#checking the standard deviation if it was a binary category
#print ("standard dev of lifeexpectancy by polity score group")
#s1=sub2.groupby("polityscore").std()
#print (s1)

sub3 = sub6[["polityscore", "lifeexpectancy"]].dropna()

model2= smf.ols(formula= 'lifeexpectancy ~ C(polityscore)', data=sub3).fit()
print (model2.summary())

#running means by polity score
print ("means of lifeexpectancy by polity score groups")
m2=sub3.groupby('polityscore').mean()
print (m2)

print ("standard devs of lifeexpectancy by polity score groups")
s2=sub3.groupby('polityscore').std()
print (s2)

#running post hoc test
mc1= multi.MultiComparison(sub3['lifeexpectancy'], sub3['polityscore'])
res1 =mc1.tukeyhsd()
print (res1.summary())
The results:
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.214
Model:                            OLS   Adj. R-squared:                  0.203
Method:                 Least Squares   F-statistic:                     21.05
Date:                Sun, 24 Sep 2017   Prob (F-statistic):           8.17e-09
Time:                        19:35:48   Log-Likelihood:                -569.72
No. Observations:                 158   AIC:                             1145.
Df Residuals:                     155   BIC:                             1155.
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================================================
                                                        coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
Intercept                                            70.2815      1.962     35.814      0.000      66.405      74.158
C(polityscore)[T.Interval(-6, 5, closed='right')]    -8.7875      2.353     -3.735      0.000     -13.435      -4.140
C(polityscore)[T.Interval(5, 10, closed='right')]     1.5408      2.182      0.706      0.481      -2.769       5.850
==============================================================================
Omnibus:                       14.489   Durbin-Watson:                   0.989
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.036
Skew:                          -0.775   Prob(JB):                     0.000330
Kurtosis:                       3.183   Cond. No.                         5.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means of lifeexpectancy by polity score groups
             lifeexpectancy
polityscore
(-10, -6]         70.281476
(-6, 5]           61.493958
(5, 10]           71.822247
standard devs of lifeexpectancy by polity score groups
             lifeexpectancy
polityscore
(-10, -6]          6.638839
(-6, 5]            8.988540
(5, 10]            9.448791
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==================================================
 group1   group2 meandiff  lower    upper  reject
--------------------------------------------------
(-10, -6] (-6, 5] -8.7875  -14.3559 -3.2192  True
(-10, -6] (5, 10]  1.5408  -3.6225   6.704  False
(-6, 5]  (5, 10] 10.3283   6.517   14.1396  True
--------------------------------------------------
Model Interpretation for ANOVA:
When examining the association between life expectancy (the quantitative response) and polity score group (the categorical explanatory variable), an Analysis of Variance (ANOVA) revealed that average life expectancy differs significantly among the types of government in the Gapminder data set. Autocracies have a mean life expectancy of 70.3 years (s.d. ± 6.6 years), anocracies 61.5 years (s.d. ± 9.0 years), and democracies 71.8 years (s.d. ± 9.4 years). The ANOVA yields an F-statistic of 21.05 with 2 and 155 degrees of freedom (158 observations), which has a p-value of 0.00000000817 (8.17e-09 in the table).
Note that the degrees of freedom that I report in parentheses following ‘F’ can be found in the OLS table as Df Model and Df Residuals.
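To show how the F-statistic and its two degrees of freedom fit together, here is a minimal sketch that rebuilds F from the group means and standard deviations printed above. The group sizes (21, 48, 89) are my assumption: they are not printed for the model, but they add up to the 158 observations and are consistent with the coefficient standard errors. The variable names are only for illustration.

```python
# (mean, standard deviation, n) per polity score group, from the output above.
# The n values are assumed (not printed for the model); they sum to 158.
groups = {
    "autocracy (-10, -6]": (70.281476, 6.638839, 21),
    "anocracy (-6, 5]": (61.493958, 8.988540, 48),
    "democracy (5, 10]": (71.822247, 9.448791, 89),
}

n_total = sum(n for _, _, n in groups.values())
grand_mean = sum(mean * n for mean, _, n in groups.values()) / n_total

# Between-group and within-group sums of squares
ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups.values())
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())

df_model = len(groups) - 1        # "Df Model" in the OLS table
df_resid = n_total - len(groups)  # "Df Residuals" in the OLS table

f_stat = (ss_between / df_model) / (ss_within / df_resid)
print(f"F({df_model}, {df_resid}) = {f_stat:.2f}")
```

Under those assumed group sizes this prints F(2, 155) = 21.05, matching the table.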
I also ran an alternative model that compares just democracies with non-democracies, and again found evidence to reject the null hypothesis that there is no association between democracy and life expectancy. Non-democracies have a mean life expectancy of 64.2 years (s.d. ± 9.2 years), while democracies have a mean life expectancy of 71.8 years (s.d. ± 9.4 years). The resulting F-statistic was F(1, 156) = 26, with a p-value of 0.000000098. [These results aren’t shown, but I can provide the code and results.]
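The non-democracy figures can be cross-checked from the three-group output without rerunning anything. This is a small sketch that pools the autocracy and anocracy groups, assuming group sizes of 21 and 48 (taken from the group counts, not stated for this model); the names are hypothetical.

```python
import math

# (mean, standard deviation, n) for the two non-democratic groups.
# The n values are assumed from the group counts.
autocracy = (70.281476, 6.638839, 21)
anocracy = (61.493958, 8.988540, 48)

pooled = [autocracy, anocracy]
n_total = sum(n for _, _, n in pooled)
pooled_mean = sum(mean * n for mean, _, n in pooled) / n_total

# Total sum of squares = spread inside each group + spread between group means
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in pooled)
ss_between = sum(n * (mean - pooled_mean) ** 2 for mean, _, n in pooled)
pooled_sd = math.sqrt((ss_within + ss_between) / (n_total - 1))

print(round(pooled_mean, 1), round(pooled_sd, 1))
```

This gives roughly 64.2 and 9.2, in line with the mean and standard deviation reported for non-democracies.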
Model Interpretation for post hoc ANOVA results:
ANOVA revealed that among countries in the Gapminder dataset, government type (grouped as autocracies, anocracies, or democracies; the categorical explanatory variable) and life expectancy (the quantitative response variable) were significantly associated, F(2, 155) = 21.05, p = 0.00000000817. Post hoc comparisons (Tukey HSD) of mean life expectancy by polity score group reveal that anocracies’ average life expectancy differs significantly from that of both autocracies and democracies. However, there is not sufficient evidence that life expectancy differs between autocracies and democracies.
trsmit-blog · 8 years ago
Visualize This
This week’s homework went well overall, apart from not being able to produce a histogram for life expectancy in Python. From my previous review, I know that the distribution has little to no mode, so not visualizing it here does not hold me back.
The code:
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder1.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

#bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' %x)

#number of observations and variables
print(len(data))
print(len(data.columns))

#converting variables to numeric values (entries that cannot be parsed become NaN)
for col in ['polityscore', 'lifeexpectancy', 'alcconsumption', 'co2emissions', 'relectricperperson', 'hivrate']:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

print ("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print (c1)

print ("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p1)

#exploring life expectancy: 0 is my missing-value code, so recode it to NaN
data["lifeexpectancy"]=data["lifeexpectancy"].replace(0, numpy.nan)

desc2=data["lifeexpectancy"].describe()
print (desc2)

#making a subset of lifeexpectancy to exclude missing values
sub3=data[data["lifeexpectancy"].notnull()]

sub4=sub3.copy()

#Univariate histogram for quantitative variable:
sub4["polityscore"] = pandas.to_numeric(sub4["polityscore"], errors='coerce')

#start a new figure for each plot so they do not draw over each other
plt.figure()
seaborn.histplot(sub4["lifeexpectancy"].dropna(), kde=False)
plt.xlabel("Life Expectancy, years")
plt.title("Estimated Life Expectancy in Countries Within Gapminder Dataset")

#make subset of all polity scores, adjusting out for null values in polity scores and lifeexpectancy
#(life expectancy 0 was already recoded to NaN above, so only the polity score filter removes rows here)
sub5= data[(data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]

sub6= sub5.copy()

#recoding them into groups
#note: (-10, -6] is open on the left, so a score of exactly -10 falls outside every bin
sub6["polityscore"] = pandas.cut(sub6.polityscore, [-10, -6, 5, 10])

print(" count of polity scores, without missing value")
c7=sub6["polityscore"].value_counts(sort=False)
print (c7)

#univariate bar graph for categorical variables
# First change format from numeric to categorical
sub6["polityscore"]=sub6["polityscore"].astype("category")

plt.figure()
seaborn.countplot(x="polityscore", data=sub6)
plt.xlabel("polity score")
plt.title("Polity Score in Gapminder Dataset, excluding missing values")

print ("Describe polityscore, group by govt type")
desc1= sub6["polityscore"].describe()
print (desc1)

print ("counts of polity score, group by govt type")
c8=sub6["polityscore"].value_counts(sort=False)
print (c8)

# second, rename the categories with government-type labels
sub6["polityscore"]=sub6["polityscore"].cat.rename_categories(["Autocracy","Anocracy","Democracy"])

plt.figure()
seaborn.countplot(x="polityscore", data=sub6)
plt.xlabel("polity score")
plt.title("Polity Score in Gapminder Dataset, excluding missing values")

# bivariate bar graph C->Q (catplot is the current name for factorplot)
seaborn.catplot(x="polityscore", y="lifeexpectancy", data=sub6, kind="bar", ci=None)
plt.xlabel("Polity Score")
plt.ylabel("Life Expectancy, years")
output and summary:
count   191.000000
mean     69.753524
std       9.708621
min      47.794000
25%      64.447000
50%      73.131000
75%      76.593000
max      83.394000
Life expectancy, without missing values, ranges from 47.8 years to 83.4 years. The mean is 69.8 years, with a standard deviation of 9.7 years. The median life expectancy is 73.1 years.
Tumblr media
This histogram, generated in Excel, shows a relatively symmetric distribution when life expectancy values are not grouped. Two modes appear at approximately 72.9 and 74 years.
Polity scores are categorical in nature. Countries with an observed polity score are given an integer score from -10 to 10. Autocracies have a score of -6 or lower, anocracies range from -5 to 5, and democracies have a score of 6 or greater.
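The grouping rule in that paragraph can be written as a tiny function. This is just an illustrative sketch (the function name `regime_type` is made up, it is not in my script):

```python
def regime_type(polity_score):
    """Map a polity score (-10 to 10) to the government type used in this post."""
    if not -10 <= polity_score <= 10:
        raise ValueError("polity score must be between -10 and 10")
    if polity_score <= -6:
        return "Autocracy"
    if polity_score <= 5:
        return "Anocracy"
    return "Democracy"


# Check the scores right at the group boundaries
for score in (-10, -6, -5, 5, 6, 10):
    print(score, regime_type(score))
```

One difference from the script worth knowing: this function counts a score of exactly -10 as an autocracy, while the `pandas.cut` bins `(-10, -6]` are open on the left, so countries scoring -10 fall outside the first bin unless `include_lowest=True` is passed.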
Name: lifeexpectancy, dtype: float64
 count of polity scores, without missing value
(-10, -6]    21
(-6, 5]      48
(5, 10]      90
Name: polityscore, dtype: int64
Describe polityscore, group by govt type
count         159
unique          3
top       (5, 10]
freq           90
Democracies are the most common type of government in the Gapminder dataset, observed in 90 of the 159 countries with polity score values.
Tumblr media
When incorporating life expectancy with polity score groups, we see that there is a difference between them:
Tumblr media
The average life expectancy for autocracies is close to 70 years, while the average in democracies is about 72 years. More surprisingly, life expectancy is lowest in anocracies, at approximately 62 years.
My original hypothesis was that democracies would have higher life expectancy. Democracies do show the highest average life expectancy, but autocracies have a comparable average. Further testing would be needed to control for other factors and see whether this hypothesis really holds for this dataset.
trsmit-blog · 8 years ago
Week 3 - Munging Around
Initial thoughts:
The Gapminder dataset didn’t use numeric codes to denote missing values, so the first lesson’s approach of using numpy to convert a “missing value” code to “nan” wasn’t an option for me as demonstrated. Additionally, this dataset is compiled from multiple other datasets rather than from one survey, so there aren’t any traditional skip-pattern questions.
Lesson 3 thoughts:
I also opted out of creating secondary variables from arithmetic operations on other variables. I could have looked at carbon dioxide emissions per unit of residential electricity use, but I don’t think that would make sense: one is a cumulative value over time while the other is electricity use in one particular year, and a portion of the cumulative emissions was created outside of electricity use. However, the wide range of values for life expectancy and other variables provides an opportunity to recode them into groups.
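Recoding a wide-ranging variable into groups can be done with equal-width bins or with quartiles, and the two do not give the same groups. Here is a minimal sketch of the difference, using the minimum, quartiles and maximum that `describe()` reports for life expectancy in this dataset; the variable names are only for illustration.

```python
# Summary values reported by describe() for life expectancy in this dataset
minimum, q1, median, q3, maximum = 47.794, 64.447, 73.131, 76.593, 83.394

# Equal-width grouping: split the full range into 4 intervals of the same width
n_bins = 4
width = (maximum - minimum) / n_bins
equal_width_edges = [round(minimum + i * width, 3) for i in range(n_bins + 1)]

# Quartile grouping: the inner edges are the 25th, 50th and 75th percentiles,
# so each group holds about a quarter of the countries
quartile_edges = [minimum, q1, median, q3, maximum]

print("equal-width edges:", equal_width_edges)
print("quartile edges:   ", quartile_edges)
```

Because most countries sit in the upper part of the range, equal-width bins put far more countries in the top two groups, whereas quartile edges crowd together between 64 and 77 years. In pandas terms, `pandas.cut` with an integer gives the first kind of grouping and `pandas.qcut` the second.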
 --well my initial plans didn’t work as expected, so I cleaned the data in Excel to deal with missing values before any further management in Python. This was only an option because of the smaller size of the dataset. So I welcome any feedback on how to do it within Python.
Polityscore: replaced missing values with 11. I then have to restrict my democracy subgroup to scores from 6 up to, but not including, 11.
Lifeexpectancy: replaced missing or null values with 0.
Relectricperperson: replaced missing value with -1.
Hivrate: replaced missing values with 99.
Co2emissions: replaced missing values with 0.
Alcconsumption: replaced missing values with 99
Then saved as gapminder1.csv -------------------------------------------------------
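On doing this step within Python instead of Excel: one possible approach is to let pandas turn anything non-numeric into NaN and then count or drop the missing rows. The sketch below uses a tiny made-up table rather than the real file, and the "unknown" marker is hypothetical, so treat it as an illustration of the idea and not as what the raw Gapminder file actually contains.

```python
import io
import pandas

# Hypothetical stand-in for the raw file: one empty cell and one text marker
raw = io.StringIO(
    "country,polityscore,lifeexpectancy\n"
    "A,10,81.5\n"
    "B,,62.1\n"
    "C,-7,unknown\n"
)
data = pandas.read_csv(raw)

# errors="coerce" turns any entry that cannot be parsed as a number into NaN
for col in ["polityscore", "lifeexpectancy"]:
    data[col] = pandas.to_numeric(data[col], errors="coerce")

# How many values are missing in each column
missing_counts = data[["polityscore", "lifeexpectancy"]].isnull().sum()
print(missing_counts)

# Keep only the rows where both variables are observed
complete = data.dropna(subset=["polityscore", "lifeexpectancy"])
print(complete)
```

With NaN as the missing marker there is no need for placeholder codes such as 11 or 99, and functions like `describe()` and `dropna()` skip or remove the missing rows on their own.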
The code
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 27 12:53:43 2017

@author: Tofu
"""

import pandas
import numpy

data = pandas.read_csv('gapminder1.csv', low_memory=False)

#bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' %x)

#number of observations and variables
print(len(data))
print(len(data.columns))

#converting variables to numeric values (convert_objects is deprecated; pandas.to_numeric(..., errors='coerce') is the replacement)
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)

print ("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print (c1)

print ("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print (p1)

#making a subset of democracies, adjusting for null value in polityscore and lifeexpectancy variables
sub1= data[(data["polityscore"]>= 6) & (data["polityscore"]<11) & (data["lifeexpectancy"]!=0)]

#make a copy of my democracy subset
sub2= sub1.copy()

# frequency distributions of democracy
print ("counts for democracy - polity score 6 or greater, positve life expectancy")
c7 = sub2["polityscore"].value_counts(sort=False)
print (c7)

print ("percentages of democratic govs - polity score 6 or greater, positive life expectancy")
p7=sub2["polityscore"].value_counts(sort=False, normalize=True)
print (p7)

#making a subset of lifeexpectancy to exclude missing values
sub3=data[(data["lifeexpectancy"]!=0)]

sub4=sub3.copy()

#checking my subset of lifeexpectancy
print("counts for lifeexpectancy - without missing values")
c8=sub4["lifeexpectancy"].value_counts(sort=False)
print (c8)

print("counts for lifeexpectancy - with missing values")
c2=data["lifeexpectancy"].value_counts(sort=False)
print (c2)

#4 groups of life expectancy in adjusted lifeexpectancy
#(pandas.cut with an integer makes equal-width bins, not true quartiles; pandas.qcut would give quartiles)
print ("life expectancy, 4 categories or quartiles, all polity scores")
sub4["lifeexpectancy4"]=pandas.cut(sub4.lifeexpectancy, 4, labels=["1=25%tile", "2=50%tile","3=75%tile", "4=%100tile"])
c4=sub4["lifeexpectancy4"].value_counts(sort=False, dropna=True)
print (c4)

#cross tabs of life expectancy groups
print (pandas.crosstab(sub4["lifeexpectancy4"], sub4["lifeexpectancy"]))

#create own lifeexpectancy groups in adjusted subgroup lifeexpectancy for 10 year grouping
sub4["lifeexpectancy5"]=pandas.cut(sub4.lifeexpectancy, [40,50,60,70,80,90])
c5=sub4["lifeexpectancy5"].value_counts(sort=False)
print(c5)

#crosstab of the 4 life expectancy groups with 10 year groupings
print (pandas.crosstab(sub4["lifeexpectancy4"], sub4["lifeexpectancy5"]))

#crosstab of life expectancy 10 year groupings with basic lifeexpectancy
print (pandas.crosstab(sub4["lifeexpectancy5"], sub4["lifeexpectancy"]))

#counts and distribution of democracy in lifeexpectancy
print ("counts of polityscore in lifeexpectancy")
c10 = sub4["polityscore"].value_counts(sort=False)
print (c10)

print ("percentages of polityscore in life expectancy")
p10=sub4["polityscore"].value_counts(sort=False, normalize=True)
print (p10)

#16% of positive life expectancy observations are missing polity scores, so adjust polityscore in lifeexpectancy subset to exclude missing value 11
sub4["polityscore"]=sub4["polityscore"].replace(11, numpy.nan)

print ("counts of polityscore in positive life expectancy,post coding 11 to be nan")
c11=sub4["polityscore"].value_counts(sort=False, dropna=False)
print (c11)

print ("percentages of polity score in positive life expectancy, post coding 11 to be nan")
p11=sub4["polityscore"].value_counts(sort=False, normalize=True, dropna=False)
print (p11)

print ("counts of polityscore in positive lifeexpectancy, no nan")
c12=sub4["polityscore"].value_counts(sort=False, dropna=True)
print (c12)

print ("percentages of polity sore in positive lifeexpectancy, no nan")
p12=sub4["polityscore"].value_counts(sort=False, normalize=True, dropna=True)
print (p12)

#crosstabs of lifeexpectancy, 10 year groups and polity scores
print ("crosstab of positive lifeexpectancy, 10 yr grouping, and polity score, lifeexpectancy as columns")
print (pandas.crosstab(sub4["polityscore"], sub4["lifeexpectancy5"]))

print ("crosstab of polity score and postive lifeexpectancy, 10 yr grouping, polity score as columns")
print (pandas.crosstab(sub4["lifeexpectancy5"],sub4["polityscore"]))

#crosstabs of lifeexpectancy, 10 year grouping, within democracies
print ("crosstabs of lifeexpectancy 1o yr grouping and democracies, view with lifeexpectancy as columns")
print (pandas.crosstab(sub2["polityscore"], sub4["lifeexpectancy5"]))

print ("crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as rows")
print (pandas.crosstab(sub4["lifeexpectancy5"], sub2["polityscore"]))
The output (truncated; the initial lines are omitted):
-9      4
-10     2
Name: polityscore, dtype: int64
percentages of polityscore - democracy score in 2009, numeric
0   0.028169
-2    0.023474
-3    0.028169
-4    0.028169
4    0.018779
5   0.032864
6   0.046948
7   0.061033
8   0.089202
9   0.070423
10   0.154930
11   0.244131
-1    0.018779
2   0.014085
3   0.009390
1   0.014085
-6    0.014085
-7    0.056338
-8    0.009390
-5    0.009390
-9    0.018779
-10   0.009390
Name: polityscore, dtype: float64
counts for democracy - polity score 6 or greater, positve life expectancy
6     10
7     13
8     19
9     15
10    32
Name: polityscore, dtype: int64
percentages of democratic govs - polity score 6 or greater, positive life expectancy
6    0.112360
7    0.146067
8    0.213483
9    0.168539
10   0.359551
Name: polityscore, dtype: float64
counts for lifeexpectancy - without missing values
63.125000    1
79.341000    1
49.553000    1
68.795000    1
58.582000    1
79.977000    1
58.199000    1
80.170000    1
81.012000    1
74.573000    1
70.124000    1
70.563000    1
76.954000    1
48.398000    1
68.944000    1
75.181000    1
81.126000    1
75.956000    1
69.317000    1
65.193000    1
80.557000    1
67.185000    1
73.990000    1
75.901000    1
82.759000    1
79.499000    1
61.597000    1
79.158000    1
71.017000    1
76.546000    1
           ..
73.703000    1
67.714000    1
51.093000    1
71.172000    1
73.456000    1
74.044000    1
78.005000    1
78.371000    1
76.640000    1
74.788000    1
76.652000    1
82.338000    1
80.499000    1
74.414000    1
75.670000    1
67.017000    1
61.452000    1
68.498000    1
73.127000    1
74.156000    1
75.620000    1
62.791000    1
72.832000    1
62.703000    1
68.749000    1
76.126000    1
81.539000    1
54.210000    1
57.379000    1
73.373000    1
Name: lifeexpectancy, Length: 189, dtype: int64
counts for lifeexpectancy - with missing values
63.125000     1
79.341000     1
0.000000     22
49.553000     1
68.795000     1
58.582000     1
79.977000     1
58.199000     1
80.170000     1
81.012000     1
74.573000     1
70.124000     1
70.563000     1
76.954000     1
48.398000     1
68.944000     1
75.181000     1
81.126000     1
75.956000     1
69.317000     1
65.193000     1
80.557000     1
67.185000     1
73.990000     1
75.901000     1
82.759000     1
79.499000     1
61.597000     1
79.158000     1
71.017000     1
            ..
73.703000     1
67.714000     1
51.093000     1
71.172000     1
73.456000     1
74.044000     1
78.005000     1
78.371000     1
76.640000     1
74.788000     1
76.652000     1
82.338000     1
80.499000     1
74.414000     1
75.670000     1
67.017000     1
61.452000     1
68.498000     1
73.127000     1
74.156000     1
75.620000     1
62.791000     1
72.832000     1
62.703000     1
68.749000     1
76.126000     1
81.539000     1
54.210000     1
57.379000     1
73.373000     1
Name: lifeexpectancy, Length: 190, dtype: int64
life expectancy, 4 categories or quartiles, all polity scores
1=25%tile     28
2=50%tile     26
3=75%tile     63
4=%100tile    74
Name: lifeexpectancy4, dtype: int64
lifeexpectancy   47.794000 48.132000  48.196000  48.397000 48.398000  \
lifeexpectancy4                                                        
1=25%tile                1          1          1          1          1  
2=50%tile                0          0          0          0          0  
3=75%tile                0          0          0          0          0  
4=%100tile               0          0          0          0          0  
 lifeexpectancy   48.673000 48.718000  49.025000  49.553000 50.239000  \
lifeexpectancy4                                                        
1=25%tile                1          1          1          1          1  
2=50%tile                0          0          0          0          0  
3=75%tile                0          0          0          0          0  
4=%100tile               0          0          0          0          0  
 lifeexpectancy     ...     81.404000  81.439000  81.539000 81.618000  \
lifeexpectancy4    ...                                                  
1=25%tile          ...              0          0          0          0  
2=50%tile          ...              0          0          0         0  
3=75%tile          ...              0          0          0          0  
4=%100tile         ...              1          1          1          1  
 lifeexpectancy   81.804000 81.855000  81.907000  82.338000 82.759000  \
lifeexpectancy4                                                        
1=25%tile                0          0          0          0          0  
2=50%tile                0          0          0          0          0  
3=75%tile                0          0         0          0          0  
4=%100tile               1          1          1          1          1  
 lifeexpectancy   83.394000
lifeexpectancy4            
1=25%tile                0  
2=50%tile                0  
3=75%tile                0  
4=%100tile               1  
 [4 rows x 189 columns]
(40, 50]     9
(50, 60]    29
(60, 70]    38
(70, 80]    92
(80, 90]    23
Name: lifeexpectancy5, dtype: int64
lifeexpectancy5  (40, 50] (50, 60]  (60, 70]  (70, 80] (80, 90]
lifeexpectancy4                                                  
1=25%tile               9        19         0         0         0
2=50%tile               0        10        16         0         0
3=75%tile               0         0        22        41         0
4=%100tile              0         0         0        51        23
lifeexpectancy   47.794000 48.132000  48.196000  48.397000 48.398000  \
lifeexpectancy5                                                        
(40, 50]                 1          1          1          1         1  
(50, 60]                 0          0          0          0          0  
(60, 70]                 0          0          0          0          0  
(70, 80]                 0          0          0          0          0  
(80, 90]                 0          0          0          0          0  
 lifeexpectancy   48.673000 48.718000  49.025000  49.553000 50.239000  \
lifeexpectancy5                                                        
(40, 50]                 1          1         1          1          0  
(50, 60]                 0          0          0          0          1  
(60, 70]                 0          0          0          0          0  
(70, 80]                 0          0          0          0          0  
(80, 90]                 0          0          0          0          0  
 lifeexpectancy     ...     81.404000  81.439000  81.539000 81.618000  \
lifeexpectancy5    ...                                                  
(40, 50]           ...              0          0          0          0  
(50, 60]           ...              0          0          0          0  
(60, 70]           ...              0          0          0          0  
(70, 80]           ...              0          0          0          0  
(80, 90]           ...              1          1          1          1  
 lifeexpectancy   81.804000 81.855000  81.907000  82.338000 82.759000  \
lifeexpectancy5                                                        
(40, 50]                 0         0          0          0          0  
(50, 60]                 0          0          0          0          0  
(60, 70]                 0          0          0          0          0  
(70, 80]                 0          0          0         0          0  
(80, 90]                 1          1          1          1          1  
 lifeexpectancy   83.394000
lifeexpectancy5            
(40, 50]                 0  
(50, 60]                 0  
(60, 70]                 0  
(70, 80]                 0  
(80, 90]                 1  
 [5 rows x 189 columns]
counts of polityscore in lifeexpectancy
0      6
-2      5
-3      6
-4      6
4      4
5      7
6     10
7     13
8     19
9     15
10    32
11    31
-1      4
2      3
3      2
1      3
-6      3
-7     12
-8      2
-5      2
-9      4
-10     2
Name: polityscore, dtype: int64
percentages of polityscore in life expectancy
0   0.031414
-2    0.026178
-3    0.031414
-4    0.031414
4   0.020942
5   0.036649
6   0.052356
7   0.068063
8   0.099476
9   0.078534
10   0.167539
11   0.162304
-1    0.020942
2   0.015707
3   0.010471
1   0.015707
-6    0.015707
-7    0.062827
-8    0.010471
-5    0.010471
-9    0.020942
-10   0.010471
Name: polityscore, dtype: float64
counts of polityscore in positive life expectancy,post coding 11 to be nan
10.000000     32
8.000000      19
5.000000       7
-3.000000      6
7.000000      13
-4.000000      6
6.000000      10
9.000000      15
nan           31
-10.000000     2
-6.000000      3
-9.000000      4
-7.000000     12
-8.000000      2
2.000000       3
-2.000000      5
0.000000       6
3.000000       2
1.000000       3
4.000000       4
-1.000000      4
-5.000000      2
Name: polityscore, dtype: int64
percentages of polity score in positive life expectancy, post coding 11 to be nan
10.000000    0.167539
8.000000     0.099476
5.000000     0.036649
-3.000000    0.031414
7.000000     0.068063
-4.000000    0.031414
6.000000     0.052356
9.000000     0.078534
nan          0.162304
-10.000000   0.010471
-6.000000    0.015707
-9.000000    0.020942
-7.000000    0.062827
-8.000000    0.010471
2.000000     0.015707
-2.000000    0.026178
0.000000     0.031414
3.000000     0.010471
1.000000     0.015707
4.000000     0.020942
-1.000000    0.020942
-5.000000    0.010471
Name: polityscore, dtype: float64
counts of polityscore in positive lifeexpectancy, no nan
10.000000     32
8.000000      19
5.000000       7
-3.000000      6
7.000000      13
-4.000000      6
6.000000      10
9.000000      15
-10.000000     2
-6.000000      3
-9.000000      4
-7.000000     12
-8.000000      2
2.000000       3
-2.000000      5
0.000000       6
3.000000       2
1.000000       3
4.000000       4
-1.000000      4
-5.000000      2
Name: polityscore, dtype: int64
percentages of polity sore in positive lifeexpectancy, no nan
10.000000    0.200000
8.000000     0.118750
5.000000     0.043750
-3.000000    0.037500
7.000000     0.081250
-4.000000    0.037500
6.000000     0.062500
9.000000     0.093750
-10.000000   0.012500
-6.000000    0.018750
-9.000000    0.025000
-7.000000    0.075000
-8.000000    0.012500
2.000000     0.018750
-2.000000    0.031250
0.000000     0.037500
3.000000     0.012500
1.000000     0.018750
4.000000     0.025000
-1.000000    0.025000
-5.000000    0.012500
Name: polityscore, dtype: float64
crosstab of positive lifeexpectancy, 10 yr grouping, and polity score, lifeexpectancy as columns
lifeexpectancy5  (40, 50]  (50, 60]  (60, 70]  (70, 80]  (80, 90]
polityscore
-10.000000              0         0         0         2         0
-9.000000               1         0         3         0         0
-8.000000               0         0         0         2         0
-7.000000               0         0         2        10         0
-6.000000               0         0         2         1         0
-5.000000               0         2         0         0         0
-4.000000               0         3         2         1         0
-3.000000               0         2         1         3         0
-2.000000               1         2         1         0         1
-1.000000               1         3         0         0         0
0.000000                1         3         2         0         0
1.000000                0         2         1         0         0
2.000000                0         1         1         1         0
3.000000                0         0         2         0         0
4.000000                0         1         2         1         0
5.000000                1         1         3         2         0
6.000000                1         3         3         3         0
7.000000                2         4         3         4         0
8.000000                1         1         5        10         2
9.000000                0         1         2        11         1
10.000000               0         0         1        16        15
crosstab of polity score and postive lifeexpectancy, 10 yr grouping, polity score as columns
polityscore      -10.000000  -9.000000   -8.000000   -7.000000   -6.000000   \
lifeexpectancy5                                                              
(40, 50]                  0           1           0           0           0  
(50, 60]                  0           0           0           0           0  
(60, 70]                  0           3           0           2           2  
(70, 80]                  2           0           2          10           1  
(80, 90]                  0           0           0           0           0  
 polityscore      -5.000000   -4.000000   -3.000000   -2.000000   -1.000000   \
lifeexpectancy5                                                              
(40, 50]                  0           0           0          1           1  
(50, 60]                  2           3           2           2           3  
(60, 70]                  0           2           1           1           0  
(70, 80]                  0           1           3           0           0  
(80, 90]                  0           0           0           1           0  
 polityscore         ...      1.000000    2.000000   3.000000    4.000000    \
lifeexpectancy5     ...                                                      
(40, 50]            ...               0           0           0           0  
(50, 60]            ...               2           1           0           1  
(60, 70]            ...               1           1           2           2  
(70, 80]            ...               0           1           0           1  
(80, 90]            ...               0           0           0           0  
 polityscore      5.000000    6.000000   7.000000    8.000000    9.000000   \
lifeexpectancy5                                                              
(40, 50]                  1           1           2           1           0  
(50, 60]                  1           3           4           1           1  
(60, 70]                  3           3           3           5           2  
(70, 80]                  2           3           4          10          11  
(80, 90]                  0           0           0           2           1  
 polityscore      10.000000  
lifeexpectancy5              
(40, 50]                  0  
(50, 60]                  0  
(60, 70]                  1  
(70, 80]                 16  
(80, 90]                 15  
 [5 rows x 21 columns]
crosstabs of lifeexpectancy 1o yr grouping and democracies, view with lifeexpectancy as columns
lifeexpectancy5  (40, 50] (50, 60]  (60, 70]  (70, 80] (80, 90]
polityscore                                                    
6.000000                1         3         3         3         0
7.000000                2         4         3         4         0
8.000000                1         1         5        10         2
9.000000                0         1         2        11         1
10.000000               0         0         1        16        15
crosstabs of lifeexpectancy 10 yr grouping and democracies, view with lifeexpectancy as rows
polityscore      6.000000   7.000000   8.000000   9.000000   10.000000
lifeexpectancy5                                                      
(40, 50]                 1          2          1          0          0
(50, 60]                 3          4          1          1          0
(60, 70]                 3          3          5          2          1
(70, 80]                 3          4         10         11         16
(80, 90]                 0          0          2          1         15
C:/Users/Tofu/Python Project Folder/Week 3.py:21: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 3.py:22: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 3.py:23: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 3.py:24: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 3.py:25: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 3.py:26: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
 data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
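The warnings above point at pd.to_numeric as the replacement for convert_objects. Here is a minimal sketch of that swap. The tiny DataFrame is a made-up stand-in for gapminder.csv, and the blank strings are my assumption about how missing values show up in the file.

```python
import pandas as pd

# Hypothetical stand-in for a few gapminder.csv rows; " " mimics a missing value.
data = pd.DataFrame({
    "polityscore": ["10", " ", "-7"],
    "lifeexpectancy": ["81.012", "48.398", " "],
})

# errors="coerce" turns anything that cannot be parsed into NaN,
# which is what convert_objects(convert_numeric=True) used to do.
for col in ["polityscore", "lifeexpectancy"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

print(data.dtypes)
print(data)
```

Looping over the column names keeps the conversion in one place instead of six nearly identical lines.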
So I ended up filtering out the null values in my subsets. My first subset sections out democracies from the Gapminder data. I also added a filter on democracies to exclude null values of life expectancy (that is, keeping any life expectancy value not equal to zero). Before filtering out missing lifeexpectancy values, there were 90 observations for democratic governments; after, 89 observations remained. Of those 89 observations, 36% were coded with a polity score of 10. I also poked around in my Excel version, just so that I could explore.
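The filtering described here can be sketched roughly as follows. The rows are invented for illustration, and I am assuming missing life expectancy is stored as NaN after numeric conversion rather than as zero.

```python
import numpy as np
import pandas as pd

# Made-up rows, not real Gapminder values.
data = pd.DataFrame({
    "polityscore": [10, 9, 6, 5, -7, np.nan],
    "lifeexpectancy": [81.0, np.nan, 73.1, 62.5, 74.4, 70.0],
})

# Democracies: polity score of 6 or greater.
democracies = data[data["polityscore"] >= 6]

# Keep only democracies with a life expectancy value; .copy() avoids
# chained-assignment warnings if the subset is modified later.
democracies_valid = democracies[democracies["lifeexpectancy"].notna()].copy()

# Share of the remaining democracies coded with a polity score of 10.
share_of_10 = (democracies_valid["polityscore"] == 10).mean()

print(len(democracies), len(democracies_valid), share_of_10)
```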
I first created a subset of lifeexpectancy values for any valid values (aka, not equal to zero). However, that still leaves a huge range of values. So I first tried quartering the data set, which divides life expectancy to have 28 of 191 in the 25th percentile, etc. However, even with a crosstab, this was not easy for me to understand. So I decided to group it myself.
Using Excel, I found the life expectancy values start in the late 40s and end in the mid 80s. Initially, I tried doing 5-year groups, starting at 40, but I received an error that I could have at most 9 distinct groupings. So I decided to break up the data into 10-year groupings instead. 92 observations were between 70 and 80 years, making up 48% of the dataset regardless of polity score.
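The (40, 50], (50, 60] labels in the crosstabs look like what pd.cut produces, so a 10-year grouping along these lines seems likely. This is only a sketch with invented values, and the bin edges are my assumption.

```python
import pandas as pd

# Invented life expectancy values for illustration.
life = pd.Series([48.4, 55.4, 63.1, 74.4, 79.6, 81.0], name="lifeexpectancy")

# Right-closed 10-year bins: (40, 50], (50, 60], ..., (80, 90].
groups = pd.cut(life, bins=[40, 50, 60, 70, 80, 90])

# Counts and shares per group, using the interval label as text.
counts = groups.astype(str).value_counts()
shares = groups.astype(str).value_counts(normalize=True)

print(counts)
print(shares)
```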
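When I crosstabbed the 10-year life expectancy groupings against the quartiles, I found the 50s to 70s groups falling in the 50th percentile range. That range only brackets the median loosely: the 70 to 80 and 80 to 90 groups together hold more than half of the 191 observations, so the median life expectancy in observed countries itself sits above 70 years.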
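A rough sketch of crossing the 10-year groups with quartiles, again on invented values. The Q1 to Q4 labels are my own, and the median is read directly with .median() instead of being inferred from the table.

```python
import pandas as pd

# Invented values; eight observations split evenly into four quartiles.
life = pd.Series([48.4, 51.4, 55.4, 63.1, 68.5, 73.1, 74.4, 79.6])

# 10-year groups and quartile groups, renamed so the crosstab axes differ.
decade = pd.cut(life, bins=[40, 50, 60, 70, 80, 90]).rename("decade")
quartile = pd.qcut(life, 4, labels=["Q1", "Q2", "Q3", "Q4"]).rename("quartile")

# Rows are 10-year groups, columns are quartiles.
table = pd.crosstab(decade, quartile)
median = life.median()

print(table)
print(median)
```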
However, the life expectancy subset has not removed missing polity scores. When I look at positive life expectancy and all polity scores, there are 31 observations, or 16%, with null values (coded as “11”). So I converted the “11” polity scores in the life expectancy subset to “nan”. When I exclude the “nan” values, the number of observations drops. This is illustrated by the percentage of countries scoring 10 on the polity score: without null values, 20% of non-null life expectancy observations score a 10 (32 observations), versus 17% of observations when we include null polity scores.
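Recoding the placeholder and comparing the two percentages might look something like this. The data is a toy example, and I am assuming 11 is the only placeholder code in use.

```python
import numpy as np
import pandas as pd

# Toy subset: 11 is the placeholder for a missing polity score.
sub = pd.DataFrame({"polityscore": [10, 10, 8, 11, 11, -7]})

# Recode the placeholder to NaN so pandas treats it as missing.
sub["polityscore"] = sub["polityscore"].replace(11, np.nan)

# Shares with missing values kept in the denominator, then without them.
with_missing = sub["polityscore"].value_counts(normalize=True, dropna=False)
without_missing = sub["polityscore"].value_counts(normalize=True)

print(with_missing.loc[10.0], without_missing.loc[10.0])
```

The share for a score of 10 is larger once the missing rows leave the denominator, which is the same effect as the 20% versus 17% above.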
I ran some additional crosstabs on polity scores and life expectancy (in 10-year groups) to better visualize the data, but I found that it was not as insightful as it would be if I could visualize this in a chart.
trsmit-blog · 8 years ago
Week 2 - Let’s Get Programming
Below is my code. There was some code I removed, such as the percentages for alcohol consumption. I also had some code to make additional subsets for anocracies (-6<x<6) and autocracies (x<-6) that was not working either. I will reach out to my comrades who know Python better to see what syntax fixes I need on those.
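I can't tell from the post what actually broke, but one common pitfall is writing a chained comparison like -6 < x < 6 on a pandas column, which raises a ValueError. A sketch of the usual workaround is below, with each condition in parentheses and joined with &. Treating -6 as an autocracy is an assumption taken from the Polity cutoffs in the written analysis, and the scores are made up.

```python
import pandas as pd

# Made-up polity scores covering each boundary.
data = pd.DataFrame({"polityscore": [-10, -6, -5, 0, 5, 6, 10]})

# Democracies: 6 or greater.
democracies = data[data["polityscore"] >= 6].copy()

# Anocracies: strictly between -6 and 6. Each comparison needs its own
# parentheses because & binds tighter than > and <.
anocracies = data[(data["polityscore"] > -6) & (data["polityscore"] < 6)].copy()

# Autocracies: -6 or below (assumed cutoff).
autocracies = data[data["polityscore"] <= -6].copy()

print(len(democracies), len(anocracies), len(autocracies))
```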
———————–
The code:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20 19:02:33 2017

@author: Tofu
"""
import pandas
import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)

# bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' % x)

# number of observations and variables
print(len(data))
print(len(data.columns))

# converting variables to numeric values
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)

# counts and percentages of variables
print("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print(c1)

print("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print(p1)

print("counts of lifeexpectancy - 2011 average life expectancy, years")
c2 = data["lifeexpectancy"].value_counts(sort=False)
print(c2)

print("percentages of lifeexpectancy - 2011 average life expectancy, years")
p2 = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
print(p2)

print("counts of alcconsumption - 2008 alcohol consumption per adult, litres")
c3 = data["alcconsumption"].value_counts(sort=False)
print(c3)

print("counts of co2emissions - cumulative emissions in 2006, metric tons")
c4 = data["co2emissions"].value_counts(sort=False)
print(c4)

print("percentages of co2emissions - cumulative emissions in 2006, metric tons")
p4 = data["co2emissions"].value_counts(sort=False, normalize=True)
print(p4)

print("counts of relectricperperson - residential electricity used per person in 2008, kWh")
c5 = data["relectricperperson"].value_counts(sort=False)
print(c5)

print("percentages of relectricperperson - residential electricity used per person in 2008, kWh")
p5 = data["relectricperperson"].value_counts(sort=False, normalize=True)
print(p5)

print("counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV")
c6 = data["hivrate"].value_counts(sort=False)
print(c6)

print("percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV")
p6 = data["hivrate"].value_counts(sort=False, normalize=True)
print(p6)

# frequency distribution using group by
ct1 = data.groupby("polityscore").size()
print(ct1)

# making a subset of democracies
sub1 = data[(data["polityscore"] >= 6)]

# make a copy of my democracy subset
sub2 = sub1.copy()

# frequency distributions of different governments
print("counts for democracy - polity score 6 or greater")
c7 = sub2["polityscore"].value_counts(sort=False)
print(c7)

print("percentages of democracy - polity score 6 or greater")
p7 = sub2["polityscore"].value_counts(sort=False, normalize=True)
print(p7)

print("counts of life expectancy in democracies")
c8 = sub2["lifeexpectancy"].value_counts(sort=False, dropna=False)
print(c8)

print("percentages of life expectancy in democracies")
p8 = sub2["lifeexpectancy"].value_counts(sort=False, normalize=True)
print(p8)
——————–
the output (something weird happened to my output, so I had to rerun an initial subset of my code to get output)
counts of polityscore - democracy score in 2009, numeric 0.000000       6 9.000000      15 2.000000       3 -2.000000      5 8.000000      19 5.000000       7 10.000000     33 -7.000000     12 7.000000      13 3.000000       2 6.000000      10 -4.000000      6 -1.000000      4 -3.000000      6 -5.000000      2 1.000000       3 -6.000000      3 -9.000000      4 4.000000       4 -8.000000      2 -10.000000     2 Name: polityscore, dtype: int64 percentages of polityscore - democracy score in 2009, numeric 0.000000     0.037267 9.000000     0.093168 2.000000     0.018634 -2.000000    0.031056 8.000000     0.118012 5.000000     0.043478 10.000000    0.204969 -7.000000    0.074534 7.000000     0.080745 3.000000     0.012422 6.000000     0.062112 -4.000000    0.037267 -1.000000    0.024845 -3.000000    0.037267 -5.000000    0.012422 1.000000     0.018634 -6.000000    0.018634 -9.000000    0.024845 4.000000     0.024845 -8.000000    0.012422 -10.000000   0.012422 Name: polityscore, dtype: float64 counts of lifeexpectancy - 2011 average life expectancy, years 63.125000    1 79.591000    1 74.576000    1 62.475000    1 74.414000    1 79.977000    1 58.199000    1 75.670000    1 81.012000    1 72.283000    1 55.442000    1 81.855000    1 48.398000    1 68.944000    1 75.133000    1 76.126000    1 69.317000    1 65.193000    1 75.057000    1 77.685000    1 68.498000    1 62.465000    1 79.634000    1 73.911000    1 80.499000    1 61.597000    1 79.341000    1 71.017000    1 82.759000    1 68.978000    1            .. 
76.954000    1 73.703000    1 79.839000    1 48.718000    1 71.172000    1 73.456000    1 48.397000    1 81.439000    1 75.246000    1 55.377000    1 74.788000    1 74.402000    1 82.338000    1 79.499000    1 81.539000    1 54.210000    1 67.017000    1 61.452000    1 73.373000    1 73.127000    1 69.245000    1 68.795000    1 72.832000    1 76.918000    1 57.937000    1 73.126000    1 64.666000    1 75.956000    1 57.379000    1 50.239000    1 Name: lifeexpectancy, Length: 189, dtype: int64 percentages of lifeexpectancy - 2011 average life expectancy, years 63.125000   0.005236 79.591000   0.005236 74.576000   0.005236 62.475000   0.005236 74.414000   0.005236 79.977000   0.005236 58.199000   0.005236 75.670000   0.005236 81.012000   0.005236 72.283000   0.005236 55.442000   0.005236 81.855000   0.005236 48.398000   0.005236 68.944000   0.005236 75.133000   0.005236 76.126000   0.005236 69.317000   0.005236 65.193000   0.005236 75.057000   0.005236 77.685000   0.005236 68.498000   0.005236 62.465000   0.005236 79.634000   0.005236 73.911000   0.005236 80.499000   0.005236 61.597000   0.005236 79.341000   0.005236 71.017000   0.005236 82.759000   0.005236 68.978000   0.005236
76.954000   0.005236 73.703000   0.005236 79.839000   0.005236 48.718000   0.005236 71.172000   0.005236 73.456000   0.005236 48.397000   0.005236 81.439000   0.005236 75.246000   0.005236 55.377000   0.005236 74.788000   0.005236 74.402000   0.005236 82.338000   0.005236 79.499000   0.005236 81.539000   0.005236 54.210000   0.005236 67.017000   0.005236 61.452000   0.005236 73.373000   0.005236 73.127000   0.005236 69.245000   0.005236 68.795000   0.005236 72.832000   0.005236 76.918000   0.005236 57.937000   0.005236 73.126000   0.005236 64.666000   0.005236 75.956000   0.005236 57.379000   0.005236 50.239000   0.005236 Name: lifeexpectancy, Length: 189, dtype: float64 counts of alcconsumption - 2008 alcohol consumption per adult, litres 15.000000    1 5.250000     1 3.990000     1 9.750000     1 0.500000     1 9.500000     1 6.560000     1 5.000000     1 4.990000     1 4.430000     1 11.010000    1 5.120000     1 7.790000     1 1.870000     1 5.920000     2 0.920000     1 3.020000     1 6.990000     1 12.050000    1 12.020000    1 3.610000     1 12.480000    1 0.280000     1 8.680000     1 0.520000     1 13.310000    1 11.410000    1 0.340000     2 9.720000     1 4.390000     1            .. 0.560000     1 7.300000     1 1.320000     1 6.420000     1 3.880000     1 10.620000    1 9.860000     1 8.550000     1 0.650000     1 10.710000    1 12.840000    1 1.290000     1 3.390000     2 10.080000    1 2.270000     1 9.460000     1 8.170000     1 1.030000     1 5.050000     1 6.660000     1 3.110000     1 7.320000     1 2.760000     1 1.640000     1 0.050000     1 16.300000    1 5.210000     1 0.320000     1 9.480000     1 8.690000     1
1718339333.333330     1 2251333.333333        1 72524250333.333298    1 248358000.000000      1 2329308666.666670     1 2401666.666667        1                     .. 21351000.000000       1 2335666.666667        1 13304503666.666700    1 1414031666.666670     1 95256333.333333       1 81191000.000000       1 511107666.666667      1 30800000.000000       1 7315000.000000        1 28490000.000000       1 1839471333.333330     1 127108666.666667      1 3157700333.333330     1 78943333.333333       1 236419333.333333      1 132025666.666667      1 1146277000.000000     1 1436893333.333330     1 5214000.000000        1 3503877666.666670     1 7813666.666667        1 33341634333.333302    1 4814333.333333        1 8231666.666667        1 7601000.000000        1 20152000.000000       1 149904333.333333      1 7861553333.333330     1 322960000.000000      1 35717000.000000       1 Name: co2emissions, Length: 200, dtype: int64 percentages of co2emissions - cumulative emissions in 2006, metric tons 4286590000.000000    0.005000 8092333.333333       0.005000 1045000.000000       0.005000 23404568000.000000   0.005000 5872119000.000000    0.005000 1548044666.666670    0.005000 9155666.666667       0.005000 277170666.666667     0.005000 29758666.666667      0.005000 119958666.666667     0.005000 850666.666667        0.005000 148470666.666667     0.005000 590219666.666666     0.005000 4200940333.333330    0.005000 340090666.666667     0.005000 1286670000.000000    0.005000 14058000.000000      0.005000 41229554666.666702   0.005000 598774000.000000     0.005000 377303666.666667     0.005000 7355333.333333       0.005000 26209333.333333      0.005000 2008116000.000000    0.005000 446365333.333333     0.005000 1718339333.333330    0.005000 2251333.333333       0.005000 72524250333.333298   0.005000 248358000.000000     0.005000 2329308666.666670    0.005000 2401666.666667       0.005000
21351000.000000      0.005000 2335666.666667       0.005000 13304503666.666700   0.005000 1414031666.666670    0.005000 95256333.333333      0.005000 81191000.000000      0.005000 511107666.666667     0.005000 30800000.000000      0.005000 7315000.000000       0.005000 28490000.000000      0.005000 1839471333.333330    0.005000 127108666.666667     0.005000 3157700333.333330    0.005000 78943333.333333      0.005000 236419333.333333     0.005000 132025666.666667     0.005000 1146277000.000000    0.005000 1436893333.333330    0.005000 5214000.000000       0.005000 3503877666.666670    0.005000 7813666.666667       0.005000 33341634333.333302   0.005000 4814333.333333       0.005000 8231666.666667       0.005000 7601000.000000       0.005000 20152000.000000      0.005000 149904333.333333     0.005000 7861553333.333330    0.005000 322960000.000000     0.005000 35717000.000000      0.005000 Name: co2emissions, Length: 200, dtype: float64 counts of relectricperperson - residential electricity used per person in 2008, kWh 0.000000       5 1920.962215    1 2826.044873    1 55.794744      1 2124.608816    1 528.648051     1 2993.092660    1 187.324882     1 1494.410268    1 15.056236      1 528.787350     1 825.941111     1 314.826200     1 1585.174739    1 1490.056909    1 186.925515     1 368.434606     1 1884.299342    1 815.031091     1 70.387444      1 59.551245      1 1933.945615    1 767.970324     1 913.845660     1 31.544564      1 2123.762863    1 51.581320      1 753.209802     1 921.562111     1 4036.953993    1              .. 
304.940115     1 209.094517     1 41.180003      1 920.137600     1 1831.731848    1 1690.718434    1 168.623031     1 768.428300     1 614.907287     1 4759.453844    1 38.634503      1 1411.230532    1 532.515177     1 1142.309009    1 2261.316713    1 20.288131      1 256.099151     1 404.591365     1 590.509814     1 325.839561     1 3433.932449    1 636.341383     1 38.005637      1 31.386838      1 537.104738     1 7432.130852    1 351.166594     1 97.246492      1 9.192395       1 1259.392457    1 Name: relectricperperson, Length: 132, dtype: int64 percentages of relectricperperson - residential electricity used per person in 2008, kWh 0.000000      0.036765 1920.962215   0.007353 2826.044873   0.007353 55.794744     0.007353 2124.608816   0.007353 528.648051    0.007353 2993.092660   0.007353 187.324882    0.007353 1494.410268   0.007353 15.056236     0.007353 528.787350    0.007353 825.941111    0.007353 314.826200    0.007353 1585.174739   0.007353 1490.056909   0.007353 186.925515    0.007353 368.434606    0.007353 1884.299342   0.007353 815.031091    0.007353 70.387444     0.007353 59.551245     0.007353 1933.945615   0.007353 767.970324    0.007353 913.845660    0.007353 31.544564     0.007353 2123.762863   0.007353 51.581320     0.007353 753.209802    0.007353 921.562111    0.007353 4036.953993   0.007353
304.940115    0.007353 209.094517    0.007353 41.180003     0.007353 920.137600    0.007353 1831.731848   0.007353 1690.718434   0.007353 168.623031    0.007353 768.428300    0.007353 614.907287    0.007353 4759.453844   0.007353 38.634503     0.007353 1411.230532   0.007353 532.515177    0.007353 1142.309009   0.007353 2261.316713   0.007353 20.288131     0.007353 256.099151    0.007353 404.591365    0.007353 590.509814    0.007353 325.839561    0.007353 3433.932449   0.007353 636.341383    0.007353 38.005637     0.007353 31.386838     0.007353 537.104738    0.007353 7432.130852   0.007353 351.166594    0.007353 97.246492     0.007353 9.192395      0.007353 1259.392457   0.007353 Name: relectricperperson, Length: 132, dtype: float64 counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV 2.000000      2 0.500000      5 2.500000      2 5.000000      1 1.500000      2 11.000000     1 1.300000      2 1.000000      4 11.500000     1 6.500000      1 13.500000     1 3.600000      1 17.800000     1 3.200000      1 14.300000     1 1.400000      1 0.100000     28 2.300000      1 3.300000      1 1.900000      1 0.700000      3 25.900000     1 5.600000      1 0.200000     15 0.400000      9 0.060000     16 0.800000      5 0.300000     10 3.100000      1 1.200000      4 5.300000      1 24.800000     1 3.400000      3 1.700000      1 23.600000     1 6.300000      1 0.450000      1 0.600000      3 13.100000     1 4.700000      1 2.900000      1 1.600000      1 0.900000      4 5.200000      1 1.100000      2 1.800000      1 Name: hivrate, dtype: int64 percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV 2.000000    0.013605 0.500000    0.034014 2.500000    0.013605 5.000000    0.006803 1.500000    0.013605 11.000000   0.006803 1.300000    0.013605 1.000000    0.027211 11.500000   0.006803 6.500000    0.006803 13.500000   0.006803 3.600000    0.006803 17.800000   0.006803 3.200000    0.006803 14.300000   0.006803 1.400000    
0.006803 0.100000    0.190476 2.300000    0.006803 3.300000    0.006803 1.900000    0.006803 0.700000    0.020408 25.900000   0.006803 5.600000    0.006803 0.200000    0.102041 0.400000    0.061224 0.060000    0.108844 0.800000    0.034014 0.300000    0.068027 3.100000    0.006803 1.200000    0.027211 5.300000    0.006803 24.800000   0.006803 3.400000    0.020408 1.700000    0.006803 23.600000   0.006803 6.300000    0.006803 0.450000    0.006803 0.600000    0.020408 13.100000   0.006803 4.700000    0.006803 2.900000    0.006803 1.600000    0.006803 0.900000    0.027211 5.200000    0.006803 1.100000    0.013605 1.800000    0.006803 Name: hivrate, dtype: float64 polityscore -10.000000     2 -9.000000      4 -8.000000      2 -7.000000     12 -6.000000      3 -5.000000      2 -4.000000      6 -3.000000      6 -2.000000      5 -1.000000      4 0.000000       6 1.000000       3 2.000000       3 3.000000       2 4.000000       4 5.000000       7 6.000000      10 7.000000      13 8.000000      19 9.000000      15 10.000000     33 dtype: int64 counts for democracy - polity score 6 or greater 9.000000     15 8.000000     19 10.000000    33 7.000000     13 6.000000     10 Name: polityscore, dtype: int64 percentages of democracy - polity score 6 or greater 9.000000    0.166667 8.000000    0.211111 10.000000   0.366667 7.000000    0.144444 6.000000    0.111111 Name: polityscore, dtype: float64 counts of life expectancy in democracies nan          1 79.591000    1 77.005000    1 79.915000    1 48.196000    1 80.642000    1 74.414000    1 79.977000    1 74.847000    1 81.012000    1 79.499000    1 81.855000    1 80.654000    1 80.734000    1 69.317000    1 80.557000    1 79.120000    1 68.498000    1 66.618000    1 75.446000    1 81.097000    1 81.618000    1 79.341000    1 74.825000    1 67.852000    1 62.465000    1 73.737000    1 73.488000    1 74.221000    1 81.907000    1            .. 
73.703000    1 73.339000    1 71.172000    1 81.439000    1 74.044000    1 76.126000    1 68.494000    1 73.371000    1 72.640000    1 80.170000    1 78.531000    1 57.134000    1 51.444000    1 81.539000    1 68.795000    1 74.522000    1 81.404000    1 73.373000    1 73.127000    1 83.394000    1 72.231000    1 54.210000    1 64.228000    1 76.918000    1 68.749000    1 73.126000    1 80.414000    1 65.438000    1 80.854000    1 73.396000    1 Name: lifeexpectancy, Length: 89, dtype: int64 percentages of life expectancy in democracies 79.591000   0.011236 77.005000   0.011236 79.915000   0.011236 48.196000   0.011236 80.642000   0.011236 74.414000   0.011236 79.977000   0.011236 74.847000   0.011236 81.012000   0.011236 79.499000   0.011236 81.855000   0.011236 80.654000   0.011236 80.734000   0.011236 69.317000   0.011236 80.557000   0.011236 79.120000   0.011236 68.498000   0.011236 66.618000   0.011236 75.446000   0.011236 81.097000   0.011236 81.618000   0.011236 79.341000   0.011236 74.825000   0.011236 67.852000   0.011236 62.465000   0.011236 73.737000   0.011236 73.488000   0.011236 74.221000   0.011236 81.907000   0.011236 69.366000   0.011236
73.703000   0.011236 73.339000   0.011236 71.172000   0.011236 81.439000   0.011236 74.044000   0.011236 76.126000   0.011236 68.494000   0.011236 73.371000   0.011236 72.640000   0.011236 80.170000   0.011236 78.531000   0.011236 57.134000   0.011236 51.444000   0.011236 81.539000   0.011236 68.795000   0.011236 74.522000   0.011236 81.404000   0.011236 73.373000   0.011236 73.127000   0.011236 83.394000   0.011236 72.231000   0.011236 54.210000   0.011236 64.228000   0.011236 76.918000   0.011236 68.749000   0.011236 73.126000   0.011236 80.414000   0.011236 65.438000   0.011236 80.854000   0.011236 73.396000   0.011236 Name: lifeexpectancy, Length: 88, dtype: float64 C:/Users/Tofu/Python Project Folder/Week 2_amended.py:20: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  data['polityscore’] = data['polityscore’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:21: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  data['lifeexpectancy’] = data['lifeexpectancy’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:22: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  data['alcconsumption’] = data['alcconsumption’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:23: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  data['co2emissions’] = data['co2emissions’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:24: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  
data['relectricperperson’] = data['relectricperperson’].convert_objects(convert_numeric=True) C:/Users/Tofu/Python Project Folder/Week 2_amended.py:25: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.  data['hivrate’] = data['hivrate’].convert_objects(convert_numeric=True)
------------------------ written univariate analysis--------------
Polityscore is one of my main variables of interest. The Polity IV Project (http://www.systemicpeace.org/polityproject.html) identifies democracies as scores greater than 5, anocracies as scores between 6 and -6, and autocracies as scores of -6 or below. I used Excel to help refine the results above.
Gapminder has 213 observations, 52 of which lack polity score values. Democracies make up 90 observations, or 42.3% of the dataset. There are 48 observations for anocracies, which make up 22.5% of the data. The remainder are autocracies, with 23 observations making up 10.8% of the dataset.
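The regime counts and percentages could also be done in pandas instead of Excel. This is a sketch on made-up scores: the bin edges encode the cutoffs described above, and dividing by the full length keeps missing scores in the denominator, which is how the percentages here are calculated.

```python
import numpy as np
import pandas as pd

# Made-up scores, including two missing values.
scores = pd.Series([10, 8, 6, 5, 0, -5, -6, -10, np.nan, np.nan], name="polityscore")

# Right-closed bins: (-11, -6] autocracy, (-6, 5] anocracy, (5, 10] democracy.
regime = pd.cut(
    scores,
    bins=[-11, -6, 5, 10],
    labels=["autocracy", "anocracy", "democracy"],
)

counts = regime.value_counts()
n_missing = int(regime.isna().sum())

# Shares of the whole dataset, missing scores included in the denominator.
shares = counts / len(scores)

print(counts)
print(n_missing)
print(shares)
```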
The other main variable of interest is life expectancy, which is a continuous and positive variable. This was a huge pivot table in Excel, with values between 47.94 and 83.94 years. Out of the 213 observations, 22 had no life expectancy value. Since very few values occur more than once, each non-missing value accounts for less than 1% of observations.
Combining both, as seen in the pivot below (which should also be in the output above), we see that the average life expectancy by polity score ranges from 53.7 years (polity score -1) to 78.8 years (polity score 10). Considering the range of values, it is not explicitly clear whether democracies have longer life expectancies.
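The pivot of average life expectancy by polity score has a short pandas equivalent using groupby. The numbers below are invented just to show the shape of the calculation.

```python
import numpy as np
import pandas as pd

# Invented rows; one has no life expectancy value.
data = pd.DataFrame({
    "polityscore": [10, 10, -1, -1, 6],
    "lifeexpectancy": [80.0, 78.0, 54.0, 52.0, np.nan],
})

# Drop rows without life expectancy, then average within each polity score.
valid = data.dropna(subset=["lifeexpectancy"])
mean_by_score = valid.groupby("polityscore")["lifeexpectancy"].mean()

# Polity scores with the lowest and highest average life expectancy.
lowest = mean_by_score.idxmin()
highest = mean_by_score.idxmax()

print(mean_by_score)
print(lowest, highest)
```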
Another variable we are looking at is HIV rate, defined as the percentage of the population aged 15 to 49 living with HIV. There are 66 observations out of 213 with no value. The remaining 69% of the dataset ranges from 0.06% to 25.9% of a country’s population living with HIV.
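A quick way to get the missing count, the share of observations with a value, and the range without going through Excel. This is a sketch on invented values, assuming missing rates are stored as NaN.

```python
import numpy as np
import pandas as pd

# Invented HIV rates with two missing values.
hiv = pd.Series([0.06, 0.1, np.nan, 25.9, np.nan, 1.2], name="hivrate")

n_missing = int(hiv.isna().sum())       # observations with no value
share_present = hiv.notna().mean()      # fraction of observations with a value
lowest, highest = hiv.min(), hiv.max()  # min and max skip NaN by default

print(n_missing, share_present, lowest, highest)
```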
trsmit-blog · 8 years ago
Week 2 - Let’s Get Programming
Below is my code. There was some code I removed, such as the percentages for alcohol consumption. I also had some code to make subsets for anocracies (-6<x<6) and autocracies (x<-6) that was not working either. I will reach out to my comrades who know Python better to see what syntax fixes I need on those.
-----------------------
The code:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20 19:02:33 2017

@author: Tofu
"""
import pandas
import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)

# bug fix for display formats to avoid run time errors - put after loading data above
pandas.set_option('display.float_format', lambda x: '%f' % x)

# number of observations and variables
print(len(data))
print(len(data.columns))

# converting variables to numeric values
data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)

# counts and percentages of variables
print("counts of polityscore - democracy score in 2009, numeric")
c1 = data["polityscore"].value_counts(sort=False)
print(c1)

print("percentages of polityscore - democracy score in 2009, numeric")
p1 = data["polityscore"].value_counts(sort=False, normalize=True)
print(p1)

print("counts of lifeexpectancy - 2011 average life expectancy, years")
c2 = data["lifeexpectancy"].value_counts(sort=False)
print(c2)

print("percentages of lifeexpectancy - 2011 average life expectancy, years")
p2 = data["lifeexpectancy"].value_counts(sort=False, normalize=True)
print(p2)

print("counts of alcconsumption - 2008 alcohol consumption per adult, litres")
c3 = data["alcconsumption"].value_counts(sort=False)
print(c3)

print("counts of co2emissions - cumulative emissions in 2006, metric tons")
c4 = data["co2emissions"].value_counts(sort=False)
print(c4)

print("percentages of co2emissions - cumulative emissions in 2006, metric tons")
p4 = data["co2emissions"].value_counts(sort=False, normalize=True)
print(p4)

print("counts of relectricperperson - residential electricity used per person in 2008, kWh")
c5 = data["relectricperperson"].value_counts(sort=False)
print(c5)

print("percentages of relectricperperson - residential electricity used per person in 2008, kWh")
p5 = data["relectricperperson"].value_counts(sort=False, normalize=True)
print(p5)

print("counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV")
c6 = data["hivrate"].value_counts(sort=False)
print(c6)

print("percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV")
p6 = data["hivrate"].value_counts(sort=False, normalize=True)
print(p6)

# frequency distribution using group by
ct1 = data.groupby("polityscore").size()
print(ct1)

# making a subset of democracies
sub1 = data[(data["polityscore"] >= 6)]

# make a copy of my democracy subset
sub2 = sub1.copy()

# frequency distributions of different governments
print("counts for democracy - polity score 6 or greater")
c7 = sub2["polityscore"].value_counts(sort=False)
print(c7)

print("percentages of democracy - polity score 6 or greater")
p7 = sub2["polityscore"].value_counts(sort=False, normalize=True)
print(p7)

print("counts of life expectancy in democracies")
c8 = sub2["lifeexpectancy"].value_counts(sort=False, dropna=False)
print(c8)

print("percentages of life expectancy in democracies")
p8 = sub2["lifeexpectancy"].value_counts(sort=False, normalize=True)
print(p8)
--------------------
the output (something weird happened to my output, so I had to rerun an initial subset of my code to get output)
counts of polityscore - democracy score in 2009, numeric 0.000000       6 9.000000      15 2.000000       3 -2.000000      5 8.000000      19 5.000000       7 10.000000     33 -7.000000     12 7.000000      13 3.000000       2 6.000000      10 -4.000000      6 -1.000000      4 -3.000000      6 -5.000000      2 1.000000       3 -6.000000      3 -9.000000      4 4.000000       4 -8.000000      2 -10.000000     2 Name: polityscore, dtype: int64 percentages of polityscore - democracy score in 2009, numeric 0.000000     0.037267 9.000000     0.093168 2.000000     0.018634 -2.000000    0.031056 8.000000     0.118012 5.000000     0.043478 10.000000    0.204969 -7.000000    0.074534 7.000000     0.080745 3.000000     0.012422 6.000000     0.062112 -4.000000    0.037267 -1.000000    0.024845 -3.000000    0.037267 -5.000000    0.012422 1.000000     0.018634 -6.000000    0.018634 -9.000000    0.024845 4.000000     0.024845 -8.000000    0.012422 -10.000000   0.012422 Name: polityscore, dtype: float64 counts of lifeexpectancy - 2011 average life expectancy, years 63.125000    1 79.591000    1 74.576000    1 62.475000    1 74.414000    1 79.977000    1 58.199000    1 75.670000    1 81.012000    1 72.283000    1 55.442000    1 81.855000    1 48.398000    1 68.944000    1 75.133000    1 76.126000    1 69.317000    1 65.193000    1 75.057000    1 77.685000    1 68.498000    1 62.465000    1 79.634000    1 73.911000    1 80.499000    1 61.597000    1 79.341000    1 71.017000    1 82.759000    1 68.978000    1            .. 
76.954000    1 73.703000    1 79.839000    1 48.718000    1 71.172000    1 73.456000    1 48.397000    1 81.439000    1 75.246000    1 55.377000    1 74.788000    1 74.402000    1 82.338000    1 79.499000    1 81.539000    1 54.210000    1 67.017000    1 61.452000    1 73.373000    1 73.127000    1 69.245000    1 68.795000    1 72.832000    1 76.918000    1 57.937000    1 73.126000    1 64.666000    1 75.956000    1 57.379000    1 50.239000    1 Name: lifeexpectancy, Length: 189, dtype: int64 percentages of lifeexpectancy - 2011 average life expectancy, years 63.125000   0.005236 79.591000   0.005236 74.576000   0.005236 62.475000   0.005236 74.414000   0.005236 79.977000   0.005236 58.199000   0.005236 75.670000   0.005236 81.012000   0.005236 72.283000   0.005236 55.442000   0.005236 81.855000   0.005236 48.398000   0.005236 68.944000   0.005236 75.133000   0.005236 76.126000   0.005236 69.317000   0.005236 65.193000   0.005236 75.057000   0.005236 77.685000   0.005236 68.498000   0.005236 62.465000   0.005236 79.634000   0.005236 73.911000   0.005236 80.499000   0.005236 61.597000   0.005236 79.341000   0.005236 71.017000   0.005236 82.759000   0.005236 68.978000   0.005236
76.954000   0.005236 73.703000   0.005236 79.839000   0.005236 48.718000   0.005236 71.172000   0.005236 73.456000   0.005236 48.397000   0.005236 81.439000   0.005236 75.246000   0.005236 55.377000   0.005236 74.788000   0.005236 74.402000   0.005236 82.338000   0.005236 79.499000   0.005236 81.539000   0.005236 54.210000   0.005236 67.017000   0.005236 61.452000   0.005236 73.373000   0.005236 73.127000   0.005236 69.245000   0.005236 68.795000   0.005236 72.832000   0.005236 76.918000   0.005236 57.937000   0.005236 73.126000   0.005236 64.666000   0.005236 75.956000   0.005236 57.379000   0.005236 50.239000   0.005236 Name: lifeexpectancy, Length: 189, dtype: float64 counts of alcconsumption - 2008 alcohol consumption per adult, litres 15.000000    1 5.250000     1 3.990000     1 9.750000     1 0.500000     1 9.500000     1 6.560000     1 5.000000     1 4.990000     1 4.430000     1 11.010000    1 5.120000     1 7.790000     1 1.870000     1 5.920000     2 0.920000     1 3.020000     1 6.990000     1 12.050000    1 12.020000    1 3.610000     1 12.480000    1 0.280000     1 8.680000     1 0.520000     1 13.310000    1 11.410000    1 0.340000     2 9.720000     1 4.390000     1            .. 0.560000     1 7.300000     1 1.320000     1 6.420000     1 3.880000     1 10.620000    1 9.860000     1 8.550000     1 0.650000     1 10.710000    1 12.840000    1 1.290000     1 3.390000     2 10.080000    1 2.270000     1 9.460000     1 8.170000     1 1.030000     1 5.050000     1 6.660000     1 3.110000     1 7.320000     1 2.760000     1 1.640000     1 0.050000     1 16.300000    1 5.210000     1 0.320000     1 9.480000     1 8.690000     1
1718339333.333330     1 2251333.333333        1 72524250333.333298    1 248358000.000000      1 2329308666.666670     1 2401666.666667        1                     .. 21351000.000000       1 2335666.666667        1 13304503666.666700    1 1414031666.666670     1 95256333.333333       1 81191000.000000       1 511107666.666667      1 30800000.000000       1 7315000.000000        1 28490000.000000       1 1839471333.333330     1 127108666.666667      1 3157700333.333330     1 78943333.333333       1 236419333.333333      1 132025666.666667      1 1146277000.000000     1 1436893333.333330     1 5214000.000000        1 3503877666.666670     1 7813666.666667        1 33341634333.333302    1 4814333.333333        1 8231666.666667        1 7601000.000000        1 20152000.000000       1 149904333.333333      1 7861553333.333330     1 322960000.000000      1 35717000.000000       1 Name: co2emissions, Length: 200, dtype: int64 percentages of co2emissions - cumulative emissions in 2006, metric tons 4286590000.000000    0.005000 8092333.333333       0.005000 1045000.000000       0.005000 23404568000.000000   0.005000 5872119000.000000    0.005000 1548044666.666670    0.005000 9155666.666667       0.005000 277170666.666667     0.005000 29758666.666667      0.005000 119958666.666667     0.005000 850666.666667        0.005000 148470666.666667     0.005000 590219666.666666     0.005000 4200940333.333330    0.005000 340090666.666667     0.005000 1286670000.000000    0.005000 14058000.000000      0.005000 41229554666.666702   0.005000 598774000.000000     0.005000 377303666.666667     0.005000 7355333.333333       0.005000 26209333.333333      0.005000 2008116000.000000    0.005000 446365333.333333     0.005000 1718339333.333330    0.005000 2251333.333333       0.005000 72524250333.333298   0.005000 248358000.000000     0.005000 2329308666.666670    0.005000 2401666.666667       0.005000
21351000.000000      0.005000 2335666.666667       0.005000 13304503666.666700   0.005000 1414031666.666670    0.005000 95256333.333333      0.005000 81191000.000000      0.005000 511107666.666667     0.005000 30800000.000000      0.005000 7315000.000000       0.005000 28490000.000000      0.005000 1839471333.333330    0.005000 127108666.666667     0.005000 3157700333.333330    0.005000 78943333.333333      0.005000 236419333.333333     0.005000 132025666.666667     0.005000 1146277000.000000    0.005000 1436893333.333330    0.005000 5214000.000000       0.005000 3503877666.666670    0.005000 7813666.666667       0.005000 33341634333.333302   0.005000 4814333.333333       0.005000 8231666.666667       0.005000 7601000.000000       0.005000 20152000.000000      0.005000 149904333.333333     0.005000 7861553333.333330    0.005000 322960000.000000     0.005000 35717000.000000      0.005000 Name: co2emissions, Length: 200, dtype: float64 counts of relectricperperson - residential electricity used per person in 2008, kWh 0.000000       5 1920.962215    1 2826.044873    1 55.794744      1 2124.608816    1 528.648051     1 2993.092660    1 187.324882     1 1494.410268    1 15.056236      1 528.787350     1 825.941111     1 314.826200     1 1585.174739    1 1490.056909    1 186.925515     1 368.434606     1 1884.299342    1 815.031091     1 70.387444      1 59.551245      1 1933.945615    1 767.970324     1 913.845660     1 31.544564      1 2123.762863    1 51.581320      1 753.209802     1 921.562111     1 4036.953993    1              .. 
304.940115     1 209.094517     1 41.180003      1 920.137600     1 1831.731848    1 1690.718434    1 168.623031     1 768.428300     1 614.907287     1 4759.453844    1 38.634503      1 1411.230532    1 532.515177     1 1142.309009    1 2261.316713    1 20.288131      1 256.099151     1 404.591365     1 590.509814     1 325.839561     1 3433.932449    1 636.341383     1 38.005637      1 31.386838      1 537.104738     1 7432.130852    1 351.166594     1 97.246492      1 9.192395       1 1259.392457    1 Name: relectricperperson, Length: 132, dtype: int64 percentages of relectricperperson - residential electricity used per person in 2008, kWh 0.000000      0.036765 1920.962215   0.007353 2826.044873   0.007353 55.794744     0.007353 2124.608816   0.007353 528.648051    0.007353 2993.092660   0.007353 187.324882    0.007353 1494.410268   0.007353 15.056236     0.007353 528.787350    0.007353 825.941111    0.007353 314.826200    0.007353 1585.174739   0.007353 1490.056909   0.007353 186.925515    0.007353 368.434606    0.007353 1884.299342   0.007353 815.031091    0.007353 70.387444     0.007353 59.551245     0.007353 1933.945615   0.007353 767.970324    0.007353 913.845660    0.007353 31.544564     0.007353 2123.762863   0.007353 51.581320     0.007353 753.209802    0.007353 921.562111    0.007353 4036.953993   0.007353
304.940115    0.007353 209.094517    0.007353 41.180003     0.007353 920.137600    0.007353 1831.731848   0.007353 1690.718434   0.007353 168.623031    0.007353 768.428300    0.007353 614.907287    0.007353 4759.453844   0.007353 38.634503     0.007353 1411.230532   0.007353 532.515177    0.007353 1142.309009   0.007353 2261.316713   0.007353 20.288131     0.007353 256.099151    0.007353 404.591365    0.007353 590.509814    0.007353 325.839561    0.007353 3433.932449   0.007353 636.341383    0.007353 38.005637     0.007353 31.386838     0.007353 537.104738    0.007353 7432.130852   0.007353 351.166594    0.007353 97.246492     0.007353 9.192395      0.007353 1259.392457   0.007353 Name: relectricperperson, Length: 132, dtype: float64 counts of hivrate -2009 estimated % of peeople aged 15 to 49 living with HIV 2.000000      2 0.500000      5 2.500000      2 5.000000      1 1.500000      2 11.000000     1 1.300000      2 1.000000      4 11.500000     1 6.500000      1 13.500000     1 3.600000      1 17.800000     1 3.200000      1 14.300000     1 1.400000      1 0.100000     28 2.300000      1 3.300000      1 1.900000      1 0.700000      3 25.900000     1 5.600000      1 0.200000     15 0.400000      9 0.060000     16 0.800000      5 0.300000     10 3.100000      1 1.200000      4 5.300000      1 24.800000     1 3.400000      3 1.700000      1 23.600000     1 6.300000      1 0.450000      1 0.600000      3 13.100000     1 4.700000      1 2.900000      1 1.600000      1 0.900000      4 5.200000      1 1.100000      2 1.800000      1 Name: hivrate, dtype: int64 percentages of hivrate - 2009 estimated % of people aged 15 to 49 living with HIV 2.000000    0.013605 0.500000    0.034014 2.500000    0.013605 5.000000    0.006803 1.500000    0.013605 11.000000   0.006803 1.300000    0.013605 1.000000    0.027211 11.500000   0.006803 6.500000    0.006803 13.500000   0.006803 3.600000    0.006803 17.800000   0.006803 3.200000    0.006803 14.300000   0.006803 1.400000    
0.006803 0.100000    0.190476 2.300000    0.006803 3.300000    0.006803 1.900000    0.006803 0.700000    0.020408 25.900000   0.006803 5.600000    0.006803 0.200000    0.102041 0.400000    0.061224 0.060000    0.108844 0.800000    0.034014 0.300000    0.068027 3.100000    0.006803 1.200000    0.027211 5.300000    0.006803 24.800000   0.006803 3.400000    0.020408 1.700000    0.006803 23.600000   0.006803 6.300000    0.006803 0.450000    0.006803 0.600000    0.020408 13.100000   0.006803 4.700000    0.006803 2.900000    0.006803 1.600000    0.006803 0.900000    0.027211 5.200000    0.006803 1.100000    0.013605 1.800000    0.006803 Name: hivrate, dtype: float64 polityscore -10.000000     2 -9.000000      4 -8.000000      2 -7.000000     12 -6.000000      3 -5.000000      2 -4.000000      6 -3.000000      6 -2.000000      5 -1.000000      4 0.000000       6 1.000000       3 2.000000       3 3.000000       2 4.000000       4 5.000000       7 6.000000      10 7.000000      13 8.000000      19 9.000000      15 10.000000     33 dtype: int64 counts for democracy - polity score 6 or greater 9.000000     15 8.000000     19 10.000000    33 7.000000     13 6.000000     10 Name: polityscore, dtype: int64 percentages of democracy - polity score 6 or greater 9.000000    0.166667 8.000000    0.211111 10.000000   0.366667 7.000000    0.144444 6.000000    0.111111 Name: polityscore, dtype: float64 counts of life expectancy in democracies nan          1 79.591000    1 77.005000    1 79.915000    1 48.196000    1 80.642000    1 74.414000    1 79.977000    1 74.847000    1 81.012000    1 79.499000    1 81.855000    1 80.654000    1 80.734000    1 69.317000    1 80.557000    1 79.120000    1 68.498000    1 66.618000    1 75.446000    1 81.097000    1 81.618000    1 79.341000    1 74.825000    1 67.852000    1 62.465000    1 73.737000    1 73.488000    1 74.221000    1 81.907000    1            .. 
73.703000    1 73.339000    1 71.172000    1 81.439000    1 74.044000    1 76.126000    1 68.494000    1 73.371000    1 72.640000    1 80.170000    1 78.531000    1 57.134000    1 51.444000    1 81.539000    1 68.795000    1 74.522000    1 81.404000    1 73.373000    1 73.127000    1 83.394000    1 72.231000    1 54.210000    1 64.228000    1 76.918000    1 68.749000    1 73.126000    1 80.414000    1 65.438000    1 80.854000    1 73.396000    1 Name: lifeexpectancy, Length: 89, dtype: int64 percentages of life expectancy in democracies 79.591000   0.011236 77.005000   0.011236 79.915000   0.011236 48.196000   0.011236 80.642000   0.011236 74.414000   0.011236 79.977000   0.011236 74.847000   0.011236 81.012000   0.011236 79.499000   0.011236 81.855000   0.011236 80.654000   0.011236 80.734000   0.011236 69.317000   0.011236 80.557000   0.011236 79.120000   0.011236 68.498000   0.011236 66.618000   0.011236 75.446000   0.011236 81.097000   0.011236 81.618000   0.011236 79.341000   0.011236 74.825000   0.011236 67.852000   0.011236 62.465000   0.011236 73.737000   0.011236 73.488000   0.011236 74.221000   0.011236 81.907000   0.011236 69.366000   0.011236
73.703000   0.011236 73.339000   0.011236 71.172000   0.011236 81.439000   0.011236 74.044000   0.011236 76.126000   0.011236 68.494000   0.011236 73.371000   0.011236 72.640000   0.011236 80.170000   0.011236 78.531000   0.011236 57.134000   0.011236 51.444000   0.011236 81.539000   0.011236 68.795000   0.011236 74.522000   0.011236 81.404000   0.011236 73.373000   0.011236 73.127000   0.011236 83.394000   0.011236 72.231000   0.011236 54.210000   0.011236 64.228000   0.011236 76.918000   0.011236 68.749000   0.011236 73.126000   0.011236 80.414000   0.011236 65.438000   0.011236 80.854000   0.011236 73.396000   0.011236 Name: lifeexpectancy, Length: 88, dtype: float64

C:/Users/Tofu/Python Project Folder/Week 2_amended.py:20: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['polityscore'] = data['polityscore'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 2_amended.py:21: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['lifeexpectancy'] = data['lifeexpectancy'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 2_amended.py:22: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['alcconsumption'] = data['alcconsumption'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 2_amended.py:23: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['co2emissions'] = data['co2emissions'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 2_amended.py:24: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['relectricperperson'] = data['relectricperperson'].convert_objects(convert_numeric=True)
C:/Users/Tofu/Python Project Folder/Week 2_amended.py:25: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  data['hivrate'] = data['hivrate'].convert_objects(convert_numeric=True)
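The FutureWarning messages point to `pd.to_numeric` as the replacement for `convert_objects`. A minimal sketch of what that swap might look like is below. The small DataFrame is a hypothetical stand-in for my GapMinder data (only the column names are real), and `errors="coerce"` is my assumption for how to get blanks turned into NaN, which is what I understand `convert_numeric=True` did.

```python
import pandas as pd

# Hypothetical stand-in for the real data: columns read in as strings,
# with blanks where a country has no value.
data = pd.DataFrame({
    "polityscore": ["10", "-7", " ", "6"],
    "lifeexpectancy": ["79.591", "48.398", "74.414", ""],
})

numeric_columns = ["polityscore", "lifeexpectancy"]
for column in numeric_columns:
    # errors="coerce" turns anything that cannot be parsed as a number into NaN
    data[column] = pd.to_numeric(data[column], errors="coerce")

print(data.dtypes)
print(data)
```

The same loop could cover the other variables (alcconsumption, co2emissions, relectricperperson, hivrate) by adding their names to the list.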
trsmit-blog · 8 years ago
Data Management and Visualization Week 1 -