freshcolortidalwave-blog - Tumblr blog

freshcolortidalwave-blog · 8 years ago


Regression Modeling - Week 4

Our hypothesis is that there is a relationship between the armed forces rate of a country and its polity score. Polity score is a measure of how autocratic or democratic a country is, with scores ranging from -10 (autocratic) to +10 (democratic).

We bin the polity score into two groups, countries with polity scores below zero and countries with scores of zero or greater, so that a logistic regression can be performed on the resulting binary variable, called polity grouping. We will examine whether the armed forces rate is a significant contributor to the polity grouping and will look for confounding variables in the data set (Gapminder). Specifically, we will examine whether income per person, urban rate, residential electric rate, or employment rate are confounding factors or otherwise have a significant association with the polity grouping. All explanatory variables were centered at zero for the analysis.

First, we ran a logistic regression with the centered armed forces rate as the explanatory variable with the following results:

We can see that the centered armed forces rate (armed_c) had a significant association (OR = 0.67, 95% CI = 0.512 to 0.866, p = 0.0024) with the polity grouping. Since the odds ratio is less than 1, a polity grouping of zero or greater becomes less likely as the armed forces rate increases.
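As a rough illustration of how to read this odds ratio, the sketch below converts the reported OR and its confidence limits into a percent change in the odds and back-computes the logistic coefficient as ln(OR). The helper name is hypothetical, and this is plain arithmetic on the numbers quoted above, not a reproduction of the SAS output.

```python
import math

def odds_ratio_summary(odds_ratio, ci_low, ci_high):
    """Summarise an odds ratio from a logistic regression (illustrative helper)."""
    return {
        # The logistic regression coefficient is the natural log of the odds ratio.
        "coefficient": math.log(odds_ratio),
        # Percent change in the odds of the outcome per one-unit increase.
        "pct_change_in_odds": (odds_ratio - 1.0) * 100.0,
        # A 95% CI that excludes 1 corresponds to significance at the 0.05 level.
        "ci_excludes_one": not (ci_low <= 1.0 <= ci_high),
    }

# Values reported for armed_c in the paragraph above.
summary = odds_ratio_summary(0.67, 0.512, 0.866)
print(summary)
```

With these values, the odds of a polity grouping of zero or greater fall by roughly 33% for each one-unit increase in the armed forces rate, and the confidence interval excludes 1.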

Next, we will add the centered income per person (income_c) to determine if there is a confounding effect or whether it can improve our model.

Given the odds ratio, 95% confidence interval, and p-value (OR = 1, 95% CI = 1 to 1, p = 0.8021), we can conclude that income per person is not statistically significant and does not have a confounding impact.

Next, we will add the centered residential electric per person (electric_c):

Again, given the odds ratio, 95% confidence interval, and p-value (OR = 1, 95% CI = 1 to 1, p = 0.3424), we can conclude that the residential electric rate is not statistically significant and does not have a confounding impact.

We will next examine the impact of the centered urban rate (urban_c) on the polity grouping:

Again, given the odds ratio, 95% confidence interval, and p-value (OR = 1.002, 95% CI = 0.982 to 1.023, p = 0.8461), we can conclude that the urban rate is not statistically significant and does not have a confounding impact.

Finally, we will examine the impact of the centered employment rate (employ_c) on the polity grouping:

Again, given the odds ratio, 95% confidence interval, and p-value (OR = 0.974, 95% CI = 0.929 to 1.020, p = 0.2629), we can conclude that the employment rate is not statistically significant and does not have a confounding impact.

In summary, the original logistic model supports our hypothesis that there is a significant relationship between the armed forces rate and the polity score grouping. We tested income per person, the residential electric rate, the urban rate, and the employment rate; none of these showed a significant association with the polity score grouping, and none was shown to be a confounding variable.

The full output of the armed forces rate model follows for reference:


Regression Modeling - Week 3

This week we’ll use multiple regression to examine whether there is a relationship between the explanatory variables urban rate, residential electric per person, the armed forces rate, the employment rate, and the response variable income per person. Our hypothesis is that urban rate is the primary explanatory variable. All explanatory variables were centered at zero prior to analysis.

Our initial guess, based on the plot, is that there is a curvilinear relationship between urban rate and income per person. Using urban rate and urban rate squared as the explanatory variables yields the following model:

The R-squared value of 0.378 implies that urban rate and urban rate squared account for 38% of the variability in income per person. Both terms have p-values less than 0.05, indicating they are statistically significant. Taking the beta values from the table, this implies the equation for the model is:

income per person = 350.8 * urban rate + 5.2 * (urban rate * urban rate)
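To make the fitted curve concrete, the sketch below evaluates the two urban rate terms of the equation above for a few values of the centered urban rate. The post does not report an intercept, so the `intercept` parameter (defaulting to zero) is an assumption added for illustration, as is the function name.

```python
def income_from_urban(urban_c, intercept=0.0):
    """Evaluate the quadratic model for a centered urban rate.

    The coefficients 350.8 and 5.2 come from the equation above. The
    intercept is not reported in the post, so it defaults to 0.0 here
    as a placeholder assumption.
    """
    return intercept + 350.8 * urban_c + 5.2 * urban_c ** 2

# Centered urban rates: 0 is the sample mean, positive values are above it.
for urban_c in (-20.0, 0.0, 20.0):
    print(urban_c, income_from_urban(urban_c))
```

Because the squared term is positive, the curve bends upward: being 20 points above the mean urban rate adds more to the predicted income than being 20 points below the mean subtracts.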

Adding the employment rate to the model shows that it is not significant, given its p-value of greater than 0.05.

Adding the armed forces rate was significant, given its p-value of less than 0.05.

Finally, adding the residential electric rate was also significant, with a p-value of less than 0.05 and an R-squared of 0.57, indicating that the model explains 57% of the variance in income per person. The step-by-step addition of variables shows no confounding. An R-squared of 0.57 supports our hypothesis that urban rate has a significant impact on income per person.

As we can see from the Q-Q plot, the deviation from the straight line indicates there may be other explanatory variables that need to be included to explain the curvilinear fit.

From the standardized residual plot we can see that most countries are within -2 to +2 standard deviations of the mean, where we would expect 95% of the observations if the residuals were normally distributed. However, there are several observations greater than 2 and a few greater than 3, indicating extreme outliers. Given that there are 127 observations, we would expect fewer than 7 observations (5% of 127 is about 6.35) to be outside the range -2 to +2. There are 7 observations outside this range, slightly more than expected, suggesting the model is a poor fit and that there may be other explanatory variables.
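The expected counts can be checked with a short calculation. The sketch below, using only the Python standard library, computes how many of 127 normally distributed residuals we would expect beyond ±2 and ±3 standard deviations. The function names are ours, and the calculation assumes exactly normal residuals.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_outside(n_obs, z):
    """Expected number of observations beyond +/- z standard deviations."""
    tail_probability = 2.0 * (1.0 - normal_cdf(z))
    return n_obs * tail_probability

n_obs = 127  # number of observations reported above
print(round(expected_outside(n_obs, 2.0), 2))  # beyond +/-2 SD
print(round(expected_outside(n_obs, 3.0), 2))  # beyond +/-3 SD
print(0.05 * n_obs)                            # the 95% rule of thumb
```

Under exact normality we would expect about 6 observations beyond ±2 and well under 1 beyond ±3, so the handful of residuals greater than 3 is the stronger sign of trouble.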

Finally, the leverage plot indicates that there are many outlying observations and many observations with high leverage (the leverage scale is 0 to 1), which have a strong influence on the estimates of the regression parameters. This is an indication of poor fit.


Machine Learning - Week 4

This week we’re examining clustering in the gapminder dataset. We selected the following variables to include in our cluster analysis: internetuserate, polityscore, armedforcesrate, incomeperperson, lifeexpectancy, relectricperperson, and urbanrate. Since only 122 observations remain when missing data is excluded, we did not separate the dataset into training and test sets.

Note: The code (details at the bottom of post) produces many output tables. Only the relevant portions of the output were included here.

From the elbow curve for 1 to 9 clusters, we can see bends in the line at 2, 3, 5, and 7 clusters, suggesting those may be appropriate numbers of clusters to group the data into.

Canonical discriminant analysis was used to reduce the number of variables down to those that accounted for the most variance.

We ran the clustering with 5 clusters and output the following graph:

All clusters show significant diffusion, indicating that it may be more appropriate to run with fewer clusters. The graph with 3 clusters is below:

This seems to be creating a more likely grouping of clusters, especially in cluster 3.

Cluster 1 has high income per person, internet use rate, life expectancy, urban rate, and residential electricity. Cluster 2 is dominated by a high armed forces rate and low polity score. Cluster 3 has low values for all variables.

The ANOVA table has a significant p-value of less than 0.05.

Finally, all clusters are different from each other at the 0.05 significance level for polity score:

Code begins here:


Machine Learning - Week 3

This week we’re using LASSO selection to find a model of internet use rate from the gapminder dataset with the appropriate explanatory variables.

10 variables were selected from the gapminder dataset as explanatory variables. There were 9 quantitative variables (alcconsumption, armedforcesrate, femaleemployrate, employrate, hivrate, incomeperperson, lifeexpectancy, oilperperson, relectricperperson, and urbanrate). Polity score, a categorical variable, was binned into two levels: polity scores less than zero, and polity scores greater than or equal to zero. All explanatory variables were standardized to a mean of zero and a standard deviation of 1.

We used least angle regression with k = 10-fold cross validation to estimate the lasso regression. The data was randomly split into a training set with 40 observations and a test set with 16 observations.

Of the 10 variables, only 3 were retained in the model selected by the lasso regression: alcohol consumption, income per person, and life expectancy. Together these produced an adjusted R-squared of 0.5186, indicating that they account for a little over half of the variance in the response variable.

The rest of the modeling output follows:


Regression - Week 2

This week we’ll be performing a basic linear regression using the income per person as the explanatory variable and the armed forces rate as the response variable.

Since our explanatory variable is quantitative, we have centered it at zero in the SAS code and called that variable income_c.

SAS generated the following means table. By looking at the mean of the centered variable income_c, we can see that the mean is approximately zero, indicating that the variable was properly centered.
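For readers without the SAS output, the same centering check can be sketched in a few lines of Python. The income values below are hypothetical and the `center` helper is ours. It only illustrates why the mean of a centered variable comes out at (approximately) zero.

```python
from statistics import mean

def center(values):
    """Subtract the sample mean so the centered variable has mean zero."""
    avg = mean(values)
    return [v - avg for v in values]

# Hypothetical income-per-person values, for illustration only.
income = [1200.0, 5400.0, 830.0, 27000.0, 9100.0]
income_c = center(income)

# The mean of the centered variable should be zero, up to rounding error.
print(mean(income_c))
```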

SAS also produced the following statistical output:

The R-Square statistic is very small (0.000954), indicating that the income per person in a country explains only a tiny fraction of the variability in the armed forces rate.

The F statistic is tiny (0.14) for this model and the p-value is large (0.7085), so we cannot conclude that there is a relationship between income per person and the armed forces rate in a country.


Machine Learning - Week 2

We used SAS to improve on a single decision tree by generating a random forest. In this case, we are modeling the polity score as the response variable (grouped into a variable called polity grouping, where scores greater than 0 represent increasingly democratic countries and scores less than zero represent increasingly autocratic countries).

We’ve included the armed forces rate (armedforcesrate), the urban rate (urbanrate), the employment rate (employrate), oil consumption per capita (oilperperson), the internet users per 100 people (internetuserate), the residential electric consumption per capita (relectricperperson), and the GDP per capita (incomeperperson) as possible explanatory variables.

The code to create the random forest follows:

This produced the following output:

Note, SAS produces a table with 100 rows as it gradually refines the tree. We’ve included the first and last 10 rows, similar to the video lecture.

We can see that at the beginning, the out-of-bag misclassification rate was 0.395, indicating that the model was predicting correctly only about 60% of the time. This gradually decreased, and by the final run, the out-of-bag misclassification rate was down to 0.31.

In terms of which variables are important, we can see from the last table that armedforcesrate has the highest OOB (out-of-bag) Gini score, followed by oilperperson and employrate.


Machine Learning - Week 1

We ran the following code to create a decision tree with the goal of developing a model to predict a polity score of less than zero.

The code output the following model:

The code also output the following results:

From the original gapminder data, we selected the following explanatory variables: employrate, armedforcesrate, urbanrate, incomeperperson, relectricperperson, oilperperson, and internetuserate. We grouped these into binary classification variables in an attempt to model the polity score.

The output of the model focuses on the groupings for the employrate, armedforcesrate, and oilperperson.

From the color coding of the classification tree, we can see that the left side nodes relate to polity scores greater than or equal to zero (i.e. more democratic) and the right nodes relate to polity scores less than zero (i.e. more autocratic).

By subtracting the error rate from 1, we can see how successful the model is in predicting the polity score. Given the results, the model successfully predicted a polity score of less than zero approximately 48% of the time (1 - 0.5204), and a polity score of greater than or equal to zero approximately 83% of the time (1 - 0.1739). Overall, the model correctly classified 67% of the sample ((47 + 95) / (47 + 51 + 20 + 95)).
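The arithmetic above can be reproduced from the four counts in the classification table. In the sketch below, the assignment of 47, 51, 20 and 95 to the cells is inferred from the reported error rates (51 / 98 ≈ 0.5204 and 20 / 115 ≈ 0.1739), and the function and key names are hypothetical.

```python
def classification_rates(auto_correct, auto_wrong, demo_wrong, demo_correct):
    """Per-class and overall accuracy from a 2x2 classification table.

    "auto" = polity score below zero, "demo" = polity score of zero or
    greater (labels chosen for this illustration).
    """
    auto_total = auto_correct + auto_wrong
    demo_total = demo_correct + demo_wrong
    return {
        "autocratic_accuracy": auto_correct / auto_total,
        "democratic_accuracy": demo_correct / demo_total,
        "overall_accuracy": (auto_correct + demo_correct) / (auto_total + demo_total),
    }

# Counts taken from the overall-accuracy expression above.
rates = classification_rates(47, 51, 20, 95)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Note the parentheses around the numerator: without them, 47 + 95 / 213 would be evaluated as 47 plus a fraction rather than as a proportion.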


Introduction to Regression - Week 1

Background

I have been using the Gapminder data to examine whether the polity score (a measure of how democratic or autocratic a country is) is associated with the armed forces rate.

Step 1 - Sample

The subset of Gapminder data provided for this course consists of a sample of 213* countries and areas (for example, Antigua and Barbuda are considered one area) containing data for health, political and economic information. The study reports findings at the aggregate country level. The sample we examined included all countries that did not have missing data.

Step 2 - Data Collection

Data reporting by countries or other bodies was used to compile the Gapminder data. The purpose of Gapminder is to aggregate data across sources by country for a variety of health, political and economic indicators. Generally, the data is compiled from the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank.

In the case of the particular variables we are examining, the polity score is sourced from the Polity IV Project, and the armed forces rate is from the WDI (World Development Indicators**). The polity score data is from 2009 and the codebook does not state when the armed forces rate data is from. Given the amount of annual data from the World Development Indicators, I cannot speculate on the time frame of the armed forces rate provided in the Coursera Gapminder package.

The polity score is determined by analysts examining 6 specific components of executive recruitment, constraints on executive authority, and political competition. The armed forces rate is compiled from “officially recognized international sources”*** by the World Bank.

Step 3 - Measurements

The polity score (explanatory) is represented by a number from -10 to +10, calculated by subtracting a country’s autocracy score from its democracy score, and summarizes how democratic and free a country is.

The armed forces rate (response) represents the armed forces personnel as a percentage of the total labor force and ranges from 0% to 10.64%.

The primary way I’ve managed the variables is to remove the missing data. Also, depending on the week in question and what type of analysis was asked for, the polity score may have been binned into a smaller number of groupings (for example -10 to -5 may have been grouped together), or the armed forces rate may have been binned to create a categorical variable rather than a quantitative variable for use in Chi-squared tests.

Notes

*The codebook states there are 215 countries and areas; however, looking at the data in SAS, there are only 213.

**The codebook included in Coursera has a typo: it is the World Development Indicators, not Work Development Indicators.

***https://data.worldbank.org/products/wdi


Data Analysis Tools - Week 4

The purpose of this post is to determine whether the urban rate acts as a moderator of the relationship between polity score (response) and armed forces rate (explanatory). We grouped the polity score into two groups, countries with polity scores < 0 and countries with polity scores >= 0 and have labeled it polity grouping. Similarly we’ve grouped the urban rate into countries with urban rates < 60 and those with urban rates >= 60 and called it the urban grouping.
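The two groupings described above can be sketched as simple threshold functions. The cut points (0 for polity score, 60 for urban rate) come from the text. Coding the lower group as 0 and the upper group as 1 is our assumption, and the sample countries are hypothetical.

```python
def polity_grouping(polity_score):
    """0 for polity scores below zero, 1 for zero or greater (assumed coding)."""
    return 0 if polity_score < 0 else 1

def urban_grouping(urban_rate):
    """0 for urban rates below 60, 1 for 60 or greater (assumed coding)."""
    return 0 if urban_rate < 60 else 1

# Hypothetical (polity score, urban rate) pairs, for illustration only.
countries = [(-7, 35.2), (0, 60.0), (10, 82.4), (-1, 71.9)]
groups = [(polity_grouping(p), urban_grouping(u)) for p, u in countries]
print(groups)
```

The boundary values matter here: a polity score of exactly 0 and an urban rate of exactly 60 both fall in the upper group.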

The following code will create two ANOVA tables to interpret based on whether the urban rate is >= 60 or not:

It produces the following results:

We can see that both urban groupings are significant: the p-value is 0.0096 when the urban grouping equals 0 and 0.0003 when it equals 1, both of which are less than 0.05. In both subgroups of the urban rate, the mean armed forces rate is larger for a polity grouping of zero. Because the direction of the association is the same in both subgroups, the urban grouping does not act as a moderator of the relationship between the polity grouping and the armed forces rate.


Data Analysis Tools - Week 3

For the purposes of using the Pearson Correlation Coefficient, we’ll examine if there is a correlation between the armed forces rate and the employment rate using the gapminder data.

The code follows:

We can see from the scatter plot what may be a weak negative correlation between the employment rate and the armed forces rate.

Looking at the Pearson Correlation Coefficient, we can see a weak negative correlation (r = -0.27486) with a p-value of 0.004, which is less than the threshold of 0.05.
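For reference, the coefficient itself is straightforward to compute by hand. The sketch below implements the standard Pearson formula in plain Python on a small, hypothetical sample (not the gapminder data), then squares the r reported above to show how little of the variability it accounts for. The function name is ours.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical employment rates and armed forces rates, for illustration only.
employ = [45.0, 52.5, 58.0, 61.2, 66.7, 73.1]
armed = [3.1, 0.9, 1.8, 1.2, 0.6, 1.0]
print(pearson_r(employ, armed))

# Squaring the coefficient reported in the post gives the shared variability.
reported_r = -0.27486
print(round(reported_r ** 2, 4))
```

Squaring the reported coefficient gives about 0.076, so even though the correlation is statistically significant, the employment rate accounts for under 8% of the variability in the armed forces rate.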


Data Analysis Tools - Week 2

The purpose of this week is to examine whether the grouped armed forces rate (explanatory categorical variable) is associated with the polity score grouping (response categorical variable), using a Chi-squared test and, if that test indicates significance, post-hoc testing to determine which levels of the explanatory variable differ from one another.

Again:

H0 = Polity Score grouping is equal across all of the armed forces groupings

Ha = Polity Score grouping is not equal across all armed forces grouping

Given that there are five levels in the armed forces grouping, we apply the Bonferroni adjustment to protect against Type 1 errors: we divide the usual significance level of 0.05 by the number of pairwise comparisons. In our case, with 5 levels, “5 choose 2” results in 10 comparisons, so our significance threshold is 0.05 / 10, or 0.005.
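The adjustment is easy to verify. The sketch below counts the pairwise comparisons with `math.comb` (Python 3.8 or later) and divides the significance level by that count. The function name is hypothetical.

```python
import math

def bonferroni_threshold(alpha, n_levels):
    """Number of pairwise comparisons and the Bonferroni-adjusted threshold."""
    n_comparisons = math.comb(n_levels, 2)  # "n choose 2" pairs of levels
    return n_comparisons, alpha / n_comparisons

n_comparisons, threshold = bonferroni_threshold(0.05, 5)
print(n_comparisons, threshold)

# The pairwise p-value of 0.0007 reported later in this post is below the threshold.
print(0.0007 < threshold)
```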

We grouped the polity score into two groups, those countries with polity scores less than zero which are more autocratic, and those with polity scores greater than zero which are more democratic.

The code to produce the analysis follows:

The analysis which resulted:

Since there are more than two levels, the overall Chi-Square value and probability do not tell us on their own which groupings differ. We must use pairwise comparisons, judged against a significance threshold that has been adjusted by the Bonferroni adjustment.

The 10 pairwise tables are listed at the bottom, but the following table summarizes the p-values for the pairwise Chi-squared tests.

From the table, 0.0007 is less than our significance level of 0.005, so we can reject the null hypothesis: countries with an armed forces grouping of 0 have a significantly different polity grouping than countries with an armed forces grouping of 4.

The actual pairwise tables:


Data Analysis Tools - Week 1

The purpose of this exercise is to perform ANOVA on the gapminder dataset to determine if there is a relationship between the Polity Score (categorical explanatory variable) and the Armed Forces Rate (quantitative response variable).

Specifically,

H0 = the mean armed forces rate for countries with polity scores ≥ 0 is the same as the mean for countries with polity scores < 0

Ha = the means are not equal

We ran the following SAS code to perform ANOVA on armedforcesrate~polityscore

This resulted in the following ANOVA:

Given that the p-value of 0.0003 is less than 0.05, this ANOVA provides evidence against the null hypothesis. Since our explanatory categorical variable had only two levels, there is no further post-hoc testing needed.
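To show what the ANOVA is computing, the sketch below calculates the F statistic for a one-way ANOVA in plain Python on two small, hypothetical groups (not the gapminder data). It stops at the F statistic, since the p-value requires the F distribution, which SAS reports. The function name is ours.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical armed forces rates for the two polity groupings.
polity_below_zero = [2.0, 3.0, 4.0]
polity_zero_or_more = [1.0, 1.5, 2.0]
f_stat, df_between, df_within = one_way_anova_f(
    [polity_below_zero, polity_zero_or_more]
)
print(f_stat, df_between, df_within)
```

With only two groups, the between-groups degrees of freedom is 1, which is why no post-hoc comparison is needed: there is only one pair of means to compare.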


Data Management and Visualization - Week 4

The code for the frequency tables and graphing:

The univariate graph of polityscore:

The graph is unimodal; the largest polity grouping is “+6 to +10”, with over 50% of the countries.

The univariate graph of armed forces rate:

The graph is slightly bimodal with a significant peak of greater than 50% of countries at “0 to 1%” and a very small peak at 4%+ which is likely due to the grouping.

The univariate graph of employment rate:

The graph is unimodal with a slight skew right, but appears almost normal.

There appears to be no relationship between the armed forces rate and the polity score given the lack of obvious trend line.

There seems to be no apparent relationship between the employment rate and the polity score given the lack of obvious trend line.


Data Management and Visualization - Week 3

The code to group data:

We removed from the results any country that was missing data for any of the following variables: polityscore, armedforcesrate, and employrate.

We created three new variables called politygrouping, armedgrouping and employgrouping to group the polityscore, armedforcesrate and employrate, respectively. In the case of armedforcesrate and employrate, this was done since these are both continuous variables, and resulting tables would have rows containing one unique value for each country.

The most common politygrouping is “+6 to +10”, representing over 50% of the countries. The most common armedgrouping is “0 - 1%”, representing nearly 54% of the countries. Finally, the employgrouping is more spread out, with the most common range being “60-70%”, but representing only 38% of the countries.


Data Management and Visualization - Week 2

SAS code to create frequency tables from gapminder data:

This produced three tables. Note, the armedforcesrate and employrate are continuous variables, so the data was grouped, otherwise there would have been one unique value for each country.

There are 213 countries included in the data set; 52 countries did not have a polity score, as indicated by the missing data in the first table. Approximately 50% of the countries that responded have polity scores higher than 6 on a scale of -10 to +10.

The data set is missing data for the armed forces rate for 49 countries, as indicated by an armedgrouping of -1. Of the countries that provided data, 89 have armed forces rates between 0 and 1% of the labor force, and 75 have armed forces rates greater than 1% of the labor force.

Similarly, the data set is missing data for the employment rate for 35 countries, as indicated by the employgrouping of -1. Of the responding countries, 37 have employment rates of less than 50%, and 141 have employment rates of greater than 50%.
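As a quick consistency check on these counts, the sketch below subtracts the missing countries from the 213 in the data set and expresses a group count as a share of the countries that reported data. The function name is hypothetical and the numbers are those quoted above.

```python
def share_of_reporting(count, total_countries, missing):
    """Countries with data, and the share of them that fall in one group."""
    reporting = total_countries - missing
    return reporting, count / reporting

# Armed forces rate: 49 missing, 89 countries at 0 to 1%, 75 above 1%.
armed_reporting, armed_share = share_of_reporting(89, 213, 49)
print(armed_reporting, f"{armed_share:.1%}")

# Employment rate: 35 missing, 37 countries below 50%, 141 above 50%.
employ_reporting, employ_share = share_of_reporting(141, 213, 35)
print(employ_reporting, f"{employ_share:.1%}")
```

In both cases the group counts add up to the number of reporting countries (89 + 75 = 164 and 37 + 141 = 178), so the frequency tables are internally consistent.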


Data Management and Visualization - Week 1

After looking through the Gapminder codebook, I’ve chosen to examine if there is an association between the polity score (i.e. how democratic a country is) and the armed forces rate. For reference, a higher polity score represents a more democratic country. The two variables in the Gapminder codebook to be examined are called ‘polityscore’ and ‘armedforcesrate’.

My hypothesis is that there is a negative correlation between the ‘polityscore’ and the ‘armedforcesrate’ of a country. I would expect a more democratic country to have a lower armed forces rate.

I tried several different searches in Google Scholar including ‘polity score and armed forces’, ‘polity score and military’, ‘polity IV’, and finally ‘polity IV military’.

The specific question does not appear to be addressed; however, several papers discussed the rate of military spending and the polity score (see bottom of post for citations). The rate of military spending may be a suitable proxy for the armed forces rate.

The studies found that there is a negative correlation between democracy and military spending. However, one study acknowledged that polity may not be the strongest determinant and that the question had not been widely studied.

Military expenditures and political regimes: Evidence from global data, 1963–2000; Ünal Töngür, Sara Hsu, Adem Yavuz Elveren; Economic Modelling; Volume 44, January 2015, Pages 68–79.

Kantian Liberalism, Regime Type, and Military Resource Allocation: Do Democracies Spend Less?; Benjamin O. Fordham, Thomas C. Walker; International Studies Quarterly; Volume 49, Issue 1, 1 March 2005, Pages 141–157; https://doi.org/10.1111/j.0020-8833.2005.00338.x; Published: 10 February 2005.
