Learning and Sharing
learning-and-sharing · 7 years ago
Regression Modeling in Practice: Assignment 1
Writing about a sample from the Gapminder dataset 
Sample: The GapMinder dataset contains 192 UN member countries as observation points and compiles around 200 socio-economic indicators (mostly economic, demographic, environmental and health related) for these countries for at least the last 20 years. The organisation GapMinder has compiled these variables from other reliable and established data sources such as the World Bank and the UN Statistics Division. Some of the important variables in the sample are per capita income, unemployment rates and carbon dioxide emissions. The sample described here is selected from the GapMinder dataset in the context of exploring the association between alcohol consumption and urbanization in countries. This particular sample contains 183 UN member countries, selected for availability of data on two specific variables (alcohol consumption per adult and rate of urban population) for the year 2008.
Procedures: The GapMinder dataset compiles data from other data sources. Here, data on alcohol consumption per adult are compiled by the World Health Organisation (WHO). On the other hand, data on percentage of population living in urban areas (or rate of urbanization) are estimated and prepared by the World Bank. The procedures used are data reporting and surveys. 
Measures:  WHO has been collecting data on alcohol consumption and alcohol control policies from its Member States since 1996. These data are collected and compiled through multiple surveillance as well as survey tools. For example, data on alcohol consumption are collected through the Noncommunicable Diseases (NCDs) Global Monitoring Framework, the Global school-based student health survey (GSHS) and STEPS survey (The WHO STEPwise approach to Surveillance (STEPS) is a simple, standardized method for collecting, analysing and disseminating data in WHO member countries. The STEPwise approach to risk factor surveillance questionnaire includes a module on alcohol consumption). The data compiled by WHO is available for download on the GapMinder website. For the data on urban population rate compiled by the World Bank, urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects). There was no need to further manage the data on these two variables for an analysis, because both variables are quantitative and a regression analysis can be easily done. 
learning-and-sharing · 7 years ago
Data Analysis Tools: Assignment 4
Testing a potential moderator 
Data and Question:
For this assignment, I use data taken from the GapMinder dataset and ask the following questions:
What is the nature and strength of association between female employment rate and internet use rate of countries?
Does per capita income of countries serve as a moderator in the above association?
Association between internet use rate and female employment:
Without a moderator, there is a very mild negative association between internet use rate and female employment rate, which is not statistically significant.
Next, I introduce per capita income as a moderator by subsetting the data set into low income and high income countries (using median as the dividing line for categorisation). The results are as follows:
1) There is a statistically significant negative association between internet use rate and female employment rate in low income countries. 
2) There is a statistically significant positive association between internet use rate and female employment rate in high income countries:
Conclusion:
Per capita income does serve as a moderator in the association between internet use rate and female employment rate. 
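The moderation check described above can be sketched roughly as follows. This is not the program that produced the results in this post: the rows are made-up numbers, and `pearson_r` is a small helper written for this illustration. The idea is simply to split the countries at the median of incomeperperson and compute the correlation separately within each half.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up rows for illustration only (NOT GapMinder data):
# (incomeperperson, internetuserate, femaleemployrate)
rows = [
    (400, 2, 70), (700, 5, 62), (1000, 9, 55), (1500, 14, 47),
    (9000, 40, 42), (15000, 55, 47), (25000, 70, 52), (40000, 85, 58),
]

# The median of income is the dividing line between the two subsets.
median_income = statistics.median([row[0] for row in rows])
low = [row for row in rows if row[0] <= median_income]
high = [row for row in rows if row[0] > median_income]

r_all = pearson_r([r[1] for r in rows], [r[2] for r in rows])
r_low = pearson_r([r[1] for r in low], [r[2] for r in low])
r_high = pearson_r([r[1] for r in high], [r[2] for r in high])

print(f"all countries: r = {r_all:.2f}")
print(f"low income:    r = {r_low:.2f}")
print(f"high income:   r = {r_high:.2f}")
```

When the sign or size of the correlation differs between the subsets, as it does in this toy data, the third variable is behaving as a moderator. Judging significance would additionally need a p-value for each correlation, which this sketch does not compute.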
learning-and-sharing · 7 years ago
Data Analysis Tools: Assignment 3
Pearson’s correlation
Data and Question:
The two correlation analyses described in this assignment were run on data taken from the GapMinder dataset. Two questions were asked:
Is there a statistically significant association between female employment rate and per capita income of countries? 
Is there a statistically significant association between urbanization and per capita income of countries?
Correlation between per capita income and female employment:
From the almost zero value of Pearson’s coefficient (and large p-value) we find that there is no significant positive or negative association between per capita income and female employment rate in countries. This is confirmed by a scatterplot:
Correlation between urbanization and per capita income:
From the positive value of Pearson’s coefficient and very small p-value we can conclude that there is a positive association between rate of urbanization and per capita income of countries. This is also seen from the scatterplot:
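For readers who want to see what goes into Pearson's coefficient, here is a minimal sketch using only the Python standard library. The numbers are invented for illustration and are not the GapMinder values; `pearson_r` is a helper defined here, not a function from the course code. The t statistic shown is the usual one for testing whether the correlation is zero, with n - 2 degrees of freedom; turning it into a p-value needs the t distribution, which is left out.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (not GapMinder data).
urban_rate = [20.0, 35.0, 50.0, 65.0, 80.0, 90.0]
income = [500.0, 1200.0, 3000.0, 8000.0, 15000.0, 30000.0]

r = pearson_r(urban_rate, income)
n = len(urban_rate)
# t statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.3f}, r squared = {r * r:.3f}, t = {t_stat:.2f}")
```

The squared coefficient is the share of variability in one variable that a straight-line relation with the other would account for.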
learning-and-sharing · 7 years ago
Data Analysis Tools: Assignment 2
Running a chi-square analysis 
Data and Question:
This chi-square analysis was run on data taken from the GapMinder dataset. Particularly, the question of interest was the following: Is there an association between the level of urbanization and per capita income of countries? For running a chi-square analysis I need both variables to be categorical. So I categorise countries into two categories of income- high income and low income, and three categories of urbanization- low, middle and high. 
The countries were divided into categories of urbanization (variable:urbanrate) by using quartiles. Countries falling below first quartile are categorised as low urbanized, while countries falling above third quartile are considered high urbanised. The rest are considered middle urbanized. Income categories are created using the median as the dividing line. 
Proportions by category:
The proportion of high income countries in each category of urbanization is displayed in the following table and graph:
At first glance there seems to be a clear positive association between urbanization and per capita income. The proportion of high income countries in the low, middle and high urbanized categories is 14.58%, 44.06% and 95.83% respectively. A chi-square analysis reveals that this difference is statistically significant:
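To make the chi-square test of independence concrete, here is a small sketch in plain Python. The counts are hypothetical and are not the actual GapMinder tabulation; they only mimic the pattern of a rising share of high income countries. The p-value line relies on the fact that a chi-square distribution with 2 degrees of freedom has the closed-form tail exp(-x/2), which applies to a 3x2 table like this one but not to other table sizes.

```python
import math

# Hypothetical counts (NOT the actual GapMinder tabulation):
# urbanization category -> (low income countries, high income countries)
observed = {
    "low": (40, 8),
    "middle": (50, 40),
    "high": (4, 44),
}

row_totals = {cat: sum(counts) for cat, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Share of high income countries within each urbanization category
for cat, counts in observed.items():
    print(f"{cat}: {100 * counts[1] / row_totals[cat]:.1f}% high income")

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2 = 0.0
for cat, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[cat] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (2 - 1)  # (rows - 1) * (columns - 1)
p_value = math.exp(-chi2 / 2)        # exact tail probability only when dof == 2

print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```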
Post-hoc test:
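Here I show results of the post-hoc test using the Bonferroni adjustment. The adjusted significance threshold is 0.05 divided by the number of comparisons being made. In this case, 0.05/3 ≈ 0.0167, and a pairwise comparison counts as significant only if its p value falls below this threshold.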
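The Bonferroni logic can be sketched as below. Again the counts are hypothetical and this is not the code used for the post. `chi2_2x2` is a helper written for this example; it computes the plain Pearson chi-square for a 2x2 table without a continuity correction, so its values can differ from library routines that apply one by default.

```python
import math
from itertools import combinations

def chi2_2x2(row_a, row_b):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = row_a, row_b
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Tail probability of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: category -> (low income countries, high income countries)
counts = {"low": (40, 8), "middle": (50, 40), "high": (4, 44)}

pairs = list(combinations(counts, 2))   # 3 categories give 3 pairwise comparisons
alpha_adjusted = 0.05 / len(pairs)      # Bonferroni-adjusted threshold

results = {}
for first, second in pairs:
    chi2, p = chi2_2x2(counts[first], counts[second])
    results[(first, second)] = (chi2, p, p < alpha_adjusted)
    print(f"{first} vs {second}: chi-square = {chi2:.2f}, p = {p:.3g}, "
          f"significant = {p < alpha_adjusted}")
```

With k categories there are k(k-1)/2 pairs, so the threshold shrinks quickly as categories are added.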
Conclusion: 
All three pairwise comparisons (using the adjusted threshold) show that the difference in the proportion of high income countries across urbanization categories is statistically significant. Thus there is an association between urbanization and per capita income levels of countries in the GapMinder dataset.
learning-and-sharing · 7 years ago
Data Analysis Tools: Assignment 1
Running analysis of variance
Data and Question:
This ANOVA was run on data taken from the GapMinder dataset. Particularly, the question of interest was the following: Is there a statistically significant difference in the average female employment rates of countries falling in the low income, middle income and high income categories?
The countries were divided into income categories (variable:incomeperperson) by using quartiles. Countries falling below first quartile of per capita income were categorised as low income, while countries falling above third quartile were considered high income. The rest were considered middle income.
Means by categories:
Following table shows the mean female employment rates in the three categories of per capita income:
At first glance it seems that low income countries have the highest female employment rate, which falls in the middle income category but rises again in the high income category.
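The quartile-based grouping and the group means can be illustrated with a short standard-library sketch. The data are invented, and the choice of `method="inclusive"` in `statistics.quantiles` is an assumption made for this example; other quantile conventions give slightly different cut points.

```python
import statistics

# Made-up (incomeperperson, femaleemployrate) pairs, for illustration only.
data = [
    (300, 68.0), (550, 61.0), (600, 64.0), (900, 52.0),
    (1800, 45.0), (2600, 41.0), (4200, 39.0), (7500, 43.0),
    (12000, 44.0), (24000, 50.0), (31000, 53.0), (38000, 55.0),
]

incomes = [income for income, _ in data]
q1, _median, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

def income_category(income):
    """Below the first quartile is low, above the third is high, the rest middle."""
    if income < q1:
        return "low"
    if income > q3:
        return "high"
    return "middle"

groups = {"low": [], "middle": [], "high": []}
for income, female_rate in data:
    groups[income_category(income)].append(female_rate)

means = {cat: statistics.mean(values) for cat, values in groups.items()}
for cat, mean_rate in means.items():
    print(f"{cat}: n = {len(groups[cat])}, mean female employment = {mean_rate:.1f}%")
```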
ANOVA Results:
The p value for the F statistic is 1.13e-06 (or 0.00000113) < 0.05. Thus, the observed difference in mean female employment rates across income categories is statistically significant. However, this does not tell us which category means differ from one another. To investigate that we need to conduct a post hoc test, which makes pairwise comparisons. Here I have used Tukey's Honestly Significant Difference test.
The Tukey test shows that the mean female employment rates of the high income and low income categories differ, as do those of the low income and middle income categories. The difference in mean female employment rate between high income and middle income countries, however, does not appear to be statistically significant:
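As a rough illustration of where the F statistic comes from, the sketch below computes it by hand for three small made-up groups. It is not the routine used for the results in this post. `one_way_anova_f` is a helper written for this example, and it stops at the F value and degrees of freedom; a p-value would need the F distribution, and the pairwise Tukey comparisons are not reproduced here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of group means around the grand mean
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of observations around their own group mean
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up female employment rates by income category (illustration only).
low_income = [68.0, 61.0, 64.0, 70.0]
middle_income = [45.0, 41.0, 39.0, 43.0]
high_income = [50.0, 53.0, 55.0, 48.0]

f_stat, df_b, df_w = one_way_anova_f([low_income, middle_income, high_income])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means are far apart relative to the spread within groups, which is what a small p-value reflects.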
learning-and-sharing · 7 years ago
Data Management and Visualization: Assignment 4
Creating graphs for exploratory data analysis
For this assignment, I analyse the GapMinder dataset to examine the nature of association between female employment rate (femaleemployrate) and each of the following variables:
Rate of urbanization: Percentage of population living in urban areas of the country in 2008 (urbanrate)
Per capita GDP: GDP per capita in US dollars (base year 2000) of the country in the year 2010 (incomeperperson)
Life expectancy: 2011 life expectancy at birth (in years) in the country (lifeexpectancy)
Internet usage: Percentage of population with access to the world wide web in 2010 (internetuserate)
These variables have been chosen because they serve as approximate indicators of economic development and growth of a country.
Context and questions: 
In most developing and underdeveloped countries, women have been employed in home- or family-based production or in small-scale agricultural and manufacturing activities. Much of this labour force participation might reflect a necessity for survival more than economic empowerment. With economic and social development, however, women in more developed countries may have better opportunities for education, vocational training and labour force participation that is truly empowering. In this context, I try to investigate the association that female labour force participation has with the above-mentioned indicators of development.
My question is: has economic development been truly inclusive of women in terms of labour force participation? Alternatively, what drives female employment in the world: necessity, or empowerment caused by economic development?
Note that the analysis is conducted on 164 countries, after dropping countries with missing data on relevant variables. 
Univariate analysis:
Female employment rate: This variable is fairly symmetrically distributed, with a median of 48.45% and mean of 48%. 
Income per person: This variable has a strongly right skewed distribution, with most of the countries concentrated at the lower per capita income levels and few countries having very high per capita income levels. This indicates a massive inequality in per capita income levels in the world. The median of this distribution is around 2542 US dollars but the 25th percentile (meaning that 25% countries lie below this value) is only 654 US dollars. 
Life expectancy: This variable has a left skewed distribution, with most countries clustered at higher life expectancies and a tail of countries with much lower values. In particular, the median life expectancy is 73 years while the 25th percentile is only 63 years. The maximum life expectancy is around 84 years.
Internet usage: This variable has been converted to a categorical variable for easier analysis of its association with female employment. As shown below, 64 countries have less than 20% of their population with access to the internet, while fewer than 15 countries fall in the 80-100% range of internet usage.
Rate of urbanization: This variable has also been converted to a categorical variable. Of the 5 categories created, the highest frequency is for the category 60-80%, with 51 countries falling in this range of urbanisation.
Associations:
Between female employment and per capita income:
To study this relation, I have created income categories by dividing the countries into groups by quartiles of per capita income. From the graph below, there seems to be a negative association between per capita income and female employment rate, which reverses for countries with the highest per capita income. However, the average female employment rate for the fourth category (countries with per capita income more than 75th percentile) is still lower than the average female employment rate in countries in the first category (per capita income less than 25th percentile). As measured by per capita income, increasing economic development seems to be accompanied by a fall in female employment rate up until very high levels of development are reached. 
Between female employment and life expectancy:
From the scatter plot (and line of best fit) below, there is a clear negative association between life expectancy and female employment rate. Countries with better life expectancy, on average, have lower female employment rates.
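A line of best fit like the one mentioned above is an ordinary least-squares fit. The sketch below shows the calculation on invented numbers (not the GapMinder values); `fit_line` is a helper defined only for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values for illustration only.
life_expectancy = [50.0, 55.0, 62.0, 68.0, 73.0, 78.0, 82.0]
female_employ_rate = [70.0, 64.0, 55.0, 50.0, 44.0, 45.0, 48.0]

slope, intercept = fit_line(life_expectancy, female_employ_rate)
print(f"femaleemployrate ~ {intercept:.1f} + ({slope:.2f}) * lifeexpectancy")
```

A negative slope corresponds to the downward-sloping line described in the text.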
Between female employment and internet usage:
The scatter plot above shows a mild negative association between rate of internet usage and female employment. To get a clearer picture, I use internet rate categories and construct a bar graph, shown below. We can see that average female employment rate is high in countries with very low internet usage (1-20%) and drops over the next two categories before rising again. Average female employment has a positive association with internet usage only for countries with more than 60% internet usage. Thus there is a non linear association between female employment rate and internet usage.
Between female employment and rate of urbanisation: 
There is a clear negative association between rate of urbanization and female employment rate- indicating that most employed women in the world may be employed in rural areas and productive activities. The rate of female employment falls sharply as rate of urbanization increases, with a mild reversal for highly urbanized countries, as can be seen from the bar graph below. Urbanization is associated with higher female employment only for countries with very high levels of urbanization. 
Summary:
Contrary to expectations, due to the absence of a positive association between female employment rate and the chosen indicators of development, there is no evidence from the GapMinder dataset that economic development, on average, has caused an increase in female employment. The highest volume of female employment is concentrated in countries that are low on all indicators of development studied. 
learning-and-sharing · 7 years ago
Data Management and Visualization: Assignment 3
Data Management and Exploratory Data Analysis
For my project in this course, I am working on the GapMinder data set and attempting to identify the association of certain socio-economic variables with suicide rate in countries. 
The GapMinder dataset has data on 14 variables for 231 countries. The four variables that are important for my analysis, and which I describe in this assignment, are:
‘suicideper100th’: Mortality due to self-inflicted injury, per 100 000 standard population in 2005
‘urbanrate’: Percentage of total population living in urban areas in the year 2008
‘internetuserate’: Internet users (per 100 people) in 2010. Internet users are people with access to the worldwide network.
‘incomeperperson’: Per capita GDP in the year 2010 (calculated in US dollars at 2000 prices) 
There are two main issues that this program deals with:
Converting continuous variables to categorical variables for meaningful analysis of frequency distributions.
Dropping irrelevant variables and rows (countries) with missing data. 
The program is copy-pasted in the last section, with comments accompanying the commands to explain the process. The main results are discussed in the next section.
Steps followed in writing the program:
After reading in my data, I converted all relevant variables to numeric
I saw that all my variables are continuous and attempted to convert them to categorical variables using qcut or pandas.cut commands. This ran into an error due to missing data. 
I assessed the extent of missing data by examining two variables (22 countries had missing data on suicide and 35 countries had missing data on employment)
I first dropped the irrelevant variables from my data set and then I dropped rows with missing data on columns/variables. This dataset now had 161 rows (countries)
Next I converted my continuous variables to categorical variables using pandas.cut command (qcut did not give any meaningful insights) and generated frequency distributions for these categorical variables. 
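The steps listed above can be sketched with pandas roughly as follows. This is an illustration on a tiny made-up table, not the program written for the assignment; the bin edges and the `suicidecategory` column name are assumptions chosen for the example.

```python
import pandas as pd

# Tiny made-up frame standing in for the GapMinder CSV; blanks mimic missing values.
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "suicideper100th": ["4.2", "12.5", " ", "31.0", "8.8", "22.1"],
    "urbanrate": ["35.5", "", "60.2", "88.0", "15.1", "72.4"],
    "lifeexpectancy": ["60", "71", "75", "80", "55", "78"],  # irrelevant here
})

cols = ["suicideper100th", "urbanrate"]
data = raw[["country"] + cols].copy()            # keep only the relevant columns
for col in cols:
    # Convert to numeric; anything unparseable (such as blanks) becomes NaN
    data[col] = pd.to_numeric(data[col], errors="coerce")

missing = data[cols].isna().sum()                # extent of missing data per column
print(missing)

data = data.dropna(subset=cols)                  # drop rows with missing values

# Fixed-width bins: the frequency table shows where countries concentrate
data["suicidecategory"] = pd.cut(data["suicideper100th"], bins=[0, 10, 20, 30, 40])
counts = data["suicidecategory"].value_counts(sort=False)
print(counts)

# Quantile bins: each bin holds about the same number of rows by construction
qcounts = pd.qcut(data["urbanrate"], q=2).value_counts(sort=False)
print(qcounts)
```

Because quantile bins are built to be equally populated, their frequency table is flat, which is one likely reason qcut is less informative than cut for this kind of summary.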
Results are displayed below:
1) Suicide rate category: 89 out of 161 countries had fewer than 10 suicide deaths per 100,000 population in 2005. While no country had more than 40 suicide deaths per 100,000, for 2 countries this number was higher than 30.
The command below shows that these two countries are Guyana and Lithuania
2) Internet usage rate: In 39 out of 161 countries, less than 10% people had access to the internet in 2010. In 25 countries, anywhere between 10 to 20% people had access, while in 5 countries more than 90% people had access to the internet in 2010. 
The command below shows the 5 countries where internet access was highest in 2010:
3) Urbanisation: 9 countries fall in the highest category for urbanisation, meaning that in these countries more than 90% of the population was living in urban areas in 2008. In 12 countries, somewhere between 10-20% of the population lived in urban areas. The largest group of countries (56 out of 161) fell in the 30-60% range of urbanisation.
4) Per capita income: Finally, another meaningful way to interpret information from a continuous variable like incomeperperson, without converting it to a categorical variable, is to examine the quantile distribution of the variable.
From above command, we can see that the minimum per capita GDP in the dataset is around 104 USD, while the maximum is about 52300 USD. Further, half of the countries have per capita GDP less than 2482 USD (all observations for year 2010). Other variables can also be analysed similarly. 
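A quantile summary of this kind can be produced in pandas along the following lines. The series below is invented for illustration and only loosely echoes the range discussed above; it is not the GapMinder column.

```python
import pandas as pd

# Made-up per capita income values (USD), for illustration only.
income = pd.Series(
    [104.0, 650.0, 1200.0, 2500.0, 5200.0, 9800.0, 26000.0, 52300.0],
    name="incomeperperson",
)

summary = income.describe()                      # count, mean, std, min, quartiles, max
quartiles = income.quantile([0.25, 0.5, 0.75])   # just the three quartiles

print(summary)
print(quartiles)

# A mean well above the median is a quick sign of a right-skewed distribution.
print(f"mean = {income.mean():.0f}, median = {income.median():.0f}")
```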
learning-and-sharing · 7 years ago
Data Management and Visualisation: Assignment 2
Running my first program in Python
The GapMinder dataset provided for analysis as part of the course has data on 231 countries, for 14 variables. The variables that are important for my analysis are the following:
‘incomeperperson’: Per capita GDP in the year 2010 (calculated in US dollars at 2000 prices) 
‘employrate’: Percentage of population over 15 years of age that was employed in the year 2007 
‘femaleemployrate’: Percentage of female population over 15 years of age that was employed in the year 2007 
‘urbanrate’: Percentage of total population living in urban areas in the year 2008
While writing and testing my first program, I faced two issues:
All of the variables of interest to me are continuous variables, such that ‘frequency distribution’ of these variables does not give any useful information. For example, the ‘employrate’ variable can take any value between 0 and 100%, and most countries had unique employment rates- because of which every value was taken with the frequency 1. To gain any meaningful insight, I needed to convert the variables into categorical variables by splitting all possible values into ranges of values. 
The above task was done using the pandas.cut command, but I ran into errors because of missing data. Hence I decided to drop the countries (rows) for which some variables were missing. This reduced my data set to 152 countries. 
After converting continuous variables to categorical variables (ranges), I got the following frequency distributions
I also calculated the distributions in percentage form, but have not shared them here because they do not add very significantly to our understanding from above frequency distributions.
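The difference between a raw frequency table and a binned one, and the percentage form mentioned above, can be sketched like this. The values are made up and the 10-point bins are an assumption for the example.

```python
import pandas as pd

# Made-up employment rates (percent), for illustration only.
employrate = pd.Series([46.0, 51.2, 55.3, 58.9, 61.5, 63.8, 72.4, 83.2], name="employrate")

# On a continuous variable every value tends to be unique, so each count is 1.
raw_counts = employrate.value_counts()
print(raw_counts)

# Splitting 0-100% into ranges of 10 points gives a readable distribution.
bins = list(range(0, 101, 10))
employmentcategory = pd.cut(employrate, bins=bins)

counts = employmentcategory.value_counts(sort=False)
percents = employmentcategory.value_counts(sort=False, normalize=True) * 100

print(counts)
print(percents.round(1))
```

Passing `sort=False` keeps the ranges in their natural order rather than ordering them by frequency.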
Summary of univariate analysis
‘incomecategory’: For 99 out of 152 countries (about 65%), per capita GDP was less than 5000 USD in 2010. For another 21 countries (about 14%) it was between 5000 and 10,000 USD. For higher values of per capita income, the number of countries was drastically lower and only 4 countries had GDP in the 35,000-40,000 USD range, which was the highest range.
‘employmentcategory’: The minimum employment rate observed in this dataset is between 30 to 40%, and 4 countries fall in this range. 59 out of 152 countries (about 39%) fall in the 50-60% employment range while only 6 countries have more than 80% employment rate.
‘fememploymentcategory’: The lowest female employment rate observed in the dataset is between 10 to 20%, with 6 countries falling in this bracket. If we club the ranges to look at the 40-60% range, we find that 84 countries fall in this range. That is, for about 55% of the countries, the female employment rate falls in the 40-60% range. The highest female employment rate is between 80 and 90%, but only 3 out of 152 countries fall in this category. 
‘urbanratecategory’: The rate of urbanisation varies quite significantly across the 152 countries in this dataset. While 12 countries have a rate as low as 10-20%, 7 countries fall in the 90-100% category. A relative majority (around 35%) seems to fall in the broad range of 30-60%, with 53 countries in this range. 
learning-and-sharing · 7 years ago
Data Management and Visualization: Assignment 1
Getting a research project started 
For our first assignment of this course I have chosen to work with the GapMinder dataset, which includes more than 200 indicators (mostly economic, demographic and health related) for 192 UN member countries, for at least the last 20 years. This dataset has been compiled by GapMinder from established sources like the World Bank, UN Statistics Division etc. 
While examining the codebook of the GapMinder dataset, I became interested in exploring the various socio-economic factors that may be associated with suicide rates across countries. In particular, I am interested in studying the potential association between (low) employment rate and rate of suicide. Further, I would also like to examine the nature of association, if any, between per capita income, urban population percentage (a possible indicator of urbanization), employment rate and rate of suicide. These variables are described in the GapMinder codebook as follows:
Literature survey
Suicide is a complex phenomenon and the factors driving it may vary in significance across societies and countries. Stack (2000), in a two-part review of sociological literature spanning 15 years (1981-1995), identifies economic strain, racial differences, and migration (a force lowering social integration) as factors that various studies have found to be positively linked to suicide. In fact economic strain, particularly in the form of unemployment, has emerged as an important factor driving suicides in many independent studies and journalistic reports. In 2015, Sarah Boseley reported in the international edition of The Guardian the findings of a study published in the journal The Lancet Psychiatry, which examined data for 63 countries over the years 2000 to 2011 in the context of the global economic crisis that occurred in that period. As reported, the study linked unemployment to about 45,000 suicides a year in the 63 countries: of an estimated 233,000 suicides each year, one in five could be attributed to unemployment. However, an observation was that the effect of unemployment on suicide risk appears to be stronger in countries where being out of work is uncommon. This makes it interesting to study the association between employment rate and suicide rate for a larger set of countries. While we might expect a negative association between the two because of the economic strain that low employment causes, the contrary may also be expected, because countries with low employment rates might be ones where unemployment is common, which reduces the social pressure towards suicide.
There is a significant body of literature investigating the relation between unemployment and suicide rates. For instance, Boor (1980) found that the annual variations in suicide rates between 1962 and 1976 for Canada, France, Germany (Federal Republic), Japan, Sweden, and the United States were positively and significantly associated with concomitant annual variations in unemployment rates. Although that study did not find the predicted relationship for England and Wales, a longitudinal study by Lewis and Sloggett (1998) conducted in England and Wales found that the association between suicide and unemployment was more important than the association with other socioeconomic measures. The latter study used individuals from the Office for National Statistics longitudinal study for whom 1981 census data were available. It found a strong independent association between unemployment and suicide, and estimated that being unemployed more than doubled suicide risk. Blakely et al. (2003) used data from the New Zealand 1991 census on respondents aged 18-64 years and found that being unemployed more than doubled the chances of death by suicide among 25 to 64 year olds for both men and women (the impact being higher for men). Unemployment was also strongly associated with suicide death among 18–24 year old men. However, Andres (2005) presents empirical results from 15 European countries between 1970 and 1998 showing that, contrary to prior studies, suicide rates were not sensitive to income levels or unemployment. Based on his analysis he also emphasizes the importance of using age-specific suicide rates, rather than the measures traditionally used, when evaluating the factors responsible for suicide mortality.
From the literature it is evident that suicide rates are driven by and associated with a large set of variables within any given society, with different variables taking on a different degree of importance given the social context of the society being studied. For instance, while loss of social integration (caused by migration and urbanization) may be an important driving factor behind suicide in one country, unemployment and indebtedness might be important in another. Further, it is important to study suicide rates in age divided categories, since factors driving suicide of different age groups are widely different. The data available on the GapMinder website makes this possible for many countries. At a later stage of this analysis, I am also interested in studying the association of suicide rates with unemployment benefits and expenditure on welfare schemes given by governments, as far as data is available. 
Research questions
In the context of this literature and given available data, I am interested in the following research questions:
Is there an association between suicide rate and employment rate across countries? What is the nature and magnitude of this association? The hypothesis is that there is a negative association between employment rate and suicide rate for most countries. 
What is the association between per capita income and suicide rate?
What is the nature of the association between rate of urbanization and suicide rate?
Bibliography/References
Andres, Antonio Rodriguez. "Income inequality, unemployment, and suicide: a panel data analysis of 15 European countries." Applied Economics 37.4 (2005): 439-451.
Blakely, Tony A., Collings, SCD and Atkinson, June. "Unemployment and suicide. Evidence for a causal association?." Journal of Epidemiology & Community Health 57.8 (2003): 594-600.
Boor, Myron. "Relationships between unemployment rates and suicide rates in eight countries, 1962–1976." Psychological Reports 47.3_suppl (1980): 1095-1101.
Boseley, Sarah. "Unemployment causes 45,000 suicides a year worldwide, finds study." The Guardian, 11 Feb. 2015. Web. 16 June 2018.
Lewis, Glyn, and Sloggett, Andy. "Suicide, deprivation, and unemployment: record linkage study." BMJ 317.7168 (1998): 1283-1286.
Preti, A. "Unemployment and suicide." Journal of Epidemiology & Community Health 57.8 (2003): 557-558.
Stack, Steven. "Suicide: a 15‐year review of the sociological literature part I: cultural and economic factors." Suicide and Life-Threatening Behavior 30.2 (2000): 145-162.
Stack, Steven. "Suicide: a 15‐year review of the sociological literature part II: modernization and social integration perspectives." Suicide and Life-Threatening Behavior 30.2 (2000): 163-176.