atsmalik-blog - Tumblr blog

atsmalik-blog · 6 years ago


Chi-Square Test of Independence

I looked at the alcohol consumption data and used it as the explanatory variable to see whether it is associated with the life expectancy of the countries in the sample. I used life expectancy relative to the sample median as the response variable and gave it two values: 0 (where life expectancy was less than the median for the data) and 1 (where life expectancy was equal to or greater than the median). I named the response variable “AGE”.

I then divided alcohol consumption into four categories based on the quartiles of the alcohol consumption data. I named the explanatory variable “AL_CONSUMPTION”.

The null hypothesis is that there is no connection between alcohol consumption and whether people live to the median age or longer across all the countries.

CODE and analysis:

data work.Gapminder_Data (KEEP= country alcconsumption AL_CONSUMPTION lifeexpectancy AGE );

SET mydata.gapminder;

/*remove the nil values*/

IF alcconsumption ne .;

IF lifeexpectancy ne .;

LABEL AGE = 'UNDER OR OVER MEDIAN AGE';

LABEL AL_CONSUMPTION = 'BY QUARTILES OF DATA POPULATION';

/*Explanatory variable is amount of alcohol consumed per capita. The data is divided by quartiles*/

IF alcconsumption <= 2.56 THEN AL_CONSUMPTION = 2.470; /*Quartile 1*/

ELSE IF alcconsumption <=5.92 THEN AL_CONSUMPTION = 5.865; /*Quartile 2*/

ELSE IF alcconsumption <=9.99 THEN AL_CONSUMPTION = 9.87; /*Quartile 3*/

ELSE IF alcconsumption >9.99 THEN AL_CONSUMPTION = 23.01; /*Quartile 4*/

/*Response variable: life expectancy below the median cutoff (AGE = 0) or at/above it (AGE = 1). The cutoff applied below is 72.5585*/

IF lifeexpectancy < 72.5585 THEN AGE = 0;

ELSE IF lifeexpectancy >=72.5585 THEN AGE = 1;

run;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows a p value of 0.0001, so we can reject the null hypothesis and say that alcohol consumption is associated with whether a country's life expectancy is at or above the median. The chi-square test shows an association, not a causal effect. The scatter plot also suggests some relationship, as the points fall into groupings.

To see which consumption levels differ, I then ran post hoc tests. I used the Bonferroni adjustment: with six pairwise comparisons, the significance level becomes 0.05/6 ≈ 0.008, so only p values below 0.008 count as significant.
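For readers who want to see what the CHISQ option is computing, here is a minimal Python sketch of the Pearson chi-square statistic for a table of counts. The function name `chi_square` and the counts in `example` are my own assumptions for illustration; they are not the Gapminder output, and the sketch does not compute the p value.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts, NOT the Gapminder results: rows are AGE = 0 / 1,
# columns are the four AL_CONSUMPTION quartile groups.
example = [[30, 25, 15, 10],
           [12, 18, 28, 32]]
stat, dof = chi_square(example)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

A 2 x 4 table like AGE by AL_CONSUMPTION has (2 - 1) x (4 - 1) = 3 degrees of freedom, while each pairwise post hoc table is 2 x 2 with 1 degree of freedom.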

DATA COMPARE1; SET gapminder_Data;

IF AL_CONSUMPTION = 2.470 OR AL_CONSUMPTION = 5.865;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (0.3388) is not significant, so I fail to reject the null hypothesis: between quartile 1 and quartile 2 consumption, there is no evidence of a relationship with median life expectancy.

DATA COMPARE2; SET gapminder_Data;

IF AL_CONSUMPTION = 2.470 OR AL_CONSUMPTION = 9.87;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (0.0583) is not significant, so I fail to reject the null hypothesis: between quartile 1 and quartile 3 consumption, there is no evidence of a relationship with median life expectancy.

DATA COMPARE3; SET gapminder_Data;

IF AL_CONSUMPTION = 2.470 OR AL_CONSUMPTION = 23.01;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (0.0011) is significant (less than the Bonferroni-adjusted value of 0.008), so I reject the null hypothesis: between quartile 1 and quartile 4 consumption, there is a relationship with median life expectancy.

DATA COMPARE4; SET gapminder_Data;

IF AL_CONSUMPTION = 5.865 OR AL_CONSUMPTION = 9.87;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (0.0053) is significant (less than the Bonferroni-adjusted value of 0.008), so I reject the null hypothesis: between quartile 2 and quartile 3 consumption, there is a relationship with median life expectancy.

DATA COMPARE5; SET gapminder_Data;

IF AL_CONSUMPTION = 5.865 OR AL_CONSUMPTION = 23.01;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (<0.0001) is significant (less than the Bonferroni-adjusted value of 0.008), so I reject the null hypothesis: between quartile 2 and quartile 4 consumption, there is a relationship with median life expectancy.

DATA COMPARE6; SET gapminder_Data;

IF AL_CONSUMPTION = 9.87 OR AL_CONSUMPTION = 23.01;

proc sort; by country;

proc freq; tables AGE*AL_CONSUMPTION/CHISQ;

RUN;

The output shows that the p value (0.149) is not significant, so I fail to reject the null hypothesis: between quartile 3 and quartile 4 consumption, there is no evidence of a relationship with median life expectancy.
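The six comparisons above can be checked against the adjusted threshold in a few lines. This is a rough Python sketch, assuming the conventional starting significance level of 0.05; the group names Q1 to Q4 are my own shorthand for the four quartile groups, and the p values are the ones reported in this post.

```python
from itertools import combinations

ALPHA = 0.05  # assumed overall significance level
groups = ["Q1", "Q2", "Q3", "Q4"]
pairs = list(combinations(groups, 2))  # all pairwise comparisons
threshold = ALPHA / len(pairs)         # Bonferroni: 0.05 / 6

# p values as reported in the post; "<0.0001" is entered as its upper bound.
p_values = {
    ("Q1", "Q2"): 0.3388,
    ("Q1", "Q3"): 0.0583,
    ("Q1", "Q4"): 0.0011,
    ("Q2", "Q3"): 0.0053,
    ("Q2", "Q4"): 0.0001,
    ("Q3", "Q4"): 0.149,
}

significant = [pair for pair in pairs if p_values[pair] < threshold]
for pair in pairs:
    verdict = "reject H0" if pair in significant else "fail to reject H0"
    print(pair, p_values[pair], verdict)
```

Four groups give six pairs, which is where the divisor of six comes from; the exact threshold is slightly above the rounded 0.008 used in the text, and the verdicts are the same either way.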


atsmalik-blog · 6 years ago


ANOVA test

For the purpose of this assignment, I decided to use the polity score as the explanatory variable to see if that has an impact on the average income per person (quantitative variable). The null hypothesis is that there is no influence of the polity score on the average income per person.

I removed the null values from both the polity score and income per person in the data.

I collapsed the 21 categories of the polity score into 4 categories:

Group A: -10 to -5

Group B: -4 to 0

Group C: 1 to 5

Group D: 6 to 10

Code:

data work.gap; set mydata.gapminder;

/*Remove null values*/

IF polityscore ne .;

IF incomeperperson ne .;

/*Creating a 4 category variable for polity score*/

IF polityscore <= -5 THEN politygroup2 = 'A';

ELSE IF polityscore <= 0 THEN politygroup2 = 'B';

ELSE IF polityscore <= 5 THEN politygroup2 = 'C';

ELSE politygroup2 = 'D';

proc sort;

by polityscore;

run;

/*Running ANOVA model using only four categories of the polity score with the DUNCAN post hoc test*/

proc anova; class politygroup2;

model incomeperperson = politygroup2;

means politygroup2/DUNCAN;

run;
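To double check which polity scores land in which group, the conditions in the SAS step above can be mirrored in a short Python sketch. The function name `polity_group` is hypothetical, and the sketch assumes polity scores are whole numbers from -10 to 10. Note that under these conditions a score of -5 falls in group A.

```python
def polity_group(score):
    """Mirror the SAS conditions: A <= -5, B in (-5, 0], C in (0, 5], D > 5."""
    if score <= -5:
        return "A"
    elif score <= 0:
        return "B"
    elif score <= 5:
        return "C"
    return "D"


# List the scores that fall into each group across the 21 possible levels.
members = {}
for s in range(-10, 11):
    members.setdefault(polity_group(s), []).append(s)

for group, scores in members.items():
    print(group, scores[0], "to", scores[-1], f"({len(scores)} levels)")
```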

The results come back as the following:

The data show a very low p value of 0.0006, so we can reject the null hypothesis in favour of the alternative hypothesis and state that the polity score group has a statistically significant association with the average income per person.

The box plot gives us a very interesting picture: the highest and the lowest polity score groups have the higher means and greater distributions.

However, the population was divided into 4 categories according to ascending polity score with ‘A’ and ‘B’ representing polity score of 0 or below and ‘C’ and ‘D’ representing polity score of greater than 0. I want to see which categories are significantly different from each other. For this, I ran the DUNCAN post hoc test.

The output of the post hoc test aligns with the boxplot. It shows that income per person in groups A and D, the groups with the lowest and the highest polity scores respectively, is not statistically different. Similarly, incomes in groups A, B and C are not statistically different from each other. However, incomes in group D are statistically different from those in groups B and C.

Out of curiosity, I ran the test on all 21 levels. The results are very interesting: the p value is very low (<0.0001), and the boxplot shows that countries with a polity score of -10 have a higher mean and median income and fewer outliers than countries with a polity score of 10. Countries with a score of 10 have lower and upper outliers far beyond those of countries with a score of -10, showing greater income disparity.

A DUNCAN post hoc test shows that the incomes of countries with scores of -10, 10 and -8 are not statistically different. Countries with scores of -8 and -2 are also not different.

The statistically significant difference is between the groups with scores of -10, 10 and -8 and the rest of the population (except between -8 and -2).


atsmalik-blog · 6 years ago


Graphing variables

In the gapminder data, I first looked at the distribution of the countries in the dataset by their polity score to see if the data was skewed.

Code:

/*Looking at the distribution of the countries by their polity score*/

proc gchart; vbar polityscore/discrete type=PCT;

This gives a (mostly) unimodal distribution, with the highest percentage of the Gapminder countries at a polity score of 10. The data are left skewed, with a long tail of countries scoring less than 10. There is a slight peak at a score of -7.

I also took the urban rate and collapsed the data into bins of ten percentage points.

/*Binning to the lower multiple of ten (all values under 20 go to 10)*/

IF urbanrate < 20 THEN URBAN_RATE = 10;

ELSE IF urbanrate < 30 THEN URBAN_RATE = 20;

ELSE IF urbanrate < 40 THEN URBAN_RATE = 30;

ELSE IF urbanrate < 50 THEN URBAN_RATE = 40;

ELSE IF urbanrate < 60 THEN URBAN_RATE = 50;

ELSE IF urbanrate < 70 THEN URBAN_RATE = 60;

ELSE IF urbanrate < 80 THEN URBAN_RATE = 70;

ELSE IF urbanrate < 90 THEN URBAN_RATE = 80;

ELSE IF urbanrate < 100 THEN URBAN_RATE = 90;

ELSE IF urbanrate = 100 THEN URBAN_RATE = 100;

This indicates a unimodal distribution with the peak in the 60 bin, that is, countries with 60 to 69% of their population in urban areas.

I also looked at the female employment rate and used the median value to collapse the data into two bins: at or under the median value and over the median value.

/*descriptive statistics for the employment rate*/

proc univariate; var femaleemployrate employrate; run;

/*Collapsing the data for female employment rate into two categories using the median as the cut off point */

IF femaleemployrate <= 48.6 THEN FER = 0;

ELSE IF femaleemployrate > 48.6 THEN FER = 1;
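As an illustration of the median split, here is a small Python sketch that computes the cutoff from the data instead of typing it in by hand. The values in `rates` are hypothetical and chosen only so that the median comes out at 48.6; they are not taken from the Gapminder file.

```python
from statistics import median

# Hypothetical female employment rates (percent), for illustration only.
rates = [35.2, 41.0, 48.6, 52.3, 60.1, 44.7, 71.5]
cutoff = median(rates)

# Same rule as the SAS step: at or below the median -> 0, above it -> 1.
fer = [0 if r <= cutoff else 1 for r in rates]
print(cutoff, fer)
```

Computing the cutoff in code keeps the split in step with the data if rows are later added or dropped, which a hard-coded value would not do.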

I made a bar chart to look at any relationship between female employment rate and polity score

/*Looking at the distribution of female employment by polity score*/ PROC GCHART; VBAR polityscore/discrete TYPE = mean SUMVAR=FER;

The chart shows no clear pattern: a high polity score does not appear to go with a higher female employment rate.

Finally, I looked at the relationship between the overall employment rate and the female employment rate using a scatter plot.

/*scatterplot employrate and female employment rate*/ proc gplot; plot employrate*femaleemployrate; run;

This shows a very strong relationship between the two employment rates.


atsmalik-blog · 6 years ago


Assignment Data management decisions

I am using the Gapminder dataset for my project and I am looking at the impact of democracy score and urban rate on female employment in different countries. Hence the variables that I need for my project are Country, femaleemployrate, employrate, polityscore and urbanrate.

The variables that I am using for frequency tables are as follows:

Polityscore, urbanrate, employrate and femaleemployrate. However, Polityscore is the only variable where a clear frequency count can be done. For the other variables, I need to aggregate the values by rounding. I have given examples of what the data looks like without rounding and then with rounding. I have also tried different rounding techniques and, to look at their impact, I have created new columns that show the output of each.

Since I am not interested in the countries where the data is missing, I have used the SAS code to drop all those cases where there are missing values in any of the variables that I am interested in.

CODE:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

/*Getting the link to the gapminder dataset and keeping only the variables that I require to reduce noise*/

data gapminder (KEEP= Country incomeperperson femaleemployrate employrate polityscore urbanrate urban_rate urban_rate_a employ_rate female_employ_rate);

set mydata.gapminder;

/*Creating new aggregate variables columns*/

Label polityscore="2009 Democracy Score";

Label urbanrate="Urban rate as per data";

Label urban_rate="Urban rate rounded to nearest decile";

Label urban_rate_a="Urban rate rounded to the nearest decile rounded up";

Label employ_rate="Employment rate rounded to nearest decile";

Label female_employ_rate="Female employment rate rounded to nearest decile";

/*Aggregating data, removing null values*/

IF polityscore ne .; /*Removing the 52 countries where there is no polity score*/

IF urbanrate ne .; /*Removing the 1 value where the data is missing*/

IF femaleemployrate ne .; /*Removing the 3 cases where data is missing*/

IF urbanrate LE 10 THEN URBAN_RATE = 'UNDER 10'; /*Binning to the lower multiple of ten*/

ELSE IF urbanrate < 20 THEN URBAN_RATE = 10;

ELSE IF urbanrate < 30 THEN URBAN_RATE = 20;

ELSE IF urbanrate < 40 THEN URBAN_RATE = 30;

ELSE IF urbanrate < 50 THEN URBAN_RATE = 40;

ELSE IF urbanrate < 60 THEN URBAN_RATE = 50;

ELSE IF urbanrate < 70 THEN URBAN_RATE = 60;

ELSE IF urbanrate < 80 THEN URBAN_RATE = 70;

ELSE IF urbanrate < 90 THEN URBAN_RATE = 80;

ELSE IF urbanrate < 100 THEN URBAN_RATE = 90;

ELSE IF urbanrate = 100 THEN URBAN_RATE = 100;

ELSE URBAN_RATE ='CHECK';

urban_rate_a = round(urbanrate,10); /*Rounding to the nearest multiple of ten*/

IF employrate LE 10 THEN EMPLOY_RATE = 'UNDER 10'; /*Binning to the lower multiple of ten*/

ELSE IF employrate < 20 THEN EMPLOY_RATE = 10;

ELSE IF employrate < 30 THEN EMPLOY_RATE = 20;

ELSE IF employrate < 40 THEN EMPLOY_RATE = 30;

ELSE IF employrate < 50 THEN EMPLOY_RATE = 40;

ELSE IF employrate < 60 THEN EMPLOY_RATE = 50;

ELSE IF employrate < 70 THEN EMPLOY_RATE = 60;

ELSE IF employrate < 80 THEN EMPLOY_RATE = 70;

ELSE IF employrate < 90 THEN EMPLOY_RATE = 80;

ELSE IF employrate < 100 THEN EMPLOY_RATE = 90;

ELSE IF employrate = 100 THEN EMPLOY_RATE = 100;

ELSE EMPLOY_RATE ='CHECK';

/* female_employ_rate = round(femaleemployrate,10); Rounding to the nearest multiple of ten, not used*/

IF femaleemployrate LE 10 THEN FEMALE_EMPLOY_RATE = 'UNDER 10'; /*Binning to the lower multiple of ten*/

ELSE IF femaleemployrate < 20 THEN FEMALE_EMPLOY_RATE = 10;

ELSE IF femaleemployrate < 30 THEN FEMALE_EMPLOY_RATE = 20;

ELSE IF femaleemployrate < 40 THEN FEMALE_EMPLOY_RATE = 30;

ELSE IF femaleemployrate < 50 THEN FEMALE_EMPLOY_RATE = 40;

ELSE IF femaleemployrate < 60 THEN FEMALE_EMPLOY_RATE = 50;

ELSE IF femaleemployrate < 70 THEN FEMALE_EMPLOY_RATE = 60;

ELSE IF femaleemployrate < 80 THEN FEMALE_EMPLOY_RATE = 70;

ELSE IF femaleemployrate < 90 THEN FEMALE_EMPLOY_RATE = 80;

ELSE IF femaleemployrate < 100 THEN FEMALE_EMPLOY_RATE = 90;

ELSE IF femaleemployrate = 100 THEN FEMALE_EMPLOY_RATE = 100;

ELSE FEMALE_EMPLOY_RATE ='CHECK';

/*sorting the data by the name of the countries alphabetically*/

proc sort; by country;

/* Frequency by polity score to give me a distribution of the countries by their polity score*/

proc freq;

tables polityscore urbanrate urban_rate urban_rate_a femaleemployrate female_employ_rate employ_rate ;

run;

To show why I needed to aggregate data, I have given a partial output of the original data below for urban rate:

My final outputs, with my comments are as below:

The largest group in the data is countries with a polity score of 10, which make up 20.38% of the sample. The second largest is a score of 8, with 11.46% of the sample. Cumulatively, about a third of the countries (33.12%) have a polity score of 0 or less.

To aggregate the data, I tried two methods: one was to round to the nearest ten (example: 65.5 becomes 70) and the other was to group within the tens (example: any value from 60 up to, but not including, 70 is grouped as 60).
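The difference between the two methods is easy to see on a few values. Below is a minimal Python sketch; the function names are my own, the rounding function is written to round halves up (which is how I understand SAS ROUND(x, 10) to behave for non-negative values), and the special cases in the IF/ELSE chain above ('UNDER 10' and 'CHECK') are left out.

```python
import math


def round_to_nearest_ten(x):
    """Round a non-negative value to the nearest multiple of 10, halves going up."""
    return int(math.floor(x / 10 + 0.5)) * 10


def bin_by_tens(x):
    """Group a value into the lower multiple of 10, e.g. 60 <= x < 70 becomes 60."""
    return int(math.floor(x / 10)) * 10


for value in (60.0, 64.9, 65.5, 69.9):
    print(value, round_to_nearest_ten(value), bin_by_tens(value))
```

With rounding, the 70 group collects values from 65 up to just under 75; with binning, the 60 group collects values from 60 up to just under 70, so the same country can end up in a different row of the frequency table depending on the method.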

The example of the first output is given below:

This table shows what happens if the numbers are rounded to the nearest ten. The most common urban rate is 70, with 18.47% of countries in that group (values from 65% up to just under 75% round to 70). Around 47% of the countries have half or less of their population living in cities so, overall, there is an almost 50-50 split between urban and rural population in the sample of countries.

However, if we group the values within the tens, without rounding up, then the data looks like the table given below:

Now we see that the most common urban rate is 60, with 19.75% of the countries having between 60 and 69% of their population living in cities. I have decided to go with this format for the rest of the data aggregates.

Only 4% of the countries have an overall employment rate in the 80s (aggregated score 80). 56% of the countries have an employment rate below 60%. The 50s (50-59%) is also the largest subset, with 38% of the countries having that employment rate.

Only 4 countries (2.5% of the sample) have a female employment rate in the 80s (aggregate 80). 44 countries (28% of the sample) have 40 to 49% of their females over the age of 15 in employment. Only 17% of the countries in the sample have a female employment rate of greater than 60%.


atsmalik-blog · 6 years ago


Assignment 1: Frequency distribution

The variables that I am using for frequency tables are polityscore, urbanrate, employrate and femaleemployrate.

Since I am not interested in the countries where the data is missing, I have used the SAS code to drop all those cases where there are missing values in any of the variables that I am interested in.

CODE:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

/*Getting the link to the gapminder dataset and keeping only the variables that I require to reduce noise*/

data gapminder (KEEP= Country incomeperperson femaleemployrate employrate polityscore urbanrate urban_rate urban_rate_a employ_rate female_employ_rate);

set mydata.gapminder;

/*Creating new aggregate variables columns*/

length urban_rate 3.;

length urban_rate_a 3.;

length employ_rate 3.;

length female_employ_rate 3.;

Label polityscore="2009 Democracy Score";

Label urbanrate="Urban rate as per data";

Label urban_rate="Urban rate rounded to nearest decile";

Label urban_rate_a="Urban rate rounded up to the next whole number";

Label employ_rate="Employment rate rounded to nearest decile";

Label female_employ_rate="Female employment rate rounded to nearest decile";

/*Aggregating data, removing null values*/

IF polityscore ne .; /*Removing the 52 countries where there is no polity score*/

IF urbanrate ne .; /*Removing the 1 value where the data is missing*/

IF femaleemployrate ne .; /*Removing the 3 cases where data is missing*/

urban_rate = round(urbanrate,10); /*Rounding to the nearest multiple of ten*/

urban_rate_a = ceil(urbanrate); /*Rounding up to the next whole number*/

employ_rate = round(employrate,10); /*Rounding to the nearest multiple of ten*/

female_employ_rate = round(femaleemployrate,10); /*Rounding to the nearest multiple of ten*/

/*sorting the data by the name of the countries alphabetically*/

proc sort; by country;

/* Frequency by polity score to give me a distribution of the countries by their polity score*/

proc freq;

tables polityscore urbanrate urban_rate urban_rate_a femaleemployrate female_employ_rate employ_rate ;

run;
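For anyone unfamiliar with what a frequency table of this kind contains, here is a rough Python sketch that builds the same kind of columns (count, percent, cumulative percent) for one variable. The function name `frequency_table` and the values in `sample` are hypothetical, not Gapminder data, and the layout is only an approximation of the PROC FREQ output.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, count, percent, cumulative percent), sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / total,
                     100 * running / total))
    return rows


# Hypothetical rounded urban rates, for illustration only.
sample = [60, 70, 60, 50, 70, 60, 40, 50, 60, 70]
for value, count, pct, cum in frequency_table(sample):
    print(f"{value:>5} {count:>3} {pct:6.2f} {cum:7.2f}")
```

The cumulative percent column is the one used for statements such as "x% of the countries are at or below this value".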

Output:

To show where the data still have missing values and to show why I needed to aggregate data, I have given a partial output of the original data below for urban rate

My final outputs, with my comments are as below:

The most common urban rate is 70, with 18.47% of countries in that group (values from 65% up to just under 75% round to 70). Around 47% of the countries have half or less of their population living in cities so, overall, there is an almost 50-50 split between urban and rural population in the sample of countries.

Only 9% of the countries have an overall employment rate that rounds to 80 (from 75% up to just under 85%). 73% of the countries have an employment rate that rounds to 60 or lower.

Only 9 countries (5.73% of the sample) have a female employment rate that rounds to 80 (aggregate 80). 55 countries (35% of the sample) have about half of their females over the age of 15 in employment (aggregate 50).


atsmalik-blog · 6 years ago


I would like to see if there is any correlation between female employment and democracy. Do countries with higher democracy score have more female employment compared to those countries with low democracy score? I believe that to be the case so my hypothesis is that there is a positive correlation between higher democracy score and high female employment.

As a secondary impact, I would also like to explore if urban rate plays a part in female employment, i.e. my hypothesis being that countries with higher urban population have higher female employment.

For this, I will be using the GapMinder dataset.

There have been studies on democracy and female participation. A study by Ulf Brunnbauer (“From equality without democracy to democracy without equality? Women and transition in southeast Europe”. SEER - South-East Europe Review for Labour and Social Affairs 03:151-168. https://www.ceeol.com/search/article-detail?id=239873) looks at the European case, but it focuses more on participation in the democratic process.

“Labor Market Attitudes and Experienced Political Institutions” by Ugo A. Troiano (dated 14 January 2018, https://mpra.ub.uni-muenchen.de/83927/) states that women who have experienced democratic institutions during their adolescence are more likely to participate in the labor market, keeping constant the country, age and many other confounding factors. The paper looks at Egypt, Spain, Afghanistan and Syria. However, this paper looks at the age of the female workforce and the impact of the institutions during the earlier years in life. The paper concludes that there is evidence that democratization makes the female participation in the labor market easier. This is what I aim to prove as well.

A paper “Gender Role Attitudes and the Labour-market Outcomes of Women across OECD Countries” by Nicole M Fortin (Oxford Review of Economic Policy, Volume 21, Issue 3, 1 October 2005, Pages 416–438) looks at changing attitudes towards women in workforce but does not bring in the political structure impact.

“Attitudes, Policies and Work” by Giavazzi, Francesco, Fabio Schiantarelli and Michel Serafinelli (Journal of the European Economic Association 2013) looks at cultural attitudes as significant determinants of the employment rates of women.

“Globalising Gujarat: Urbanisation, Employment and Poverty” by Amitabh Kundu (Economic and Political Weekly Vol. 35, No. 35/36 (Aug. 26 - Sep. 8, 2000), pp. 3172-3179+3181-3182 ) shows that there is a correlation between urbanisation and employment which is the secondary question that I want to tackle in this project.

“Trends and Structure of Employment in the 1990s: Implications for Urban Growth” by the same author (Economic and Political Weekly Vol. 32, No. 24 (Jun. 14-20, 1997), pp. 1399-1405) supports this as well.
