final Coursera SAS first part assignment
In this last part we will see whether there are any relationships between my variables: incomeperperson (also known as GDP per capita), breast cancer incidence per 100k, alcohol consumption in liters per year, and finally suicide per 100k.  Since each of these variables is quantitative, I have to stop using the frequency groupings I used in the previous assignment.
The program is:
[Image: SAS program]
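The program above was posted as an image, so its exact text is not available here. A minimal sketch of what the univariate part of such a program could look like is below. It assumes the same course library path and variable names used in the earlier data management post, and it is not necessarily the code that produced the output shown.

```sas
/* Sketch only: the library path and data set name are taken from the
   data management post and are assumed to still apply here. */
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA new;
  SET mydata.gapminder;
RUN;

/* One univariate report per variable: mean, standard deviation,
   skewness, quantiles and extreme observations. */
PROC UNIVARIATE DATA=new;
  VAR incomeperperson breastcancerper100TH alcconsumption suicideper100TH;
RUN;
```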
First, the univariate analyses.
First, income per person:
[Images: univariate output for income per person]
Sorry for the repetition.  As can be seen, income per person is all over the place, which is expected given the vast difference in wealth between places such as Qatar and Niger.
Next are the univariate results for breast cancer:
[Images: univariate output for breast cancer per 100k]
The good news is that breast cancer is far less variable, but a standard deviation of 22 is still quite wide.  The skewness is interesting as a check on how close this is to a normal curve; note that a symmetric, normal-shaped distribution has a skewness near 0, not near 1 as I first assumed.
Next is alcohol consumption:
[Images: univariate output for alcohol consumption]
We have a standard deviation really close to the mean (about 4 for the standard deviation and 6 for the mean), so we can't even go 2 standard deviations below the mean without reaching an impossible negative consumption (6 - 2×4 = -2).
Finally, suicide per 100k:
[Images: univariate output for suicide per 100k]
Now for the fun part: are any of these variables related?  First, breast cancer per 100k vs income per person:
[Image: scatter plot of breast cancer per 100k vs income per person]
I know the graph is reversed: breast cancer should be on the y axis and income per person on the x axis, because income per person should affect breast cancer incidence, not the other way around.  For example, income per person may lead to differences in diet, medication (hormone therapy), and better detection, which might mean higher diagnosis rates.  As can be seen, there is a relationship here and it seems positive.  Because income per person spans about three orders of magnitude (roughly 104 to 105,147), the income axis really should be logarithmic, but I don’t know how to do that.
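One possible way to get a logarithmic income axis, and to put the explanatory variable on the x axis, is sketched below. It assumes PROC SGPLOT is available in the course environment and reuses the library path from the data management post. The data set name logplot and the variable log10income are made up for illustration.

```sas
/* Sketch only: "logplot" and "log10income" are hypothetical names. */
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA logplot;
  SET mydata.gapminder;
  /* LOG10 is undefined for missing or non-positive values, so guard it */
  IF incomeperperson > 0 THEN log10income = LOG10(incomeperperson);
RUN;

/* Option 1: keep the raw income values and draw the x axis on a log scale */
PROC SGPLOT DATA=logplot;
  SCATTER X=incomeperperson Y=breastcancerper100TH;
  XAXIS TYPE=LOG LOGBASE=10 LABEL="GDP per capita (log scale)";
  YAXIS LABEL="Breast cancer incidence per 100k";
RUN;

/* Option 2: plot the log-transformed variable on an ordinary linear axis */
PROC SGPLOT DATA=logplot;
  SCATTER X=log10income Y=breastcancerper100TH;
  XAXIS LABEL="log10 of GDP per capita";
  YAXIS LABEL="Breast cancer incidence per 100k";
RUN;
```

Both options show the same picture; the second is handy if the transformed variable is also going to be used in a later regression.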
There might be 2 factors at play.  Higher income going with higher breast cancer incidence looks quite weird; however, confounding variables might be at work.  First, Caucasians seem to have a higher incidence of breast cancer.  Since many high income countries are European, this could help explain part of the trend.  Second, estrogen replacement therapy for menopause has been known to have an association with breast cancer (this is part of the reason why hormone replacement therapy use has declined).
Next, alcohol consumption vs breast cancer:
[Image: scatter plot of alcohol consumption vs breast cancer per 100k]
Here it is a bit harder to tell.  There is possibly a positive relationship, but with the points spread this widely it could be anything.
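When a scatter plot is this hard to read, a correlation coefficient can put a number on it. The sketch below is one way to do that with PROC CORR. It assumes the library path from the data management post and is not part of the original program.

```sas
/* Sketch only: not part of the program shown in this post. */
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

/* PEARSON measures linear association; SPEARMAN is rank-based and less
   sensitive to outliers. Each coefficient is printed with a p-value. */
PROC CORR DATA=mydata.gapminder PEARSON SPEARMAN;
  VAR alcconsumption breastcancerper100TH;
RUN;
```

A coefficient near 0 with a large p-value would support the "could be anything" reading, while a clearly positive coefficient with a small p-value would support a positive relationship.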
Finally, GDP per capita vs suicide per 100k:
[Image: scatter plot of GDP per capita vs suicide per 100k]
I don’t know how I should feel about a “no” answer for this relationship.  Suicide is an incredibly tragic and complicated topic, with many, many variables taking part in the decision.  There are some low income countries with high suicide rates, and there are also some high income countries with high suicide rates, such as the US, Japan and Korea.  That’s all I really want to say about suicide, except a statement to any individual contemplating this: you are not alone, and there are numbers you can call just to talk.  Thank you for reading my very long entry. It’s been a fun class and I hope to see you all again.
Data management lesson 3
Last time I found the frequency of income per person (“incomeperperson”), the rate of breast cancer per 100,000 (“breastcancerper100TH”), alcohol consumption in per capita liters of pure alcohol per year (“alcconsumption”) and finally the rate of suicide per 100k (“suicideper100TH”) from the gapminder database using SAS.  Since each of these variables was continuous, the result was a mess.
This time, I decided to make the data easier to manage by grouping each variable into 5 groups of approximately 20% each (quintiles) using SAS.  The new grouped variables were named breastcancerper100THgroup for breastcancerper100TH, GDPpercapita for incomeperperson, alcoholconsumption for alcconsumption, and suicideper100k for suicideper100TH.  After the first run I did not get the right distribution: the lowest group, the 0-20% quintile, was always too large.  SAS recognized the missing data in the gapminder dataset as missing (helpfully encoded as “.”), but the new grouped variables did not.  The reason is that SAS treats a missing numeric value as smaller than any number in a comparison, so a test such as “LE 19.1” is true for missing values and they all landed in the smallest group.  The way around this was adding “IF original variable = . THEN new variable = .;” at the beginning of each set of commands.  For example
IF breastcancerper100TH=. THEN breastcancerper100THgroup = .;
The code input into SAS is as follows:
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;
DATA new; set mydata.gapminder;
IF breastcancerper100TH=. THEN breastcancerper100THgroup = .;
ELSE IF breastcancerper100TH LE 19.1 THEN breastcancerper100THgroup=1; /*3.9-19.1*/
ELSE IF breastcancerper100TH LE 25.2 THEN breastcancerper100THgroup=2; /*19.5-25.2*/
ELSE IF breastcancerper100TH LE 33.4 THEN breastcancerper100THgroup=3; /*25.9-33.4*/
ELSE IF breastcancerper100TH LE 52.1 THEN breastcancerper100THgroup=4; /*34.2-52.1*/
ELSE breastcancerper100THgroup=5; /*52.5-101.1*/
IF incomeperperson=. THEN GDPpercapita = .;
ELSE IF incomeperperson LE 558.06287663 THEN GDPpercapita=1; /* 103.77585724-558.06287663*/
ELSE IF incomeperperson LE 1844.3510276 THEN GDPpercapita=2; /* 561.70858483-1844.3510276*/
ELSE IF incomeperperson LE 4699.4112621 THEN GDPpercapita=3; /* 1860.753895-4699.4112621*/
ELSE IF incomeperperson LE 13577.879885 THEN GDPpercapita=4; /* 4885.0467014-13577.879885*/
ELSE GDPpercapita=5; /*14778.163929-105147.4377*/
IF alcconsumption=. THEN alcoholconsumption = .;
ELSE IF alcconsumption LE 1.64 THEN alcoholconsumption=1; /* 0.03-1.64*/
ELSE IF alcconsumption LE 4.51 THEN alcoholconsumption=2; /*1.86-4.51*/
ELSE IF alcconsumption LE 7.38 THEN alcoholconsumption=3; /*4.71-7.38*/
ELSE IF alcconsumption LE 10.62 THEN alcoholconsumption=4; /*7.6-10.62*/
ELSE alcoholconsumption=5; /*10.71-23.01*/
IF suicideper100TH=. THEN suicideper100k= .;
ELSE IF suicideper100TH LE 4.5511212349 THEN suicideper100k=1; /*0.2014487237-4.5511212349*/
ELSE IF suicideper100TH LE 7.1848526001 THEN suicideper100k=2; /*4.6670246124-7.1848526001*/
ELSE IF suicideper100TH LE 9.927033 THEN suicideper100k=3; /*7.2023835182-9.927033*/
ELSE IF suicideper100TH LE 13.23981 THEN suicideper100k=4; /*10.05932-13.23981*/
ELSE suicideper100k=5; /*13.548419952-35.752872467*/
PROC SORT; by COUNTRY;
PROC FREQ; tables breastcancerper100THgroup GDPpercapita alcoholconsumption suicideper100k;
RUN;
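As an alternative to typing the cut points by hand, SAS can form the five groups itself. The sketch below uses PROC RANK with GROUPS=5. It is not the approach used above, the output names (ranked, bcgroup, gdpgroup, alcgroup, suicidegroup) are made up for illustration, and the group boundaries it picks may differ slightly from the hand-chosen ones when values are tied.

```sas
/* Sketch only: output data set and rank variable names are hypothetical. */
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

/* GROUPS=5 assigns each observation to a quintile numbered 0 to 4.
   Observations with a missing value get a missing group, so they do not
   pile up in the lowest quintile. */
PROC RANK DATA=mydata.gapminder GROUPS=5 OUT=ranked;
  VAR breastcancerper100TH incomeperperson alcconsumption suicideper100TH;
  RANKS bcgroup gdpgroup alcgroup suicidegroup;
RUN;

/* PROC FREQ reports missing values separately from the table percentages */
PROC FREQ DATA=ranked;
  TABLES bcgroup gdpgroup alcgroup suicidegroup;
RUN;
```

Note that the groups come out numbered 0 to 4 rather than 1 to 5, so group 0 here corresponds to group 1 in the hand-written version.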
I am separating the results into 2 parts.  The original tables are in the Google document:
Link: https://drive.google.com/file/d/0BzZBRwR8MdaMTlV2RkpxdmdvV0k/view?usp=sharing
The new tables:
[Image: frequency tables for the new grouped variables]
Since I do not know how to substitute the ranges into the frequency distribution tables, here is what each group number means:
breastcancerper100TH quintiles:
1 = 3.9-19.1 breast cancer per 100k per year
2 = 19.5-25.2 breast cancer per 100k per year
3 = 25.9-33.4 breast cancer per 100k per year
4 = 34.2-52.1 breast cancer per 100k per year
5 = 52.5-101.1 breast cancer per 100k per year
incomeperperson quintiles:
1 = 103.77585724-558.06287663 GDP per capita
2 = 561.70858483-1844.3510276 GDP per capita
3 = 1860.753895-4699.4112621 GDP per capita
4 = 4885.0467014-13577.879885 GDP per capita
5 = 14778.163929-105147.4377 GDP per capita
Alcohol consumption quintiles:
1 = 0.03-1.64 L per person per year
2 = 1.86-4.51 L per person per year
3 = 4.71-7.38 L per person per year
4 = 7.6-10.62 L per person per year
5 = 10.71-23.01 L per person per year
Suicides per 100k quintiles:
1 = 0.2014487237-4.5511212349 suicides per 100k per year
2 = 4.6670246124-7.1848526001 suicides per 100k per year
3 = 7.2023835182-9.927033 suicides per 100k per year
4 = 10.05932-13.23981 suicides per 100k per year
5 = 13.548419952-35.752872467 suicides per 100k per year
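One way to get ranges like these printed directly in the frequency tables is a user-defined format. The sketch below assumes the data set new and the group variables created by the program above. The format names bcfmt and alcfmt are made up for illustration, and only two of the four variables are shown; the other two would follow the same pattern.

```sas
/* Sketch only: "bcfmt" and "alcfmt" are hypothetical format names. */
PROC FORMAT;
  VALUE bcfmt
    1 = "3.9-19.1"
    2 = "19.5-25.2"
    3 = "25.9-33.4"
    4 = "34.2-52.1"
    5 = "52.5-101.1";
  VALUE alcfmt
    1 = "0.03-1.64"
    2 = "1.86-4.51"
    3 = "4.71-7.38"
    4 = "7.6-10.62"
    5 = "10.71-23.01";
RUN;

/* The FORMAT statement swaps the group numbers for the range labels
   in the printed table without changing the stored values. */
PROC FREQ DATA=new;
  TABLES breastcancerper100THgroup alcoholconsumption;
  FORMAT breastcancerper100THgroup bcfmt. alcoholconsumption alcfmt.;
RUN;
```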
running gapminder through sas
I ran the SAS program for 4 variables.  The first is breast cancer per 100k (called breastcancerper100k).  The second is income per capita in dollars (GDPpercapita).  The third is alcohol consumption (alcoholconsumption).  The last is suicides per 100k (suicidesper100k).  Unfortunately, since each of these is a continuous variable rather than a categorical variable, the frequency data is a mess.
Here is the program:
[Image: SAS program]
Since I don’t know how to upload PDFs and the tables created are huge, I am linking a Google doc instead: https://drive.google.com/file/d/0BzZBRwR8MdaMTlV2RkpxdmdvV0k/view?usp=sharing
Unfortunately, no frequency results can be gleaned from this data just yet: because all of these statistics are continuous variables, each country can have a unique value per variable.  However, I intend to do regression analysis on the data to determine correlations.
Data Science + Analytics lesson 1
I enrolled in the Data Science and Analytics certificate program to learn how to work with SAS or Python.  As I am a public health major with a focus on health policy and management, I was interested in the intersection of health and economics.  The only codebook I could find with a global focus on health policy was the GapMinder dataset.
The question I wish to pursue throughout this program is whether, in the gapminder data, GDP per capita (variable name incomeperperson) is correlated with breast cancer incidence per 100,000 (breastcancerper100TH) and, if so, in which direction.  The GDP per capita is taken from the World Bank World Development Indicators and states the 2010 GDP per capita in constant 2000 US dollars; it does not take into account differences in cost of living.  The breast cancer per 100,000 comes from the International Agency for Research on Cancer and is the number of new breast cancer cases in 2002 per 100,000 females.  In epidemiology, the number of new cases of a disease divided by the total population at risk (in this case females, because they are the ones who get breast cancer) is called incidence, and this is the term I will use in the rest of this post.
A second topic I might pursue, but have not yet researched, is suicide incidence vs income.
To begin the literature search I used EBSCO Discovery Service (EDS) with the search criteria “Breast Cancer Statistics”.  Krieger et al. found that in the United States there was a clear pattern of breast cancer incidence when correlated with race and income.  The breast cancer incidence for each race between 2006 and 2010 was: 84.7 for Asians, 90.3 for Native Americans, 91.1 for Hispanics, 118.4 for African Americans and 127.3 for Non-Hispanic Whites (1).  Furthermore, high income counties had a higher incidence of breast cancer than lower income counties.  This correlation held across all racial groups except in the case of non-Hispanic African Americans over age 70 (1).  The authors speculated that one cause of this correlation was the use of hormone replacement therapy in postmenopausal women (1).  More recent data (2008-2012) from DeSantis et al. showed that incidence rates by race are converging, with rates rising among African American and Asian women while staying stable among white, Hispanic and Native American populations (2).
Given this correlation and the fact that the data was collected in 2002, I am expecting to see a strong positive correlation between breast cancer incidence and income per capita.
Works Cited:
1) Krieger N, Chen J, Waterman P. Decline in US Breast Cancer Rates After the Women's Health Initiative: Socioeconomic and Racial/Ethnic Differentials. American Journal Of Public Health [serial online]. April 2, 2010;100:S132-S139. Available from: Education Source, Ipswich, MA. Accessed September 24, 2017.
2) DeSantis C, Fedewa S, Goding Sauer A, Kramer J, Smith R, Jemal A. Breast cancer statistics, 2015: Convergence of incidence rates between black and white women. CA: A Cancer Journal For Clinicians [serial online]. January 2016;66(1):31-42. Available from: MEDLINE with Full Text, Ipswich, MA. Accessed September 24, 2017.