thoughtfullyinstantninja-blog
Data Management, Analysis and Visualization Blogs
8 posts
Week 4 - K-means Cluster Analysis
For K-means cluster analysis I chose a different data set from my previous assignments: the “Cereal” data set from kaggle.com (https://www.kaggle.com/crawford/80-cereals). The dataset consists of different cereals as observations, with features describing each cereal. The features I used in my analysis are: calories, protein, fat, sodium, fiber, carbo, sugars, potass, vitamins, shelf, weight, cups, and rating.
We also have other features, namely name (cereal name), mfr (cereal manufacturer code), and type (type of cereal). These identify the observations and hence were not used in the analysis.
Code:
libname BH "/home/nishamathibits0/MyData";

proc import datafile='/home/nishamathibits0/MyData/cereal.csv'
    out=BH.cereal replace dbms=csv;
    datarow=3;
run;

data clust;
    set BH.cereal;
    idnum=_n_;
    keep calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating idnum;
    * delete observations with missing data;
    if cmiss(of _all_) then delete;
run;

ods graphics on;

/* No need for 2 datasets for this assignment
proc surveyselect data=clust out=traintest seed=123 samprate=0.7 method=srs outall;
run;
data clus_train clus_test;
    set traintest;
    if selected=1 then output clus_train;
    else output clus_test;
run;
*/

* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clust out=clustvar mean=0 std=1;
    var calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups;
run;

%macro kmean(K);
    proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K.
        maxclusters=&K. maxiter=300;
        var calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups;
    run;
%mend;
%kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5);
%kmean(6); %kmean(7); %kmean(8); %kmean(9);

* extract r-square and ccc (cubic clustering criterion) values from each
  cluster solution and then merge them to plot elbow curves;
%macro myplot(i);
    data clus&i.rsq clus&i.ccc;
        set cluststat&i.;
        nclust=&i.;
        if _type_='RSQ' then output clus&i.rsq;
        else if _type_='CCC' then output clus&i.ccc;
    run;
%mend;
%myplot(1); %myplot(2); %myplot(3); %myplot(4); %myplot(5);
%myplot(6); %myplot(7); %myplot(8); %myplot(9);

data clusrsq (keep=nclust over_all);
    set clus1rsq clus2rsq clus3rsq clus4rsq clus5rsq clus6rsq clus7rsq clus8rsq clus9rsq;
run;

data clusccc (keep=nclust over_all);
    set clus1ccc clus2ccc clus3ccc clus4ccc clus5ccc clus6ccc clus7ccc clus8ccc clus9ccc;
run;

* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsq;
    plot over_all*nclust;
run;

* plot elbow curve using ccc values;
symbol1 color=green interpol=join;
proc gplot data=clusccc;
    plot over_all*nclust;
run;
quit;

/* further examine the cluster solution for the number of clusters suggested by the elbow curve */

* plot clusters for the 7 cluster solution;
proc candisc data=outdata7 out=clustcan;
    class cluster;
    var calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups;
run;

proc sgplot data=clustcan;
    scatter y=can2 x=can1 / group=cluster;
run;

* validate clusters on rating;
* first merge the cluster assignments with the rating data;
data rating_data;
    set clust;
    keep idnum rating;
run;

proc sort data=outdata7; by idnum; run;
proc sort data=rating_data; by idnum; run;

data merged;
    merge outdata7 rating_data;
    by idnum;
run;

proc sort data=merged; by cluster; run;

proc means data=merged;
    var rating;
    by cluster;
run;

proc anova data=merged;
    class cluster;
    model rating = cluster;
    means cluster / tukey;
run;
Analysis:
A series of k-means cluster analyses was conducted on the whole data set specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. I also used cubic clustering criterion (CCC) values for all the cluster solutions to help decide how many clusters to choose for this data set. Theory suggests choosing the number of clusters at a local maximum of the CCC, just before the values begin to decrease. Based on this, I chose 7 clusters for my interpretation.
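For readers without SAS, the same elbow computation can be sketched in Python with scikit-learn. This is a minimal sketch on simulated stand-in data (the real analysis uses the 12 standardized cereal features), converting each KMeans within-cluster sum of squares into an r-square against the total sum of squares:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X_raw = rng.normal(size=(77, 12))   # stand-in for the 12 cereal clustering features

# Standardize to mean 0, sd 1 (the PROC STANDARD step)
X = StandardScaler().fit_transform(X_raw)

# Total sum of squares, used to convert within-cluster SS into an r-square
total_ss = ((X - X.mean(axis=0)) ** 2).sum()

r_squared = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X)
    # inertia_ is the within-cluster sum of squares; R^2 = 1 - WSS/TSS
    r_squared[k] = 1 - km.inertia_ / total_ss
# plotting r_squared against k gives the elbow curve
```

Plotting `r_squared` against k reproduces the elbow curve; the CCC has no direct scikit-learn equivalent, so only the r-square curve is sketched here.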
Elbow curve of R-square vs. number of clusters
[figure omitted]
Plot of CCC vs. number of clusters
[figure omitted]
Canonical discriminant analysis was used to reduce the clustering variables down to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (shown below) indicated that the observations in cluster 7 were densely packed, with relatively low within-cluster variance, though it overlaps with cluster 5. Cluster 1 was generally distinct, but its observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 6 were spread out more than those in the other clusters, showing high within-cluster variance and probably containing outliers. This plot suggests that the best cluster solution may have fewer than 7 clusters, so it will be especially important to also evaluate solutions with fewer than 7 clusters.
[figure omitted]
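A rough Python equivalent of the PROC CANDISC projection, using scikit-learn's LinearDiscriminantAnalysis in place of canonical discriminant analysis, on simulated stand-in data rather than the cereal features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(77, 12))                     # stand-in feature matrix
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

# LDA on the cluster labels yields discriminant ("canonical") axes;
# the first two play the role of can1/can2 in the scatter plot
lda = LinearDiscriminantAnalysis(n_components=2)
can = lda.fit_transform(X, labels)                # shape (77, 2)
# a scatter of can[:, 0] vs can[:, 1] colored by `labels` reproduces the plot
```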
The following interpretation of cluster means is for the 7-cluster solution.
[table omitted]
As we can see from the cluster means above, cluster 1, which has 14 observations, has the highest calories, fat content, and weight. Most of the bran-based cereals fall into this category, depending on their sugar and carbohydrate content. Cluster 2 cereals seem like a good fit for those who want high protein content with low sugars and carbs; 100% Natural Bran and Quaker Oatmeal are the only two cereals in this category. Considering what we know about nutrition, this by far seems like the best cluster for those looking to eat healthy. However, the shelf values for these two brands are much lower than for the other cereals in the data set. Out of curiosity, I looked into the cereals I eat (Fruity Pebbles and Froot Loops); they fall in cluster 7, whose defining trait is sugar content. There are 37 cereals in this category, including Corn Pops, Cap'n Crunch, Golden Grahams, Raisin Nut Bran, and Smacks.
To externally validate the clusters, an analysis of variance (ANOVA) was conducted to test for significant differences between the clusters on rating. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on rating (F(6, 70)=7.02, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on rating. Cereals in cluster 4 had the highest rating (mean=73.8, sd=14.51), and cluster 7 had the lowest, followed closely by cluster 1. Even though cluster 1 seems to have the best quality ingredients health-wise, it has a low rating, probably because it doesn't taste as good as the sugary cluster, cluster 7. Cluster 7 has a low rating because its main ingredient is essentially just sugar. So both cluster 1 and cluster 7 have low ratings, but for different qualities of the cereals. Since rating is human-based, it might not be a good validator for the clusters in this data set; nevertheless, it gives some insight for interpreting them.
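The validation step itself is a one-way ANOVA of rating by cluster. A minimal SciPy sketch, using made-up ratings for three of the clusters rather than the actual cereal data:

```python
from scipy import stats

# Illustrative ratings for three of the seven clusters (not the real data)
ratings_by_cluster = {
    1: [30.5, 28.1, 33.0, 29.4],
    4: [70.2, 74.8, 68.9],
    7: [22.0, 25.3, 19.8, 24.1],
}

# One-way ANOVA: does mean rating differ across clusters?
f_stat, p_value = stats.f_oneway(*ratings_by_cluster.values())
# a post hoc Tukey test (the means cluster/tukey step) would follow in SAS
```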
[output omitted]
Week 3 Assignment
Post edited to include the code that was used to perform the analysis:
LIBNAME mydata "/home/nishamathibits0/MyData";
%let DT_dat = "/home/nishamathibits0/MyData/ContraceptiveDataset.txt";

data contracep;
    infile '/home/nishamathibits0/MyData/ContraceptiveDataset.txt' dsd dlm=',' lrecl=50000;
    input WAge:3. Wedu:1. Hedu:1. Child:3. Wreligion:1. Wworking:1.
          Hoccup:1. SOLindx:1. Media:1. Contra:1.;
run;

ods graphics on;

* split data randomly into test and training data;
proc surveyselect data=contracep out=traintest seed=100 samprate=0.7 method=srs outall;
run;

* lasso multiple regression with the LARS algorithm and k=5 fold cross-validation;
proc glmselect data=traintest plots=all seed=123;
    partition role=selected(train='1' test='0');
    model Contra = WAge Wedu Hedu Child Wreligion Wworking Hoccup SOLindx Media
        / selection=lar(choose=cv stop=none) cvmethod=random(5);
run;
A lasso regression analysis was conducted to identify the subset of variables, from a pool of 9 categorical and quantitative predictors, that best predicted a categorical response variable measuring use of contraceptives among married couples. Categorical predictors included husband's education, wife's education, wife's religion, wife currently working, husband's occupation, standard-of-living index, and media exposure (binary); quantitative predictors were wife's age and number of children.
Data were randomly split into a training set that included 70% of the observations (N=1032) and a test set that included 30% of the observations (N=441). 
[output omitted]
The least angle regression (LARS) algorithm with k=5 fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
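A comparable workflow in Python uses scikit-learn's LassoLarsCV, which runs the LARS path with k-fold cross-validation. This sketch uses simulated data with 9 predictors (not the contraceptive dataset); the retained subset is read off the nonzero coefficients:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
X = rng.normal(size=(1473, 9))          # 9 predictors, as in the dataset
# simulated response driven by two of the predictors, plus noise
y = 0.8 * X[:, 3] - 0.5 * X[:, 0] + rng.normal(scale=1.0, size=1473)

# 70/30 split, mirroring PROC SURVEYSELECT with samprate=0.7
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

# LARS path with k=5-fold CV chooses the penalty by mean squared error
model = LassoLarsCV(cv=5).fit(X_train, y_train)
retained = np.flatnonzero(model.coef_)  # indices of predictors kept in the model
```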
[figure omitted]
Summary
1. Of the 9 predictor variables, 7 were retained in the selected model. During the estimation process, wife's age, wife's education level, media exposure, and number of children were most closely correlated with the response variable “use of contraception”, followed by standard-of-living index, wife working, and wife's religion. Wife's age and media exposure were negatively associated with use of contraception, while wife's education and number of children were positively associated.
[figures omitted]
2. The current model explains only about 10% of the variance (adjusted R-squared) in the data, though the test average squared error drops by about 8% as model complexity increases. Overall, this method of variable selection may not have been the best for this data set, since the variance is very high and the model cannot explain it with the given set of predictors.
[figures omitted]
Week 2 - Random forest
The code for running a random forest is shown below for my dataset. The goal of my tree prediction is to answer the question, “Will a couple be using contraceptives, given certain attributes/features we know about them?” This is a simple classification problem. In week 1, I discussed reading and managing the data set (please refer to the previous post if you are interested in the data set itself and the data management involved). In brief, the following explanatory variables were included as possible contributors to a binary classification tree model evaluating the contraceptive method of choice (my response variable): wife's age, wife's education, husband's education, number of children ever born, wife's religion, whether the wife is now working, husband's occupation, standard-of-living index, and media exposure.
[code screenshot omitted]
The model results are as follows. The number of variables sampled at each node split was left at the default (5). 60% of the data was used for training the ensemble, and 100 decision trees were grown with 5 variables randomly selected at every split. There are no missing data, so all observations were used.
[output omitted]
The fit statistics for the top 19 trees in my data set are shown below:
[output omitted]
Summary:
[output omitted]
The explanatory variables with the highest relative importance scores were number of children, whether the wife had a college education, and media exposure. The accuracy of the random forest was 74% (OOB), and the error levels off after growing roughly 20 trees, suggesting that a fairly small ensemble (or even interpretation of a single decision tree) may be adequate. The following plot shows how the misclassification rate decreases for training vs. OOB data. As we can see, there is no significant decrease in the OOB misclassification rate beyond 20 or so trees. The misclassification rate is much lower on training than on OOB data, but this is expected. A misclassification rate of 30% (70% accuracy) on OOB data is reasonably good considering that it acts as a new testing data set. In conclusion, for this dataset I am inclined to say the random forest approach is preferable: its OOB misclassification rate is 30%, while the misclassification rate of the whole dataset in a single decision tree is 26%, a small trade-off when we want the model to predict better in general. I would use the random forest approach to predict class values for new data.
[figure omitted]
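For reference, a minimal Python sketch of the same setup with scikit-learn — 100 trees, OOB scoring, and variable importances — on simulated stand-in data rather than the contraceptive dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1473, 9))          # stand-in for the 9 explanatory variables
# binary response driven mainly by predictor 3, weakly by predictor 1
y = (X[:, 3] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=1473) > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=100,      # 100 trees, as in the post
    max_features="sqrt",   # variables sampled at each node split
    oob_score=True,        # track out-of-bag accuracy (the 74% figure)
    random_state=1,
).fit(X, y)

oob_accuracy = rf.oob_score_
importances = rf.feature_importances_   # relative importance per predictor
```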
Machine Learning for Data Analysis - Week 1
For my classification analysis, I chose a data set that is available at the UC Irvine Machine Learning Repository.
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables (2 continuous, 3 binary, and 4 categorical) and a categorical response variable, which was converted to a binary response. For the present analysis, the entropy “goodness of split” criterion was used to grow the tree, and a cost-complexity algorithm was used to prune the full tree into a final subtree. The complete code is shown below.
Step 1: Importing the data.
[code screenshot omitted]
Step 2: Data management to set it up for analysis.
[code screenshot omitted]
Step 3: Analysis-Classification using Decision trees
[code screenshot omitted]
The following explanatory variables were included as possible contributors to a binary classification tree model evaluating the contraceptive method of choice (my response variable): wife's age, wife's education, husband's education, number of children ever born, wife's religion, whether the wife is now working, husband's occupation, standard-of-living index, and media exposure.
The outputs are as follows:
[output screenshots omitted]
Summary:
“Number of children ever born” was the first variable to separate the sample into two subgroups. If the couple had at least 1 child (>=0.160 is the exact cutoff), they are more inclined to use a short- or long-term contraceptive method.
Of the couples who had at least 1 child, a further subdivision was made with the dichotomous variable of college-level education for the wife (i.e., wife's education category = college). As the tree shows, this was further split on “number of children ever born” if the wife had a college education, and on “wife's age” if she did not. The total model correctly classified 74% of the sample (100 − misclassification), including 86% of non-users of contraceptives (sensitivity) and 57% of those who will use contraceptives (specificity).
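A minimal Python sketch of the same kind of model — an entropy-based tree with cost-complexity pruning in scikit-learn — on simulated data where a single stand-in variable (playing “number of children”) drives the root split:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(1473, 9))             # stand-in explanatory variables
y = (X[:, 3] > 0.16).astype(int)           # variable 3 plays "number of children"

tree = DecisionTreeClassifier(
    criterion="entropy",   # the "goodness of split" criterion used in the post
    ccp_alpha=0.01,        # cost-complexity pruning of the full tree
    random_state=7,
).fit(X, y)

# the root split recovers the dominant variable and its cutoff
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
```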
Week 4 Assignment
I have created both bar charts and scatter plots to understand the correlation between the variables I am interested in.
Code:
[code screenshot omitted]
Outputs:
Plot 1:
[figure omitted]
This scatter plot shows how alcohol consumption is correlated with the number of breast cancer cases across countries. The relationship appears roughly linear once outlier countries are ignored: as the number of breast cancer cases increases, alcohol consumption increases as well. However, there are some countries where this does not hold, which can be considered outliers.
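The strength of such a linear relationship can be quantified with a Pearson correlation. A minimal SciPy sketch on illustrative numbers (not the Gapminder values):

```python
from scipy import stats

# Illustrative country-level values (not the actual Gapminder data)
alcohol = [2.1, 5.0, 7.5, 9.8, 12.3, 14.0]            # liters per capita
cancer_per_100k = [15.0, 28.0, 35.0, 52.0, 60.0, 71.0]

# Pearson r quantifies the linear association seen in the scatter plot
r, p = stats.pearsonr(alcohol, cancer_per_100k)
```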
Plot 2:
[figure omitted]
This scatter plot shows how CO2 emission rates relate to the number of breast cancer cases across countries. CO2 emissions do not seem to have any relationship with how many breast cancer cases were detected.
Plot 3:  
[figure omitted]
Plot 4:
[figure omitted]
Plots 3 and 4 are vertical and horizontal bar charts of alcohol consumption and CO2 emissions (the binned variables I am analyzing). Since the data are binned, with bins chosen by the percentage of data falling in each category, the bar charts may not be of much analytical interest. They are presented here to give a visual sense of the distributions, and because part of this assignment's evaluation is based on univariate graphs.
Legend
M - Missing data
LT5 - Number of countries reporting less than 5 liters of alcohol consumption
BTW 5&10 - Number of countries reporting between 5 and 10 liters of alcohol consumption
BTW 10&15 - Number of countries reporting between 10 and 15 liters of alcohol consumption
BTW 15&20 - Number of countries reporting between 15 and 20 liters of alcohol consumption
GT20 - Number of countries reporting greater than 20 liters of alcohol consumption
LT 1M - Number of countries with less than 1 million metric tonnes of CO2 emissions
BTW 1&100M - Number of countries with CO2 emissions between 1 million and 100 million metric tonnes
BTW 100M&1B - Number of countries with CO2 emissions between 100 million and 1 billion metric tonnes
GT 1B - Number of countries with CO2 emissions greater than 1 billion metric tonnes
In summary,
Alcohol consumption in different countries seems to be a contributing factor to the number of reported breast cancer cases per 100 thousand population, while CO2 emissions do not appear related to the number of breast cancer cases in a country.
Week 3 - Coursera Assignment Gapminder dataset
I wrote code to calculate the frequencies of three variables in their binned form. Since the variables I chose are continuous, I binned the data into several categories using IF conditions, so that the frequency procedure results make more sense. For example, alcohol consumption of a country is reported in liters, with values ranging from 0.03 to 23.01 liters in increments as small as 0.01. I therefore binned this variable in SAS so that, for example, any observation with alcohol consumption below 5 liters is assigned the value “Less than 5 liters”, and so on. The PROC FREQ outputs are for the binned versions of the three variables (alcohol consumption, breast cancer cases per 100 thousand, and CO2 emissions).
The code is as follows.
[code screenshot omitted]
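For comparison, the same binning step can be sketched in pandas, where pd.cut plays the role of the chained IF/ELSE conditions in the SAS data step; the consumption values below are illustrative:

```python
import pandas as pd

# Illustrative consumption values in liters (not the Gapminder data)
alcohol = pd.Series([0.03, 3.2, 6.7, 11.4, 16.9, 23.01, None])

# pd.cut assigns each value to a bin, like the chained IF/ELSE conditions;
# missing values stay missing (NaN) rather than getting a bin
binned = pd.cut(
    alcohol,
    bins=[0, 5, 10, 15, 20, float("inf")],
    labels=["LT5", "BTW 5&10", "BTW 10&15", "BTW 15&20", "GT20"],
)
counts = binned.value_counts()   # the frequency table, like PROC FREQ
```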
The outputs for the three variables:
[output omitted]
In summary,
I collapsed the responses for alcohol consumption, number of breast cancer cases per 100 thousand females, and CO2 emissions in metric tonnes to create three new variables: alconscat, brstcanper100, and cardioxemissions, respectively. For alconscat, most countries have an alcohol consumption rate of less than 5 liters (38.03%). For brstcanper100, most countries have between 10 and 30 breast cancer cases per 100 thousand (38.50%). For cardioxemissions, most countries have CO2 emissions greater than 10 billion metric tonnes (about 30.05% of the dataset). In all, based on the frequency distributions, the modal category for each of the three variables accounts for roughly 30 to 40% of the dataset.
Week 2 - Coursera Assignment Gapminder dataset
Below is code to calculate the frequencies of three variables in both raw and binned form. Since the variables I chose are continuous, I binned the data into several categories based on the data itself, so that the frequency procedure results make more sense. For example, alcohol consumption of a country is reported in liters, with values ranging from 0.03 to 23.01 liters in increments as small as 0.01. I therefore binned this variable in SAS so that, for example, any observation with alcohol consumption below 5 liters is assigned the value “Less than 5 liters”, and so on. I show partial PROC FREQ output for one variable's raw data (alcohol consumption), though the code runs for all the raw variables; the rest of the PROC FREQ outputs are for the binned versions of the three variables (alcohol consumption, breast cancer cases per 100 thousand, and CO2 emissions).
Code
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
data new;
    set mydata.gapminder;
    length alconscat $50 brstcanper100 $50 cardioxemissions $100;
    label alcconsumption="Alcohol Consumption in liters - raw data"
          breastcancerper100th="Breast cancer cases per 100 thousand females - raw data"
          co2emissions="CO2 emissions in metric tonnes - raw data"
          alconsCat="Alcohol Consumption in liters - binned/categorised data"
          cardioxemissions="CO2 emissions in metric tonnes - binned/categorised data"
          brstcanper100="Breast cancer cases per 100 thousand females - binned/categorised data";

    if alcconsumption eq . then alconsCat="Missing Data";
    else if alcconsumption le 5 then alconsCat="Less than 5 liters";
    else if alcconsumption le 10 then alconsCat="Between 5 and 10 liters";
    else if alcconsumption le 15 then alconsCat="Between 10 and 15 liters";
    else if alcconsumption le 20 then alconsCat="Between 15 and 20 liters";
    else alconsCat="Greater than 20 liters";

    if breastcancerper100th eq . then brstcanper100="Missing data";
    else if breastcancerper100th le 10 then brstcanper100="Less than 10 cases";
    else if breastcancerper100th le 30 then brstcanper100="Between 10 and 30 cases";
    else if breastcancerper100th le 50 then brstcanper100="Between 30 and 50 cases";
    else if breastcancerper100th le 70 then brstcanper100="Between 50 and 70 cases";
    else if breastcancerper100th le 90 then brstcanper100="Between 70 and 90 cases";
    else brstcanper100="Greater than 90 cases";

    if co2emissions eq . then cardioxemissions="Missing Data";
    else if co2emissions le 1000000 then cardioxemissions="Less than 1,000,000 metric tonnes";
    else if co2emissions le 10000000 then cardioxemissions="Between 1,000,000 and 10,000,000 metric tonnes";
    else if co2emissions le 100000000 then cardioxemissions="Between 10,000,000 and 100,000,000 metric tonnes";
    else if co2emissions le 1000000000 then cardioxemissions="Between 100,000,000 and 1,000,000,000 metric tonnes";
    else if co2emissions le 10000000000 then cardioxemissions="Between 1,000,000,000 and 10,000,000,000 metric tonnes";
    else cardioxemissions="Greater than 10,000,000,000 metric tonnes";
run;
proc sort data=new; by Country; run;

proc freq data=new;
    tables alcconsumption breastcancerper100th co2emissions alconsCat brstcanper100 cardioxemissions;
run;
Output
Variable 1: Alcohol Consumption - raw data
[output omitted]
Variable 1: Alcohol Consumption - binned data
[output omitted]
Variable 2: Breast cancer cases per 100 thousand females - binned data
[output omitted]
Variable 3: CO2 emissions in metric tonnes - binned data
[output omitted]
1. The research question I have chosen for my project is: “Do CO2 emissions in a country have any correlation with the number of breast cancer cases identified? And are there other factors in a country that affect the number of breast cancer cases diagnosed?”
2. I am using Gapminder dataset for my purposes.
3. My codebook variables of interest from Gapminder dataset are as follows (only variable names are being listed here) :
a. co2emissions b. breastcancerper100th c. HIVrate d. alcoholconsumption e. annualHIVdeath (available on the Gapminder website)
4. My literature review search terms to see what research have been done on this subject are: “alcohol consumption breast cancer”, “social conditions leading breast cancer”, “hiv and breast cancer”.
5. Literature summary: Along with a known set of risk factors for breast cancer (such as age, sex, and race), regular alcohol consumption of 15 to 30 g or more per day does increase the chance of breast cancer. A retrospective study concluded that the association of HIV with breast cancer occurrence is unclear from their data; they had a minimal dataset of only 305 patients.
6. My hypothesis is that one factor alone might not influence the rate of breast cancer, but a combination of the above factors might be influential in increasing or decreasing it. In conclusion, I expect to find a strong correlation between the above factors and the breastcancerper100th variable.
References:  1.  King, R. D., et al. "The effect of occlusion on carbon dioxide emission from human skin." Acta dermato-venereologica 58.2 (1977): 135-138.
2.  Oluwole, S. F., Ali, A. O., Shafaee, Z. and Depaz, H. A. (2005), Breast cancer in women with HIV/AIDS: Report of five cases with a review of the literature. J. Surg. Oncol., 89: 23–27. doi:10.1002/jso.20171
3.  Willett, Walter C., et al. "Moderate alcohol consumption and the risk of breast cancer." New England Journal of Medicine 316.19 (1987): 1174-1180.