jamie-turrin-blog
Mars Craters Update
65 posts
jamie-turrin-blog · 8 years ago
Machine Learning K-Means Clustering
In all previous assignments I have used the Mars Crater data set, but it contains mostly categorical variables and only a couple of numerical variables, so it is not well suited to K-means. I have therefore chosen a data set downloaded from the University of California, Irvine (UCI) Machine Learning Repository: 180 observations of 79 variables recording body-motion data from a Samsung smartphone. Six activities were studied: laying, sitting, standing, walking, walking downstairs, and walking upstairs. I will use K-means clustering to see if I can divide the data into 6 separate clusters, one for each activity.
Because I know there should be 6 clusters, I do not need to run the clustering procedure multiple times to find the best number of clusters. I just need to run it once with k=6 and then determine whether the 6 clusters are significantly different from each other. If they are, then the body motions associated with each activity are significantly different, and the clusters could be used to classify future data to determine a person's type of motion.
The data are first standardized, then run through the FASTCLUS procedure with k=6. The output of the clustering procedure is fed into the CANDISC procedure to compute the first several canonical discriminants (derived variables carrying the most information), which are plotted to visually judge whether the clusters are real. After plotting, the training data and cluster data are sorted, merged, and put through the ANOVA procedure to test the significance of each cluster.
The plot of Can1 vs. Can2 shows that clusters 2, 5, and 6 are not well separated, and cluster 3 has only a single observation. The plot of Can3 vs. Can4 shows better separation of the clusters. The plot of Active vs. Cluster (produced using the ANOVA procedure) shows that activities 4 and 5 (walking downstairs and walking upstairs) are not represented in any of the clusters, while walking (activity 6) is spread among clusters 1, 2, 5, and 6. Likewise, laying (activity 1) is spread between clusters 3 and 4. Lastly, cluster 4 includes activities 1, 2, and 3 (laying, sitting, standing). Clearly, the clusters found by the K-means technique do not uniquely identify the 6 activities.
 DATA train;  
* where to find data, skip first line, data begins at line 2;
INFILE '/home/jturrin0/Body_Motion_Means.txt' FIRSTOBS = 2;  
* how to read the data, Activity is in columns 1-22, then 79 numeric variables;
INPUT Activity $ 1-22 var1-var79;
*Create new variable with 6 levels, based on activity;                      
* Use ELSE IF so the 'WALKING' test cannot overwrite the DOWNSTAIRS/UPSTAIRS codes,;
* since FIND(Activity,'WALKING') also matches WALKING_DOWNSTAIRS and WALKING_UPSTAIRS;
IF FIND(Activity, 'LAYING') GE 1 THEN Active = 1;
ELSE IF FIND(Activity, 'SITTING') GE 1 THEN Active = 2;
ELSE IF FIND(Activity, 'STANDING') GE 1 THEN Active = 3;
ELSE IF FIND(Activity, 'DOWNSTAIRS') GE 1 THEN Active = 4;
ELSE IF FIND(Activity, 'UPSTAIRS') GE 1 THEN Active = 5;
ELSE IF FIND(Activity, 'WALKING') GE 1 THEN Active = 6;
 idnum = _N_;  * id for merging datasets later;
RUN;
 * standardize training data;
PROC STANDARD DATA=train OUT=train_standardized MEAN=0 STD=1;
VAR var1-var79;
RUN;
 * Since this is a supervised cluster analysis, I know before hand there are 6 clusters;
* one cluster for each activity, so I don't need to perform clustering for k=1,2,3,4,5;
* Just run clustering for k=6;
PROC FASTCLUS DATA=train_standardized  
OUT=cluster_data
OUTSTAT=cluster_stats
MAXCLUSTERS=6
MAXITER=300;
VAR var1-var79;
RUN;
 * compute 1st and 2nd canonical discriminant variables for plotting purposes;
PROC CANDISC DATA=cluster_data ANOVA OUT=canonical_data;
CLASS cluster; *categorical variable, cluster number;
VAR var1-var79;
RUN;
 * plot canonical variables to see clusters;
PROC SGPLOT DATA=canonical_data;
SCATTER Y=Can2 X=Can1 /
MARKERATTRS = (SYMBOL = CIRCLEFILLED  SIZE = 2MM)
GROUP=Cluster;
TITLE 'Canonical Variables Identified by Cluster';
RUN;
 * better cluster separation is seen by plotting Can4 vs Can3;
PROC SGPLOT DATA=canonical_data;
SCATTER Y=Can4 X=Can3 /
MARKERATTRS = (SYMBOL = CIRCLEFILLED  SIZE = 2MM)
GROUP=Cluster;
TITLE 'Canonical Variables Identified by Cluster';
RUN;
 * sort datasets by idnum before merging;
PROC SORT DATA=train; BY idnum; RUN;
PROC SORT DATA=cluster_data; BY idnum; RUN;
 * merge datasets so I can run ANOVA on clustering results to see if clusters are significant;
DATA merged;
MERGE train cluster_data;
BY idnum;
RUN;
 * Run ANOVA to see if there are significant differences between clusters;
PROC ANOVA DATA=merged;
CLASS cluster; * categorical explanatory variable; the dependent variable Active is not listed in CLASS;
MODEL Active=cluster;
MEANS cluster/tukey;
RUN;
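For readers outside SAS, the same pipeline can be sketched in Python with scikit-learn. This is only an illustrative analogue: the feature matrix below is a random stand-in for the 180 × 79 body-motion table, and LDA fit on the cluster labels plays the role of CANDISC.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Random stand-in for the 180 x 79 body-motion matrix and its 6 activity labels.
X = rng.normal(size=(180, 79))
y = rng.integers(1, 7, size=180)

X_std = StandardScaler().fit_transform(X)  # analogue of PROC STANDARD
# analogue of PROC FASTCLUS with MAXCLUSTERS=6
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_std)
# analogue of PROC CANDISC: canonical discriminants of the cluster labels
canon = LinearDiscriminantAnalysis(n_components=4).fit_transform(X_std, clusters)
# cross-tabulate activity vs. cluster, the comparison the ANOVA step examines
table = np.zeros((6, 6), dtype=int)
for act, cl in zip(y, clusters):
    table[act - 1, cl] += 1
```

Plotting `canon[:, 0]` against `canon[:, 1]` colored by cluster corresponds to the Can1 vs. Can2 scatter plot above.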
Machine Learning Assignment 3 LASSO Regression
This week I used LASSO regression to model a quantitative response variable (crater depth) using two quantitative predictor variables (crater diameter, latitude) and three categorical predictor variables (rampart edge, hummocky texture, circular outline).
 The procedure split the original dataset into 2 sets, a training set with 70% of the observations, and a test set with 30%. Cross validation was used to find the best set of predictor coefficients that produced the lowest model error. The cross validation method was k-fold, using 10 folds and all variables.
 The LASSO procedure chooses the best predictor variables from among many; in this case all variables in the model were retained, as signified by the asterisk placed next to the final variable (circular_outline) in the LAR Selection Summary table (see below). The most important predictor variable is the square root of the crater diameter, followed by the absolute value of latitude, then rampart_edge, hummocky_texture, and lastly circular_outline. The test average squared error reaches a minimum of 0.043 when all 5 predictor variables are included in the model. The graph of ASE for the training and test sets shows the two are very similar, indicating the model is neither under- nor overfit.
 The graph of coefficient progression shows how the predictor coefficients evolve as more variables are added to the model. Ultimately, it shows that the square root of crater diameter and the absolute value of latitude have the coefficients of greatest magnitude, indicating they have the most influence on the response variable. The addition of the three categorical variables adds little to the model, but they do slightly lower the average squared error, so they are included in the final model. The graph of CVPRESS illustrates this fact: the residual sum of squares decreases markedly with crater diameter and latitude but changes very little when the categorical variables are included.
 The final model explains just over 70% of the variability in crater depth (R-squared = 0.708) and has the following form:
 Crater_Depth  =  -0.0345  +  0.293*Square_Root(Crater_Diameter) –
    0.0065*Absolute_Value(Latitude) + 0.027*Circular_Outline -
    0.0355*Hummocky_Texture  - 0.0406*Rampart_Edge
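For reference, the fitted equation can be evaluated with a small function. The coefficients are copied from the equation above; the function name is my own, and the three morphology flags use the 1/2 coding from the DATA step (1 = feature present, 2 = absent).

```python
import math

def predict_crater_depth(diameter, latitude,
                         circular_outline, hummocky_texture, rampart_edge):
    """Evaluate the fitted LASSO equation for crater depth.

    Morphology flags are coded 1 (present) or 2 (absent), as in the DATA step.
    """
    return (-0.0345
            + 0.293 * math.sqrt(diameter)
            - 0.0065 * abs(latitude)
            + 0.027 * circular_outline
            - 0.0355 * hummocky_texture
            - 0.0406 * rampart_edge)
```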
Here is my SAS code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.marscrater_pds;
 /* exclude empty rows, and rows I won't use in analysis */
IF morphology_ejecta_1 ^= 'Rd'; /* exclude radial ejecta */
IF morphology_ejecta_1 ^= ' '; /* exclude empty rows in ejecta 1 */
IF morphology_ejecta_2 ^= ' '; /* exclude empty rows in ejecta 2 */
 /* create new variable to categorize ejecta_1 as rampart_edge or not */
/* this will be the binary categorical response variable */
IF FIND(morphology_ejecta_1,'R') GE 1 THEN rampart_edge = 1; /* rampart edge */
ELSE rampart_edge = 2; /* not a rampart_edge */
 /* create new explanatory variable to categorize ejecta_1 as circular or not */
IF FIND(morphology_ejecta_1,'C') GE 1 THEN circular_outline = 1; /* circular outline */
ELSE circular_outline = 2; /* not a circular outline */
 /* create new explanatory variable to categorize ejecta_2 by texture */
IF FIND(morphology_ejecta_2,'Hu') GE 1 THEN hummocky_texture = 1; /* hummocky lobes */
ELSE hummocky_texture = 2; /* not hummocky */
 * Latitude will also be an explanatory variable;
* Use absolute value of latitude, assuming relationship between latitude and depth is;
* symmetric about equator;
lat_abs = abs(latitude_circle_image);
  * testing shows diameter is not linearly related to depth;
* use square root instead;
diam_sqrt = sqrt(diam_circle_image);
 RUN;
 ODS GRAPHICS ON;  * turn on Output Delivery System graphics;
  * Randomly split dataset into testing and training datasets;
* 70% of data will be in training data,;
* Use Simple Random Sampling with seed 456;
* OUTALL creates output dataset with both train and test data together, with;
* a variable indicating which observations were used in training and which are in test set;
PROC SURVEYSELECT DATA=new OUT=traintest SEED=456 SAMPRATE=0.7 METHOD=srs OUTALL;
RUN;
  * Linear Regression modeling with LASSO (LAR);
* Produce plots associated with Lasso;
* Uses same seed as SURVEYSELECT procedure above;
* data is partitioned by ROLE, with 1=training and 0=testing;
* cross validation is performed for all variables using k-fold with 10 folds;
PROC GLMSELECT DATA=traintest PLOTS=all SEED=456;
PARTITION ROLE=selected(train='1' test='0');
MODEL depth_rimfloor_topog = diam_sqrt lat_abs circular_outline hummocky_texture rampart_edge/
SELECTION=LAR(choose=cv stop=none) CVMETHOD=random(10);
RUN;
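A rough scikit-learn analogue of the SURVEYSELECT and GLMSELECT steps above, sketched on synthetic stand-in data (the coefficients used to generate y below are illustrative, loosely based on the fitted model, not the actual crater data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(456)
X = rng.normal(size=(500, 5))            # 5 predictors, as in the model above
y = X @ np.array([0.3, -0.007, 0.03, -0.04, -0.04]) + rng.normal(0, 0.2, 500)

# 70/30 split, mirroring SAMPRATE=0.7 in PROC SURVEYSELECT
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=456)
# 10-fold cross-validation chooses the penalty, mirroring CVMETHOD=random(10)
model = LassoCV(cv=10, random_state=456).fit(X_tr, y_tr)
test_ase = mean_squared_error(y_te, model.predict(X_te))
```

As in PROC GLMSELECT, comparing `test_ase` with the training-set error gives a quick check for over- or underfitting.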
Machine Learning Assignment 2
This week I used a random forest to classify crater ejecta morphology as either having a rampart edge, or not, using latitude, longitude, crater diameter, crater depth, hummocky texture, and circular outline as explanatory variables. In this model, rampart edge is a categorical binary variable, coded as either 1 (rampart edge) or 2 (not rampart edge). Latitude, longitude, depth, and diameter are quantitative, and hummocky texture and circular outline are categorical.
 I used the default settings for the SAS HPFOREST procedure: 100 decision trees, 60% of the data used for training and 40% for testing, the Gini method for splitting nodes, 19,476 observations, and no pruning.
 The resulting model achieved an Out Of Bag Misclassification rate of 31.2%. This is a small improvement compared to the use of a single decision tree, which had an overall misclassification rate of 33% (see last week’s results).
 From this model we can see that the most important variables, those that contribute most to correctly classifying crater ejecta, are hummocky texture and circular outline. Thus, rampart edges most often occur in craters that have a hummocky texture and a circular outline. Crater depth, diameter, and location (lat. and long.) are much less associated with rampart edges and are not good predictors of crater edge type.
  * Machine Learning Assignment 2
* Use a random forest to classify a binary categorical response variable using;
* categorical or quantitative explanatory variables;
 LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA craters; set mydata.marscrater_pds;
 /* exclude empty rows, and rows I won't use in analysis */
IF morphology_ejecta_1 ^= 'Rd'; /* exclude radial ejecta */
IF morphology_ejecta_1 ^= ' '; /* exclude empty rows in ejecta 1 */
IF morphology_ejecta_2 ^= ' '; /* exclude empty rows in ejecta 2 */
 /* create new variable to categorize ejecta_1 as rampart_edge or not */
/* this will be the binary categorical response variable */
IF FIND(morphology_ejecta_1,'R') GE 1 THEN rampart_edge = 1; /* rampart edge */
ELSE rampart_edge = 2; /* not a rampart_edge */
 /* create new explanatory variable to categorize ejecta_1 as circular or not */
IF FIND(morphology_ejecta_1,'C') GE 1 THEN circular_outline = 1; /* circular outline */
ELSE circular_outline = 2; /* not a circular outline */
 /* create new explanatory variable to categorize ejecta_2 by texture */
IF FIND(morphology_ejecta_2,'Hu') GE 1 THEN hummocky_texture = 1; /* hummocky lobes */
ELSE hummocky_texture = 2; /* not hummocky */
RUN;
 PROC HPFOREST DATA=craters; * name the input data set explicitly;
TARGET rampart_edge/level=nominal;  * binary response variable;
INPUT circular_outline hummocky_texture/level=nominal; * binary explanatory variables;
INPUT latitude_circle_image longitude_circle_image
           diam_circle_image depth_rimfloor_topog/level=interval; * quantitative explanatory variables;
RUN;
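The HPFOREST step has a close scikit-learn counterpart. The snippet below is a sketch on synthetic stand-in data (column order and names follow the SAS code above): 100 trees, Gini splitting, and out-of-bag error tracking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.integers(1, 3, n),       # circular_outline (1/2)
    rng.integers(1, 3, n),       # hummocky_texture (1/2)
    rng.uniform(-90, 90, n),     # latitude_circle_image
    rng.uniform(0, 360, n),      # longitude_circle_image
    rng.uniform(1, 50, n),       # diam_circle_image
    rng.uniform(0, 3, n),        # depth_rimfloor_topog
])
y = rng.integers(1, 3, n)        # rampart_edge (1/2), the binary target

# 100 trees with Gini splitting, tracking out-of-bag error as HPFOREST does
forest = RandomForestClassifier(n_estimators=100, criterion="gini",
                                oob_score=True, random_state=1).fit(X, y)
oob_misclassification = 1 - forest.oob_score_
importances = forest.feature_importances_
```

`importances` corresponds to the Variable Importance output discussed above.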
Machine Learning Assignment 1
 This week I used a decision tree to classify crater ejecta as either having a rampart edge, or not, based on the crater’s location (latitude, longitude) and its diameter. In this model rampart edge is a binary categorical response variable (coded as either 1 or 2) and the other variables are quantitative explanatory variables.
 The resulting tree initially had 326 leaves before pruning but only 4 after, with a depth of only 3 and 2 branches. Pruning was done via cost complexity and branch splitting via entropy. The cost-complexity graph shows the ASE reaches its minimum at 4 leaves, hence only 4 leaves after pruning.
 The model accuracy is given in the confusion matrix and shows an error rate of only 15.6% for prediction of rampart edges, but an error rate of 56.7% for craters without rampart edges.
 The Variable Importance table shows that crater diameter was most predictive of rampart edge, while latitude was second-most important. Interestingly, longitude was not used in the final version of the tree and had no importance.
 The ROC curve plots the true positive rate against the false positive rate and has the characteristic shape indicating that the true positive rate decreases as the true negative rate increases.
 Lastly, the program was run several times using different seeds, providing the model with different training and test data sets. The resulting trees were all identical, having the same decision points, number of leaves, and structure. This indicates the model has low variance, because the tree has stable parameters over different training sets. However, the bias is high, as indicated by the 56.7% error rate for craters without rampart edges.
   * Machine Learning Assignment 1
* Use a decision tree to classify a binary categorical response variable using;
* categorical or quantitative explanatory variables;
 LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA craters; set mydata.marscrater_pds;
 /* exclude empty rows, and rows I won't use in analysis */
IF morphology_ejecta_1 ^= 'Rd'; /* exclude radial ejecta */
IF morphology_ejecta_1 ^= ' '; /* exclude empty rows in ejecta 1 */
IF morphology_ejecta_2 ^= ' '; /* exclude empty rows in ejecta 2 */
 /* create new variable to categorize ejecta_1 as rampart_edge or not */
/* this will be the binary categorical response variable */
IF FIND(morphology_ejecta_1,'R') GE 1 THEN rampart_edge = 1; /* rampart edge */
ELSE rampart_edge = 2; /* not a rampart_edge */
RUN;
 ODS GRAPHICS ON;  * turn on Output Delivery System graphics;
 * SAS decision tree procedure;
PROC HPSPLIT DATA=craters seed=12345;
CLASS rampart_edge; * only the binary response is categorical; the predictors are quantitative;
MODEL rampart_edge = latitude_circle_image longitude_circle_image diam_circle_image;
GROW entropy;
PRUNE costcomplexity;
RUN;
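The HPSPLIT step can likewise be sketched in scikit-learn, where `ccp_alpha` plays the role of cost-complexity pruning. Synthetic stand-in data are used, and the alpha value is illustrative, not the one SAS selected.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(12345)
n = 2000
X = np.column_stack([rng.uniform(-90, 90, n),    # latitude
                     rng.uniform(0, 360, n),     # longitude
                     rng.uniform(1, 50, n)])     # diameter
y = (X[:, 2] > 10).astype(int) + 1               # rampart_edge stand-in (1/2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=12345)
# entropy splitting (GROW entropy) with cost-complexity pruning (PRUNE costcomplexity)
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                              random_state=12345).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, tree.predict(X_te), labels=[1, 2])
```

The matrix `cm` corresponds to the confusion matrix used above to report the per-class error rates.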