statistics-practice - Tumblr blog

statistics-practice · 4 years ago

Text

k-means cluster analysis

The data was collected from the GUESSS study, which is a survey that measures entrepreneurial intention, entrepreneurial attitude, intention to become a successor of the parents' business, among other things, in young university students. As the objective of this exercise was to identify some characteristics that could have an impact on the decision to become a successor of the parents' business, the database was cleaned based on the responses of young university students whose parents had businesses. In addition, rows that were incomplete were eliminated, leaving a total of 3543 observations.

A k-means cluster analysis was conducted to identify underlying subgroups of potencial successors in the parents’ business in the future based on their similarity of responses on 15 variables (10 quantitative and 5 categorical).

Quantitative predictors:

years: The number of years since the business was established (discrete).

att: attitude (average of items measured in Likert scale from 1 – strongly disagree to 7 – strongly agree).

sub_nor: Subjective norms attitude (same as attitude).

sel_efi: Self-efficacy (same as attitude).

Afe_com: Affective commitment (same as attitude).

Nor_com: Normative commitment (same as attitude).

ins_ass: instrumental assistance (same as attitude).

car_rel: Career-related modeling (same as attitude).

Ver_enc: verbal encouragement (same as attitude).

emo_sup: emotional support (same as attitude).

Categorical predictors:

leader: Is your father or your mother leading the business operationally? (binary, 0: no and 1: yes).

bin_own_par: percentage of ownership share that is in the hands of your family (binary, 0: <= 50% and 1: > 50%).

bin_own_stu: personal ownership share in the business (binary, 0: <= 50% and 1: > 50%).

fam_bus: Do you regard the business as a "family business"? (binary, 0: no and 1: yes).

wor_: Have you been working for your parents’ business? (binary, 0: no and 1: yes).

All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The seed was established in 123.

Data were randomly split into a training set that included 70% of the observations (N=2467) and a test set that included 30% of the observations (N=1056).

A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the nine cluster solutions

The elbow curve suggested that the 2, 3 and 6-cluster solutions might be interpreted.

I chose the interpretation of the 3-cluster solution. Canonical discriminant analyses was used to reduce the 15 clustering variable down a few variables that accounted for most of the variance in the clustering variables.

Figure 2 shows a scatterplot of the first two canonical variables by cluster. This figure indicated that the observations in cluster 1 and 3 seem to overlap and therefore, there is no distinction between these two clusters. Clusters 1 and 2 clearly differ and it appears that the variance within these clusters is very small.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

The means on the 3 clusters showed that all negative means, without exception, are found in cluster 1, unlike the other two clusters. It appears that if a person answered toward the lowest values of one variable, he or she most likely did the same for the rest.

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducting to test for significant differences between the clusters on intention to become a successor in the parents’ business in the future (INT). A tukey test was used for post hoc comparisons between the clusters. Figure 3 shows boxplots. Results indicated significant differences between the clusters on INT (F(2, 2478)=638.37 p<.0001). The tukey post hoc comparisons showed significant differences between clusters 1 and 2 with a confidence interval for the differences of means (mean 2 – mean 1) of (2.12653, 2.42529). Students in 2 had the highest INT (mean=4.24, sd=1.82), and cluster 1 had the lowest INT (mean=1.96, sd=1.14).

Figure 3. Boxplots for the three clusters

SAS code:

PROC IMPORT DATAFILE ="/home/u58587038/Week4/data.csv" OUT = imported REPLACE; RUN; data clust; set imported; * create a unique identifier to merge cluster assignment variable with the main data set; idnum=_n_; keep idnum int years leader bin_own_par bin_own_stu fam_bus wor_ att sub_nor sel_efi Afe_com Nor_com ins_ass car_rel Ver_enc emo_sup; * delete observations with missing data; if cmiss(of _all_) then delete; run; ods graphics on; * Split data randomly into test and training data; proc surveyselect data=clust out=traintest seed = 123 samprate=0.7 method=srs outall; run; data clus_train; set traintest; if selected=1; run; data clus_test; set traintest; if selected=0; run; * standardize the clustering variables to have a mean of 0 and standard deviation of 1; proc standard data=clus_train out=clustvar mean=0 std=1; var years leader bin_own_par bin_own_stu fam_bus wor_ att sub_nor sel_efi Afe_com Nor_com ins_ass car_rel Ver_enc emo_sup; run; %macro kmean(K); proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300; var years leader bin_own_par bin_own_stu fam_bus wor_ att sub_nor sel_efi Afe_com Nor_com ins_ass car_rel Ver_enc emo_sup; run; %mend; %kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5); %kmean(6); %kmean(7); %kmean(8); %kmean(9); * extract r-square values from each cluster solution and then merge them to plot elbow curve; data clus1; set cluststat1; nclust=1; if _type_='RSQ'; keep nclust over_all; run; data clus2; set cluststat2; nclust=2; if _type_='RSQ'; keep nclust over_all; run; data clus3; set cluststat3; nclust=3; if _type_='RSQ'; keep nclust over_all; run; data clus4; set cluststat4; nclust=4; if _type_='RSQ'; keep nclust over_all; run; data clus5; set cluststat5; nclust=5; if _type_='RSQ'; keep nclust over_all; run; data clus6; set cluststat6; nclust=6; if _type_='RSQ'; keep nclust over_all; run; data clus7; set cluststat7; nclust=7; if _type_='RSQ'; keep nclust over_all; run; data clus8; set cluststat8; nclust=8; if _type_='RSQ'; keep nclust over_all; run; data clus9; set cluststat9; nclust=9; if _type_='RSQ'; keep nclust over_all; run; data clusrsquare; set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9; run; * plot elbow curve using r-square values; symbol1 color=blue interpol=join; proc gplot data=clusrsquare; plot over_all*nclust; run; ***************************************************************************************** Analysis of 3 cluster solution ***************************************************************************************** * plot clusters for 3 cluster solution; proc candisc data=outdata3 out=clustcan; class cluster; var years leader bin_own_par bin_own_stu fam_bus wor_ att sub_nor sel_efi Afe_com Nor_com ins_ass car_rel Ver_enc emo_sup; run; proc sgplot data=clustcan; scatter y=can2 x=can1 / group=cluster; run; * validate clusters on intention; * first merge clustering variable and assignment data with intention data; data int_data; set clus_train; keep idnum int; run; proc sort data=outdata3; by idnum; run; proc sort data=int_data; by idnum; run; data merged; merge outdata3 int_data; by idnum; run; proc sort data=merged; by cluster; run; proc means data=merged; var int; by cluster; run; proc anova data=merged; class cluster; model int = cluster; means cluster/tukey; run;

0 notes

statistics-practice · 4 years ago

Text

Running a lasso regression

SAS code:

PROC IMPORT DATAFILE ='/home/u58587038/Week3/pot.csv' OUT = imported REPLACE;

RUN;

data new;

set imported;

ods graphics on;

* Split data randomly into test and training data;

proc surveyselect data=new out=traintest seed = 123

samprate=0.7 method=srs outall;

run;

* lasso multiple regression with lars algorithm k=10 fold validation;

proc glmselect data=traintest plots=all seed=123;

partition ROLE=selected(train='1' test='0');

model int = years N_emp leader bin_own_par bin_own_stu fam_bus

wor_ bin_sib bin_att bin_sub_nor bin_sel_efi bin_Afe_com

bin_Nor_com bin_ins_ass bin_car_rel bin_Ver_enc bin_emo_sup/selection=lar(choose=cv stop=none) cvmethod=random(10);

run;

The aim of running a lasso regression was identifying a subset of variable from 18 categorial and quantitative independent variables. The target variable was intention to become a successor in the parents’ business in the future, which was measured as the mean of 6 Likert-scale questions (An example of questions is: Please indicate your level of agreement with the following statements (1=strongly disagree, 7=strongly agree). - I am ready to do anything to take over my parents’ business).

Categorial predictors:

leader: Is your father or your mother leading the business operationally? (binary, 0: no and 1: yes).

bin_own_par: percentage of ownership share that is in the hands of your family (binary, 0: <= 50% and 1: > 50%).

bin_own_stu: personal ownership share in the business (binary, 0: <= 50% and 1: > 50%).

fam_bus: Do you regard the business as a "family business"? (binary, 0: no and 1: yes).

wor_: Have you been working for your parents’ business? (binary, 0: no and 1: yes).

bin_sib: How many older siblings do you have? (binary: 0: <= 1 and 1: > 1).

bin_att: attitude (average of items measured in Likert scale from 1 – strongly disagree to 7 – strongly agree. 0: average <=3 and 1: average > 3).

bin_sub_nor: Subjective norms attitude (same as attitude).

bin_sel_efi: Self-efficacy (same as attitude).

bin_Afe_com: Affective commitment (same as attitude).

bin_Nor_com: Normative commitment (same as attitude).

bin_ins_ass: instrumental assistance (same as attitude).

bin_car_rel: Career-related modeling (same as attitude).

bin_Ver_enc: verbal encouragement (same as attitude).

bin_emo_sup: emotional support (same as attitude).

Quantitative predictors:

N_emp: number of employees (discrete).

years: The number of years since the business was established (discrete).

All predictor variables were standardized to have a mean of zero and a standard deviation of one. This procedure is implemented in glmselect in SAS.

Table 1: Surveyselect procedure

The seed was established in 123.

The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set.

Table 2: GLMSELECT procedure

The number of observations read and used from my data set was 3523 (I previously eliminated the rows with at least one missing answer).

Data were randomly split into a training set that included 70% of the observations (N=2467) and a test set that included 30% of the observations (N=1056).

Table 3: Number of observations read and used

The most important variables for predicting the intention to become a successor to the parent's business were bin_att, bin_Afe_com, bin_Nor_com, bin_emo_sup, bin_ins_ass, fam_bus. Of the 18 variables initially considered, 11 were finally retained by the model with a cross validation average (mean) squared error (CV PRESS) of 3776.7094. From variable number 12 onwards, the CV PRESS starts to increase.

Table 4: Variables retained by the model

The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step

Approximately 58% of the proportion of variation in the target variable (intention to become a successor) is explained by the model.

Table 5: goodness of fit statistics

Finally, we have the table with the estimates of the coefficients of the parameters.

Table 6: Estimates of the coefficients of the parameters

0 notes

statistics-practice · 4 years ago

Text

Random forests

Code SAS:

PROC IMPORT DATAFILE ='/home/u58587038/Week1/Pot_suc_v5.csv' OUT = imported REPLACE;

RUN;

data new;

set imported;

PROC HPFOREST;

target bin_int/level=nominal;

input leader bin_own_par bin_own_stu fam_bus

wor_ bin_sib bin_att bin_sub_nor bin_sel_efi bin_Afe_com

bin_Nor_com bin_ins_ass bin_car_rel bin_Ver_enc bin_emo_sup/level=nominal;

RUN;

The goal was to evaluate of a series of explanatory variables in predicting a binary response variable: intention to become a successor in the parents’ business in the future (bin_int). bin_int = 1, if person intends to and 2 if not.

The following explanatory variables were included in the model:

· years: years since the parents’s business was established (discrete).

· N_emp: number of employees (discrete).

· leader: Is your father or your mother leading the business operationally? (binary, 0: no and 1: yes).

· bin_own_par: percentage of ownership share that is in the hands of your family (binary, 0: <= 50% and 1: > 50%).

· bin_own_stu: personal ownership share in the business (binary, 0: <= 50% and 1: > 50%).

· fam_bus: Do you regard the business as a "family business"? (binary, 0: no and 1: yes).

· wor_: Have you been working for your parents’ business? (binary, 0: no and 1: yes).

· bin_sib: How many older siblings do you have? (binary: 0: <= 1 and 1: > 1).

· bin_att: attitude (average of items measured in Likert scale from 1 – strongly disagree to 7 – strongly agree. 0: average <=3 and 1: average > 3).

· bin_sub_nor: Subjective norms attitude (same as attitude).

· bin_sel_efi: Self-efficacy (same as attitude).

· bin_Afe_com: Affective commitment (same as attitude).

· bin_Nor_com: Normative commitment (same as attitude).

· bin_ins_ass: instrumental assistance (same as attitude).

· bin_car_rel: Career-related modeling (same as attitude).

· bin_Ver_enc: verbal encouragement (same as attitude).

· bin_emo_sup: emotional support (same as attitude).

Interpretation of the model:

· The number of observations read from my data set was 4881.

· The number of observations used was 4744.

· The explanatory variables with the highest relative importance scores were: bin_att, bin_Afe_com, bin_Nor_com, bin_emo_sup, bin_ins_ass and bin_sub_nor.

· The accuracy of the random forest was: 52%. we realize that the model does not do a good job classifying, since almost 50% classify well and 50% classify poorly.

Output:

0 notes

statistics-practice · 4 years ago

Text

Running a Classification Tree

Variables in the model:

· bin_int (target variable): intention to become a successor in the parents’ business in the future (binary, 1: yes and 2: no).