K-means Cluster Analysis
Cluster analysis is an unsupervised learning method. The goal of cluster analysis is to group or cluster observations into subsets based on the similarity of responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters.
Our end goal is to obtain clusters with low variance within each cluster and high variance between clusters.
Let's look at how to implement k-means clustering with python step by step. We will be using the mall customer dataset, which has the columns Age, Annual Income and Spending Score.
1. First we load the dataset into a pandas dataframe and split the dataset into train and test sets.
clus_train, clus_test = train_test_split(df, test_size=.3, random_state=123)
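The split above assumes a dataframe df and a few imports. A minimal setup sketch (the file name and column labels follow the common Kaggle version of this dataset, so treat them as assumptions):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Load the mall customer dataset and keep the three clustering variables
df = pd.read_csv('Mall_Customers.csv')
df = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]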
2. Now, since we do not know how many clusters we should use, we apply the Elbow Method to identify how many clusters to choose, as below.
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]
for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign=model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
# Plot the average distance of observations from the cluster centroid to use the Elbow Method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
In the resulting plot we see an elbow when the number of clusters is 3. There is also an elbow at 6 clusters. The best approach is to perform the analysis for both 3 and 6 clusters and, after checking the accuracies and plots, decide which number of clusters performs best.
For this example we will take 3 clusters.
3. Let’s plot the scatter plot and see how the clusters are distributed.
model=KMeans(n_clusters=3)
model.fit(clus_train)
clusassign=model.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
Above is the resulting scatter plot.
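To interpret the clusters beyond the scatter plot, we can attach the cluster labels back to the training data and look at the mean of each variable per cluster (a minimal sketch):
# Profile the clusters: average of each variable per cluster
clus_train = clus_train.copy()
clus_train['cluster'] = model.labels_
print(clus_train.groupby('cluster').mean())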
Lasso Regression
Lasso regression is a supervised learning method. LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso imposes a constraint on the model parameters that causes the regression coefficients of some variables to shrink toward zero. This allows us to identify the variables most strongly associated with the target variable by effectively removing unimportant variables from the model.
Let's try to implement a lasso regression model with python step by step. We will be using the diabetes dataset of scikit-learn.
1. Split your dataset into train and test sets. You can easily split the diabetes dataset as below. (Conveniently, scikit-learn's diabetes features come already mean-centered and scaled, which matters because lasso penalizes all coefficients on the same scale.)
from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
y_train = diabetes_y[:-20]
y_test = diabetes_y[-20:]
2. Next we create an object called model that will contain the results of the lasso regression. In parentheses, cv=10 asks Python to use k-fold cross-validation with 10 random folds from the training dataset to choose the final statistical model. To fit the lasso regression on the training set we use .fit.
from sklearn.linear_model import LassoLarsCV
model=LassoLarsCV(cv=10, precompute=False).fit(X_train,y_train)
3. Next let’s go ahead and ask python to print the regression coefficients from the model.
model.coef_
output:
array([ 0. , -194.57937837, 514.33452188, 302.88969517, -101.37105587, 0. , -234.57718479, 0. , 498.24639338, 66.14806771])
As you can see, 3 coefficients have shrunk to zero, indicating that they are unimportant variables. These variables are age, s2 and s4. The body mass index and the fifth blood serum measurement (s5) have the largest coefficients.
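To see which coefficient belongs to which variable, we can pair the coefficients with the feature names that scikit-learn exposes for this dataset (a small sketch):
# Pair each coefficient with its feature name
feature_names = datasets.load_diabetes().feature_names
for name, coef in zip(feature_names, model.coef_):
    print(name, coef)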
4. Now we can plot the relative importance of the predictors selected during the estimation process: how the coefficients changed as variables were added, and at which step each new variable entered the model.
import numpy as np
import matplotlib.pyplot as plt
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
Here the green line represents the body mass index, which has the highest regression coefficient value. The yellow line is s5.
5. Another important plot shows how the mean squared error changes as the penalty parameter alpha changes at each step.
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
We can see that there is variability across the individual cross-validation folds in the training data set, but the change in the mean square error as variables are added to the model follows the same pattern for each fold. Initially it decreases rapidly and then levels off to a point at which adding more predictors doesn't lead to much reduction in the mean square error. This is to be expected as model complexity increases.
6. We can also print the mean squared error and the R-square (the proportion of variance in the target explained by the model) for both the training and test data.
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print ('training data MSE - ', train_error)
print ('test data MSE - ', test_error)
# R-square from training and test data
rsquared_train=model.score(X_train,y_train)
rsquared_test=model.score(X_test,y_test)
print ('training data R-square',rsquared_train)
print ('test data R-square',rsquared_test)
7. At each step of the estimation process, when a new predictor enters the model, the mean squared error is calculated on each validation fold after training on the remaining nine folds, and these errors are averaged. The model that produces the lowest average mean squared error is selected by Python as the best model to validate using the test dataset.
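We can also print the penalty value that cross-validation settled on; LassoLarsCV stores it in the alpha_ attribute:
# The alpha selected by 10-fold cross-validation
print('selected alpha -', model.alpha_)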
Building a Random Forest with python
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
Let's see how we can build a random forest classifier using python.
1. Load your data set (say your dataset is in a csv file) into a pandas data frame.
import pandas as pd

df = pd.read_csv("tree_addhealth.csv")
2. Because decision tree analyses cannot handle any NAs in the data set, the next step is to create a clean data frame that drops all NAs.
df = df.dropna()
3. After setting the explanatory and target variables we need to split the dataset into train and test sets, as in the sketch below.
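A minimal sketch of this step; 'TREG1' is a placeholder for whatever target column your csv contains, with the remaining columns used as predictors:
from sklearn.model_selection import train_test_split

# 'TREG1' is a hypothetical target column name
predictors = df.drop(columns=['TREG1'])
targets = df['TREG1']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)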
4. Then you can train your classifier and do predictions on the test set.
from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
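To gauge how well the forest predicts, we can compare its predictions with the true test labels; a minimal sketch using scikit-learn's metrics module:
from sklearn import metrics

print(metrics.confusion_matrix(tar_test, predictions))
print(metrics.accuracy_score(tar_test, predictions))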
5. We can also run the forest with a different number of trees, see the effect on the accuracy of the prediction, and plot the results, as sketched below.
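A minimal sketch, reusing the train/test split from above:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

# Fit a forest for each tree count and record test accuracy
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n).fit(pred_train, tar_train)
    accuracy[idx] = metrics.accuracy_score(tar_test, classifier.predict(pred_test))

plt.plot(trees, accuracy)
plt.xlabel('Number of trees')
plt.ylabel('Test accuracy')
plt.show()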
Running a classification tree with python
Decision trees are predictive models that allow for a data driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making.
How to build a decision tree with python
1. Split your dataset into train and test sets
from sklearn.model_selection import train_test_split

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
2. Import sklearn's DecisionTreeClassifier, then fit it on your train dataset
from sklearn.tree import DecisionTreeClassifier

classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
3. Finally you can do predictions on your test dataset
predictions=classifier.predict(pred_test)
4. You can also visualize your tree with Graphviz and pydotplus. A minimal sketch (it assumes tree from sklearn, StringIO from io, Image from IPython.display and pydotplus are imported, and that you are working in a notebook):
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
Above is the obtained decision tree.