Running a k-means Cluster Analysis
The k-means algorithm can be summarized as follows:
1. Specify the number of clusters (k) to be created (this is chosen by the analyst).
2. Randomly select k objects from the data set as the initial cluster centers (means).
3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares (see the formula below): repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. R uses 10 as the default maximum number of iterations.
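For reference (this formula is not in the original post, but it is the standard k-means objective), the quantity minimized in step 5 is

\[
\text{tot.withinss} \;=\; \sum_{j=1}^{k} W(C_j) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\]

where C_j is the set of observations in cluster j and \mu_j is the mean vector of those observations. This is the value reported as tot.withinss in the kmeans output shown below.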
Computing k-means clustering in R
We can compute k-means in R with the kmeans function. Here we will group the data into two clusters (centers = 2); in this example df appears to be the standardized USArrests data (e.g. df <- scale(USArrests)), which is why the cluster means below are on a z-score scale. The kmeans function also has an nstart option that attempts multiple initial configurations and reports the best one. For example, adding nstart = 25 will generate 25 initial configurations; this approach is often recommended.
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
## List of 9
##  $ cluster     : Named int [1:50] 1 1 1 2 1 1 2 2 1 1 ...
##   ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:2, 1:4] 1.005 -0.67 1.014 -0.676 0.198 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ totss       : num 196
##  $ withinss    : num [1:2] 46.7 56.1
##  $ tot.withinss: num 103
##  $ betweenss   : num 93.1
##  $ size        : int [1:2] 20 30
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
The output of kmeans is a list with several components. The most important are:
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
size: The number of points in each cluster.
If we print the results we'll see that our grouping produced two clusters of sizes 20 and 30. We see the cluster centers (means) for the two groups across the four variables (Murder, Assault, UrbanPop, Rape). We also get the cluster assignment for each observation (i.e. Alabama was assigned to cluster 1, Arkansas was assigned to cluster 2, etc.).
k2
## K-means clustering with 2 clusters of sizes 20, 30
##
## Cluster means:
##       Murder    Assault   UrbanPop       Rape
## 1  1.004934  1.0138274  0.1975853  0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              1              1              1              2              1
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              1              2              2              1              1
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              2              2              1              2              2
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              2              2              1              2              1
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              2              1              2              1              1
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey
##              2              2              1              2              2
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              1              1              1              2              2
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              2              2              2              2              1
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              2              1              1              2              2
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              2              2              2              2              2
##
## Within cluster sum of squares by cluster:
## [1] 46.74796 56.11445
##  (between_SS / total_SS =  47.5 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"
We can also view our results by using fviz_cluster. This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
fviz_cluster(k2, data = df)
Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()
Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Running a Lasso Regression Analysis
Linear Regression
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The model's errors are assumed to follow a Gaussian distribution, and the predictors are assumed not to be strongly correlated with one another (correlated predictors cause a problem called multicollinearity).
The linear regression equation can be expressed in the following form: y = a1x1 + a2x2 + a3x3 + … + anxn + b
In the above equation:
y is the target variable.
x1, x2, x3, ... xn are the features.
a1, a2, a3, ... an are the coefficients.
b is the intercept of the model.
The coefficients a1, ..., an and the intercept b are estimated by the ordinary least squares (OLS) method, which minimizes the sum of squared residuals (actual value minus predicted value).
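For reference (the explicit formula is not in the original post, but this is the standard OLS criterion), with m training observations the coefficients are chosen to minimize the residual sum of squares:

\[
\text{RSS}(a, b) \;=\; \sum_{i=1}^{m} \Bigl( y_i - b - \sum_{j=1}^{n} a_j x_{ij} \Bigr)^2 .
\]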
To fit the linear regression model, the first line of code below fits the model using the lm() function. The second line prints a summary of the fitted model.
lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train)
summary(lr)
Output:
Call:
lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4262 -0.7253  0.0278  0.6697  3.2753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.79077    0.04712 165.352  < 2e-16 ***
uempmed      2.18021    0.08588  25.386  < 2e-16 ***
psavert      0.79126    0.13244   5.975 5.14e-09 ***
pop          5.95419    0.37405  15.918  < 2e-16 ***
pce         -5.31578    0.32753 -16.230  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9435 on 396 degrees of freedom
Multiple R-squared:  0.8542, Adjusted R-squared:  0.8527
F-statistic: 579.9 on 4 and 396 DF,  p-value: < 2.2e-16
The significance codes (‘***’) in the output above show that all of the features are significant predictors. The adjusted R-squared value of 0.8527 is also a good result. Let's evaluate the model further.
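Since the title of this post refers to lasso regression, it is worth spelling out how the lasso relates to the OLS model fitted above (this note is standard material, not from the original post): the lasso adds an L1 penalty on the coefficients to the least-squares criterion,

\[
\min_{a,\,b} \; \sum_{i=1}^{m} \Bigl( y_i - b - \sum_{j=1}^{n} a_j x_{ij} \Bigr)^2 \;+\; \lambda \sum_{j=1}^{n} \lvert a_j \rvert ,
\]

where λ ≥ 0 controls the amount of shrinkage. The penalty pushes some coefficients exactly to zero, so the lasso performs variable selection as well as regularization; in R it is commonly fit with the glmnet package.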
Running a Random Forest
The other main concept in the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. For classification this is generally set to sqrt(n_features), meaning that if there are 16 features, only 4 randomly chosen features are considered at each node when splitting. (The random forest can also be trained considering all the features at every node, as is common in regression. These options can be controlled in the Scikit-Learn random forest implementation.)
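As a brief illustration (a sketch, not code from the original post), this is how the feature-subsetting behaviour described above is typically set in scikit-learn:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(n_features) candidate features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Regression: consider all features at each split (max_features=None means "all")
reg = RandomForestRegressor(n_estimators=100, max_features=None, random_state=0)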
If you can comprehend a single decision tree, the idea of bagging, and random subsets of features, then you have a pretty good understanding of how a random forest works:
The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of the individual trees.
To understand why a random forest is better than a single decision tree, imagine the following scenario: you have to decide whether Tesla stock will go up, and you have access to a dozen analysts who have no prior knowledge about the company. Each analyst has low bias because they don't come in with any assumptions, and each is allowed to learn from a dataset of news reports.
This might seem like an ideal situation, but the problem is that the reports are likely to contain noise in addition to real signals. Because the analysts are basing their predictions entirely on the data — they have high flexibility — they can be swayed by irrelevant information. The analysts might come up with differing predictions from the same dataset. Moreover, each individual analyst has high variance and would come up with drastically different predictions if given a different training set of reports.
The solution is to not rely on any one individual, but pool the votes of each analyst. Furthermore, like in a random forest, allow each analyst access to only a section of the reports and hope the effects of the noisy information will be cancelled out by the sampling. In real life, we rely on multiple sources (never trust a solitary Amazon review), and therefore, not only is a decision tree intuitive, but so is the idea of combining them in a random forest.
Random Forest in Practice
Next, we'll build a random forest in Python using Scikit-Learn. Rather than a toy problem, we'll use a real-world dataset split into a training and testing set. We use the test set as an estimate of how the model will perform on new data, which also lets us determine how much the model is overfitting.
Dataset
The problem we’ll solve is a binary classification task with the goal of predicting an individual’s health. The features are socioeconomic and lifestyle characteristics of individuals and the label is 0 for poor health and 1 for good health. This dataset was collected by the Centers for Disease Control and Prevention and is available here.
Generally, 80% of a data science project is spent cleaning, exploring, and making features out of the data. However, for this article, we’ll stick to the modeling. (For details of the other steps, look at this article).
This is an imbalanced classification problem, so accuracy is not an appropriate metric. Instead we'll measure the Receiver Operating Characteristic Area Under the Curve (ROC AUC), a measure from 0 (worst) to 1 (best) with a random guess scoring 0.5. We can also plot the ROC curve to assess a model.
The notebook contains the implementation for both the decision tree and the random forest, but here we’ll just focus on the random forest. After reading in the data, we can instantiate and train a random forest as follows:
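The training code itself did not come through in this export. Below is a minimal sketch of what it plausibly looks like, assuming the cleaned data has already been split into feature arrays train / test with matching label arrays train_labels / test_labels (these variable names and hyperparameter values are assumptions, not taken from the original notebook):

from sklearn.ensemble import RandomForestClassifier

# 100 trees, sqrt(n_features) candidate features per split, all CPU cores
model = RandomForestClassifier(n_estimators=100,
                               max_features="sqrt",
                               n_jobs=-1,
                               random_state=50)
model.fit(train, train_labels)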
After a few minutes of training, the model is ready to make predictions on the testing data. We make class predictions (predict) as well as predicted probabilities (predict_proba) in order to calculate the ROC AUC; a sketch of these steps is shown below.
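A sketch of the prediction and evaluation steps just described, using the same assumed test / test_labels names (roc_auc_score is the scikit-learn function for this metric):

from sklearn.metrics import roc_auc_score

# Hard class predictions and predicted probabilities for the positive class
predictions = model.predict(test)
probs = model.predict_proba(test)[:, 1]

# ROC AUC is computed from the predicted probabilities, not the hard predictions
print(f"ROC AUC: {roc_auc_score(test_labels, probs):.3f}")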
Results
Running a Classification Tree
With the increase in the use of Machine Learning algorithms to solve industry-level problems, the demand for more complex and iterative algorithms has grown. The Decision Tree Algorithm is one such algorithm, used to solve both regression and classification problems.
In this blog on Decision Tree Algorithm, you will learn the working of Decision Tree and how it can be implemented to solve real-world problems. The following topics will be covered in this blog:
Why Decision Tree?
What Is A Decision Tree?
How Does The Decision Tree Algorithm Work?
Building A Decision Tree
Practical Implementation Of Decision Tree Algorithm Using R
Before I get started with why use Decision Tree, here’s a list of Machine Learning blogs that you should go through to understand the basics:
Machine Learning Algorithms
Introduction To Classification Algorithms
Random Forest Classifier
We're all aware that there are a large number of Machine Learning algorithms that can be used for analysis, so why should you choose a Decision Tree? I've listed a few reasons in the section below.
Why Decision Tree Algorithm?
A Decision Tree is considered one of the most useful Machine Learning algorithms, since it can be used to solve a variety of problems. Here are a few reasons why you should use a Decision Tree:
It is one of the most understandable Machine Learning algorithms and can be easily interpreted.
It can be used for classification and regression problems.
Unlike most Machine Learning algorithms, it works effectively with non-linear data.
Constructing a Decision Tree is a very quick process since it uses only one feature per node to split the data.
What Is A Decision Tree Algorithm?
A Decision Tree is a Supervised Machine Learning algorithm that looks like an inverted tree, wherein each internal node represents a predictor variable (feature), each link between nodes represents a decision (split), and each leaf node represents an outcome (a value of the response variable).
To get a better understanding of a Decision Tree, let’s look at an example:
Let’s say that you hosted a huge party and you want to know how many of your guests were non-vegetarians. To solve this problem, let’s create a simple Decision Tree.
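The worked example stops here in this export, and the post promises an R implementation later on. Purely as an illustration of the idea, here is a minimal sketch in Python with scikit-learn (a substitution for the missing example, with invented guest features and data, not the author's code):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical guest data: [age, ordered_meat_dish (0/1)] -> non-vegetarian? (1/0)
X = [[25, 1], [32, 0], [41, 1], [19, 0], [55, 1], [28, 0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned splits: each internal node tests one feature,
# each leaf predicts a class (non-vegetarian or not)
print(export_text(tree, feature_names=["age", "ordered_meat_dish"]))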