Running a k-means Cluster Analysis
The k-means algorithm can be summarized as follows:
1. Specify the number of clusters (k) to be created (this is chosen by the analyst).
2. Randomly select k objects from the data set as the initial cluster centers (means).
3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares (see the formula below): repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. R uses 10 as the default maximum number of iterations.
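For reference (this formula is not in the original post, but it is the standard k-means objective), the quantity minimized in step 5 is

\[
\text{tot.withinss} \;=\; \sum_{j=1}^{k} W(C_j) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\]

where C_j is the set of observations in cluster j and \mu_j is the mean vector of those observations. This is the value reported as tot.withinss in the kmeans output shown below.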
Computing k-means clustering in R
We can compute k-means in R with the kmeans function. Here we will group the data into two clusters (centers = 2); in this example df appears to be the standardized USArrests data (e.g. df <- scale(USArrests)), which is why the cluster means below are on a z-score scale. The kmeans function also has an nstart option that attempts multiple initial configurations and reports the best one. For example, adding nstart = 25 will generate 25 initial configurations; this approach is often recommended.
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
## List of 9
##  $ cluster     : Named int [1:50] 1 1 1 2 1 1 2 2 1 1 ...
##   ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:2, 1:4] 1.005 -0.67 1.014 -0.676 0.198 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ totss       : num 196
##  $ withinss    : num [1:2] 46.7 56.1
##  $ tot.withinss: num 103
##  $ betweenss   : num 93.1
##  $ size        : int [1:2] 20 30
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
The output of kmeans is a list with several components. The most important are:
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
size: The number of points in each cluster.
If we print the results we'll see that our grouping produced two clusters of sizes 20 and 30. We see the cluster centers (means) for the two groups across the four variables (Murder, Assault, UrbanPop, Rape). We also get the cluster assignment for each observation (i.e. Alabama was assigned to cluster 1, Arkansas was assigned to cluster 2, etc.).
k2
## K-means clustering with 2 clusters of sizes 20, 30
##
## Cluster means:
##       Murder    Assault   UrbanPop       Rape
## 1  1.004934  1.0138274  0.1975853  0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              1              1              1              2              1
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              1              2              2              1              1
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              2              2              1              2              2
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              2              2              1              2              1
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              2              1              2              1              1
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey
##              2              2              1              2              2
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              1              1              1              2              2
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              2              2              2              2              1
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              2              1              1              2              2
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              2              2              2              2              2
##
## Within cluster sum of squares by cluster:
## [1] 46.74796 56.11445
##  (between_SS / total_SS =  47.5 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"
We can also view our results by using fviz_cluster. This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
fviz_cluster(k2, data = df)
Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()
Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Running a Lasso Regression Analysis
Linear Regression
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The model's errors are assumed to follow a Gaussian distribution, and the predictors are assumed not to be strongly correlated with one another (correlated predictors cause a problem called multicollinearity).
The linear regression equation can be expressed in the following form: y = a1x1 + a2x2 + a3x3 + … + anxn + b
In the above equation:
y is the target variable.
x1, x2, x3, ... xn are the features.
a1, a2, a3, ... an are the coefficients.
b is the intercept of the model.
The coefficients a1, ..., an and the intercept b are estimated by the ordinary least squares (OLS) method, which minimizes the sum of squared residuals (actual value minus predicted value).
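For reference (the explicit formula is not in the original post, but this is the standard OLS criterion), with m training observations the coefficients are chosen to minimize the residual sum of squares:

\[
\text{RSS}(a, b) \;=\; \sum_{i=1}^{m} \Bigl( y_i - b - \sum_{j=1}^{n} a_j x_{ij} \Bigr)^2 .
\]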
To fit the linear regression model, the first line of code below fits the model using the lm() function. The second line prints a summary of the fitted model.
lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train)
summary(lr)
Output:
Call:
lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4262 -0.7253  0.0278  0.6697  3.2753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.79077    0.04712 165.352  < 2e-16 ***
uempmed      2.18021    0.08588  25.386  < 2e-16 ***
psavert      0.79126    0.13244   5.975 5.14e-09 ***
pop          5.95419    0.37405  15.918  < 2e-16 ***
pce         -5.31578    0.32753 -16.230  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9435 on 396 degrees of freedom
Multiple R-squared:  0.8542, Adjusted R-squared:  0.8527
F-statistic: 579.9 on 4 and 396 DF,  p-value: < 2.2e-16
The significance codes (‘***’) in the output above show that all of the features are significant predictors. The adjusted R-squared value of 0.8527 is also a good result. Let's evaluate the model further.
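Since the title of this post refers to lasso regression, it is worth spelling out how the lasso relates to the OLS model fitted above (this note is standard material, not from the original post): the lasso adds an L1 penalty on the coefficients to the least-squares criterion,

\[
\min_{a,\,b} \; \sum_{i=1}^{m} \Bigl( y_i - b - \sum_{j=1}^{n} a_j x_{ij} \Bigr)^2 \;+\; \lambda \sum_{j=1}^{n} \lvert a_j \rvert ,
\]

where λ ≥ 0 controls the amount of shrinkage. The penalty pushes some coefficients exactly to zero, so the lasso performs variable selection as well as regularization; in R it is commonly fit with the glmnet package.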
Running a Random Forest
The other main concept in the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. For classification this is generally set to sqrt(n_features), meaning that if there are 16 features, only 4 randomly chosen features are considered at each node when splitting. (The random forest can also be trained considering all the features at every node, as is common in regression. These options can be controlled in the Scikit-Learn random forest implementation.)
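As a brief illustration (a sketch, not code from the original post), this is how the feature-subsetting behaviour described above is typically set in scikit-learn:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(n_features) candidate features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Regression: consider all features at each split (max_features=None means "all")
reg = RandomForestRegressor(n_estimators=100, max_features=None, random_state=0)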
If you can comprehend a single decision tree, the idea of bagging, and random subsets of features, then you have a pretty good understanding of how a random forest works:
The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of the individual trees.
To understand why a random forest is better than a single decision tree, imagine the following scenario: you have to decide whether Tesla stock will go up, and you have access to a dozen analysts who have no prior knowledge about the company. Each analyst has low bias because they don't come in with any assumptions, and each is allowed to learn from a dataset of news reports.
This might seem like an ideal situation, but the problem is that the reports are likely to contain noise in addition to real signals. Because the analysts are basing their predictions entirely on the data — they have high flexibility — they can be swayed by irrelevant information. The analysts might come up with differing predictions from the same dataset. Moreover, each individual analyst has high variance and would come up with drastically different predictions if given a different training set of reports.
The solution is to not rely on any one individual, but pool the votes of each analyst. Furthermore, like in a random forest, allow each analyst access to only a section of the reports and hope the effects of the noisy information will be cancelled out by the sampling. In real life, we rely on multiple sources (never trust a solitary Amazon review), and therefore, not only is a decision tree intuitive, but so is the idea of combining them in a random forest.
Random Forest in Practice
Next, we'll build a random forest in Python using Scikit-Learn. Rather than a toy problem, we'll use a real-world dataset split into a training and testing set. We use the test set as an estimate of how the model will perform on new data, which also lets us determine how much the model is overfitting.
Dataset
The problem we’ll solve is a binary classification task with the goal of predicting an individual’s health. The features are socioeconomic and lifestyle characteristics of individuals and the label is 0 for poor health and 1 for good health. This dataset was collected by the Centers for Disease Control and Prevention and is available here.
Generally, 80% of a data science project is spent cleaning, exploring, and making features out of the data. However, for this article, we’ll stick to the modeling. (For details of the other steps, look at this article).
This is an imbalanced classification problem, so accuracy is not an appropriate metric. Instead we'll measure the Receiver Operating Characteristic Area Under the Curve (ROC AUC), a measure from 0 (worst) to 1 (best) with a random guess scoring 0.5. We can also plot the ROC curve to assess a model.
The notebook contains the implementation for both the decision tree and the random forest, but here we’ll just focus on the random forest. After reading in the data, we can instantiate and train a random forest as follows:
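The training code itself did not come through in this export. Below is a minimal sketch of what it plausibly looks like, assuming the cleaned data has already been split into feature arrays train / test with matching label arrays train_labels / test_labels (these variable names and hyperparameter values are assumptions, not taken from the original notebook):

from sklearn.ensemble import RandomForestClassifier

# 100 trees, sqrt(n_features) candidate features per split, all CPU cores
model = RandomForestClassifier(n_estimators=100,
                               max_features="sqrt",
                               n_jobs=-1,
                               random_state=50)
model.fit(train, train_labels)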
After a few minutes of training, the model is ready to make predictions on the testing data. We make class predictions (predict) as well as predicted probabilities (predict_proba) in order to calculate the ROC AUC; a sketch of these steps is shown below.
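A sketch of the prediction and evaluation steps just described, using the same assumed test / test_labels names (roc_auc_score is the scikit-learn function for this metric):

from sklearn.metrics import roc_auc_score

# Hard class predictions and predicted probabilities for the positive class
predictions = model.predict(test)
probs = model.predict_proba(test)[:, 1]

# ROC AUC is computed from the predicted probabilities, not the hard predictions
print(f"ROC AUC: {roc_auc_score(test_labels, probs):.3f}")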
Results
Running a Classification Tree
With the increase in the use of Machine Learning algorithms to solve industry-level problems, the demand for more complex and iterative algorithms has grown. The Decision Tree Algorithm is one such algorithm, used to solve both regression and classification problems.
In this blog on Decision Tree Algorithm, you will learn the working of Decision Tree and how it can be implemented to solve real-world problems. The following topics will be covered in this blog:
Why Decision Tree?
What Is A Decision Tree?
How Does The Decision Tree Algorithm Work?
Building A Decision Tree
Practical Implementation Of Decision Tree Algorithm Using R
Before I get started with why use Decision Tree, here’s a list of Machine Learning blogs that you should go through to understand the basics:
Machine Learning Algorithms
Introduction To Classification Algorithms
Random Forest Classifier
We're all aware that there are a large number of Machine Learning algorithms that can be used for analysis, so why should you choose a Decision Tree? I've listed a few reasons in the section below.
Why Decision Tree Algorithm?
A Decision Tree is considered one of the most useful Machine Learning algorithms, since it can be used to solve a variety of problems. Here are a few reasons why you should use a Decision Tree:
It is one of the most understandable Machine Learning algorithms and can be easily interpreted.
It can be used for classification and regression problems.
Unlike most Machine Learning algorithms, it works effectively with non-linear data.
Constructing a Decision Tree is a very quick process since it uses only one feature per node to split the data.
What Is A Decision Tree Algorithm?
A Decision Tree is a Supervised Machine Learning algorithm that looks like an inverted tree, wherein each internal node represents a predictor variable (feature), each link between nodes represents a decision (split), and each leaf node represents an outcome (a value of the response variable).
To get a better understanding of a Decision Tree, let’s look at an example:
Let’s say that you hosted a huge party and you want to know how many of your guests were non-vegetarians. To solve this problem, let’s create a simple Decision Tree.
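The worked example stops here in this export, and the post promises an R implementation later on. Purely as an illustration of the idea, here is a minimal sketch in Python with scikit-learn (a substitution for the missing example, with invented guest features and data, not the author's code):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical guest data: [age, ordered_meat_dish (0/1)] -> non-vegetarian? (1/0)
X = [[25, 1], [32, 0], [41, 1], [19, 0], [55, 1], [28, 0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned splits: each internal node tests one feature,
# each leaf predicts a class (non-vegetarian or not)
print(export_text(tree, feature_names=["age", "ordered_meat_dish"]))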