Running a k-means Cluster Analysis
# Load the necessary libraries
library(dplyr)    # For data manipulation
library(ggplot2)  # For data visualization
library(cluster)  # For clustering analysis
# Load your data set
data <- read.csv("your_data_file.csv")
# Select your clustering variables
clustering_vars <- data %>% select(var1, var2, var3)
# Normalize the clustering variables (optional)
clustering_vars_norm <- scale(clustering_vars)
# Choose the number of clusters (k)
k <- 3
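There is no single right way to pick k. One common heuristic is the elbow method: run kmeans() for a range of k values, plot the total within-cluster sum of squares against k, and look for the bend. A minimal sketch, assuming the scaled data from above (the 1:10 range and nstart = 25 are illustrative choices, not part of the original analysis):

# Elbow method: total within-cluster SS for k = 1 to 10
wss <- sapply(1:10, function(k) kmeans(clustering_vars_norm, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")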
# Run the k-means clustering analysis
set.seed(123) # For reproducibility
kmeans_results <- kmeans(clustering_vars_norm, centers = k)
# View the cluster assignments for each observation
cluster_assignments <- kmeans_results$cluster
# Visualize the clusters using scatterplots (optional)
ggplot(data, aes(x = var1, y = var2, color = factor(cluster_assignments))) +
  geom_point() +
  labs(color = "Cluster") +
  theme_minimal()

ggplot(data, aes(x = var1, y = var3, color = factor(cluster_assignments))) +
  geom_point() +
  labs(color = "Cluster") +
  theme_minimal()

ggplot(data, aes(x = var2, y = var3, color = factor(cluster_assignments))) +
  geom_point() +
  labs(color = "Cluster") +
  theme_minimal()
# View the cluster centers (centroids)
kmeans_results$centers
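Because the clustering ran on the scaled data, these centers are on the standardized scale. If you want them in the original units, one option is to undo the scaling using the attributes that scale() stores on its result (a sketch, assuming clustering_vars_norm came from scale() as above):

# Map the standardized centers back to the original units
centers_orig <- sweep(kmeans_results$centers, 2, attr(clustering_vars_norm, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(clustering_vars_norm, "scaled:center"), "+")
centers_orig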
In this example, we first load the necessary libraries and our data set. We then select our clustering variables (var1, var2, and var3) and, if desired, normalize them. We choose k = 3 clusters and run the k-means analysis with kmeans(), setting a seed for reproducibility. Finally, we view the cluster assignment for each observation, optionally visualize the clusters with scatterplots, and inspect the cluster centers (centroids).
The output of this analysis includes the cluster assignments for each observation (stored in the cluster_assignments variable), the cluster centers (stored in the kmeans_results$centers object), and the scatterplots (if created). The cluster assignments and centers can be used to further analyze and interpret the clusters.
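A natural next step is to profile the clusters, for example by comparing the mean of each variable within each cluster. A dplyr sketch (assuming dplyr 1.0 or later for across()):

# Mean of each clustering variable per cluster
data %>%
  mutate(cluster = cluster_assignments) %>%
  group_by(cluster) %>%
  summarise(across(c(var1, var2, var3), mean))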
Running a Lasso Regression Analysis
Load your data into a statistical software package such as R or Python. You should have a dataset with a response variable and a set of predictor variables.
Preprocess the data by standardizing the predictor variables. This is important because the L1 penalty is applied to all coefficients equally, so predictors measured on larger scales would otherwise receive effectively weaker penalties and distort the variable selection.
Split your data into k-folds. The typical value of k is 10, but you can adjust this depending on the size of your dataset.
Fit a lasso regression model on the training data for each fold. The glmnet package in R and scikit-learn library in Python provide functions to fit lasso regression models.
Evaluate the performance of the model on the test data for each fold using a metric such as mean squared error (MSE), mean absolute error (MAE), or R-squared. The glmnet and scikit-learn packages also provide functions to compute these metrics.
Calculate the average of the test errors over all k-folds. This will give you an estimate of the prediction error of your model.
Use the lasso regression model with the lowest test error to select a subset of predictors. The glmnet and scikit-learn packages provide functions to extract the selected predictors.
Evaluate the performance of the final lasso regression model on a separate validation dataset.
First, let's generate a small synthetic dataset in Python:

import numpy as np
import pandas as pd
np.random.seed(123)
# Generate the predictors
X = np.random.normal(0, 1, size=(100, 10))
# Generate the response variable
y = X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.normal(0, 0.5, size=100)
# Convert to a pandas dataframe
df = pd.DataFrame(X, columns=['X'+str(i+1) for i in range(10)])
df['y'] = y
# Show the first few rows of the dataframe
print(df.head())
This will generate a dataframe df with 100 observations and 11 variables (10 predictors and 1 response variable).
Now, let's perform a Lasso regression analysis with 5-fold cross-validation using scikit-learn in Python. Here's the code:
from sklearn.linear_model import LassoCV

# Split the data into predictors and response variable
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)

# Fit the Lasso regression model
lasso_cv.fit(X, y)

# Print the coefficients
print('Coefficients:', lasso_cv.coef_)

# Print the selected variables
selected_vars = df.columns[:-1][lasso_cv.coef_ != 0]
print('Selected variables:', list(selected_vars))
This will fit a Lasso regression model with 5-fold cross-validation and print the coefficients and the selected variables. The output should look something like this:
Output:
Coefficients: [ 0.97582621  1.98263392 -0.92618318  0.  0.  0.  0.  0.  0.  0. ]
Selected variables: ['X1', 'X2', 'X3']
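One caveat: step 2 above calls for standardizing the predictors, but the snippet fits LassoCV on the raw X. A sketch of how that could look with a scikit-learn pipeline, so the scaling is learned on the training folds only (make_pipeline and StandardScaler are standard scikit-learn tools; cv=10 here simply mirrors the typical k from step 3):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Standardize inside the pipeline, then fit lasso with 10-fold CV
pipe = make_pipeline(StandardScaler(), LassoCV(cv=10))
pipe.fit(X, y)
print('Coefficients:', pipe.named_steps['lassocv'].coef_)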
Running a Random Forest
Running a random forest involves several steps:
Data Preparation: First, you need to prepare the data for the random forest model. This includes cleaning the data, handling missing values, converting categorical variables to numerical ones, and splitting the data into training and testing sets.
Model Training: Once the data is prepared, you can train the random forest model using the training data. This involves creating multiple decision trees, each with a different subset of the training data and a random selection of features.
Model Evaluation: After the model is trained, you can evaluate its performance on the testing data. You can calculate metrics such as accuracy, precision, recall, and F1-score to determine how well the model is performing.
Hyperparameter Tuning: You can optimize the performance of the random forest model by tuning its hyperparameters. Hyperparameters are parameters that are set before the model is trained, such as the number of trees in the forest, the maximum depth of the trees, and the minimum number of samples required to split a node.
Prediction: Once the model is trained and optimized, you can use it to make predictions on new data.
To run a random forest, you can use a machine learning library such as scikit-learn in Python or caret in R. These libraries provide built-in functions for data preparation, model training, model evaluation, and hyperparameter tuning. Here's an example code snippet in Python using scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data
X, y = load_data()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the random forest model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
In this example, we load the data (load_data() is a placeholder for your own loading code), split it into training and testing sets, create a random forest with 100 trees and a maximum depth of 5, train it on the training data, and evaluate its performance on the testing data using the accuracy metric.
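The snippet above fixes n_estimators and max_depth by hand; step 4 (hyperparameter tuning) is usually automated with a grid search. A minimal sketch using scikit-learn's GridSearchCV, reusing X_train and y_train from above (the grid values are illustrative, not tuned for any particular dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try a small grid of forest sizes and tree depths with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Best CV accuracy:', search.best_score_)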
Machine Learning for Data Analysis
Week 1: Running a Classification Tree (Ubuntu Linux)
As part of a peer-review assignment, I am working on decision trees in Python. This assignment is for the Coursera course “Machine Learning for Data Analysis” by Wesleyan University.
Installation on Ubuntu Linux (the commands below show a few alternative routes: the Anaconda installer, conda environments, and pip):
sudo chmod +x Anaconda3-2022.10-Linux-x86_64.sh
./Anaconda3-2022.10-Linux-x86_64.sh
conda install scikit-learn
conda install -n my_environment scikit-learn
pip install scikit-learn
pip install -U scikit-learn scipy matplotlib
sudo apt-get install graphviz
sudo apt-get install python3-pydotplus
conda create -c conda-forge -n spyder-env spyder numpy scipy pandas matplotlib sympy cython
conda create -c conda-forge -n spyder-env spyder
conda activate spyder-env
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
python -m pip install pydotplus
dot -Tpng tree.dot -o tree5.png
I have to perform a decision tree analysis to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. The data set is provided by the National Longitudinal Study of Adolescent Health (AddHealth).
To keep things simple, I am focusing on regular smoking (the TREG1 variable).
I chose a few variables to determine whether they can predict regular smoking:
predictor = dc[['HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN']]
I then changed the predictor variables to just two, gender and age, and got this tree.
And my code?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
""" Created on Wed Dec 28 11:06:15 2022
@author: rfernandez """ import numpy as np import pandas as pd import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report from sklearn import tree import pydotplus import sklearn.metrics
#Load the dataset
data = pd.read_csv("tree_addhealth.csv")
dc = data.dropna()
dc.dtypes
dc.describe()

""" Modeling and Prediction """
#Split into training and testing sets
predictor = dc[['HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN']]
target = dc.TREG1
pr_train, pr_test, t_train, t_test = train_test_split(predictor, target, test_size=0.4)
pr_train.shape
pr_test.shape
t_train.shape
t_test.shape
#Build model on training data
classif = DecisionTreeClassifier()
classif = classif.fit(pr_train, t_train)
pred = classif.predict(pr_test)
print(sklearn.metrics.confusion_matrix(t_test, pred))
print(sklearn.metrics.accuracy_score(t_test, pred))
#Displaying the decision tree
tree.export_graphviz(classif, out_file='tree_race.dot')
I converted the .dot file to a .png using the “dot -Tpng name.dot -o name.png” command.
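As an aside, pydotplus is imported above but never used; it can replace the command-line dot step entirely. A sketch (assuming Graphviz is installed, as in the setup commands above):

# Render the tree straight to PNG via pydotplus instead of the dot CLI
dot_data = tree.export_graphviz(classif, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('tree_race.png')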
#machine #learning #python #coursera