predictandprescribe
2 posts
predictandprescribe · 7 days ago
Assignment: Running a Random Forest
I am an R user, so I conducted the assignment in R instead of SAS or Python.
Load packages
library(randomForest)
library(caret)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
Load the dataset
AH_data <- read_csv("tree_addhealth.csv")
data_clean <- AH_data %>% drop_na()
Examine data
str(data_clean)
summary(data_clean)
Define predictors and target
predictors <- data_clean %>%
  select(BIO_SEX, HISPANIC, WHITE, BLACK, NAMERICAN, ASIAN,
         age, ALCEVR1, ALCPROBS1, marever1, cocever1, inhever1,
         cigavail, DEP1, ESTEEM1, VIOL1, PASSIST, DEVIANT1,
         SCHCONN1, GPA1, EXPEL1, FAMCONCT, PARACTV, PARPRES)
target <- data_clean$TREG1
Split into training and testing sets
set.seed(123)
split <- createDataPartition(target, p = 0.6, list = FALSE)
pred_train <- predictors[split, ]
pred_test <- predictors[-split, ]
tar_train <- target[split]
tar_test <- target[-split]
Train random forest model
set.seed(123)
rf_model <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = 25)
rf_pred <- predict(rf_model, pred_test)
Confusion matrix and accuracy
conf_matrix <- confusionMatrix(rf_pred, as.factor(tar_test))
print(conf_matrix)
Feature importance
importance(rf_model)
varImpPlot(rf_model)
Accuracy for different number of trees
trees <- 1:25
accuracy <- numeric(length(trees))
for (i in trees) {
  rf_temp <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = i)
  pred_temp <- predict(rf_temp, pred_test)
  accuracy[i] <- mean(pred_temp == tar_test)
}
Plot accuracy vs number of trees
accuracy_df <- data.frame(trees = trees, accuracy = accuracy)
ggplot(accuracy_df, aes(x = trees, y = accuracy)) +
  geom_line(color = "blue") +
  labs(title = "Accuracy vs. Number of Trees",
       x = "Number of Trees",
       y = "Accuracy") +
  theme_minimal()

I conducted a random forest analysis to evaluate the importance of a variety of categorical and continuous explanatory variables on a categorical outcome variable: being a regular smoker. The five explanatory variables with the highest importance in predicting regular smoking were: ever having used marijuana, age, deviant behaviour, GPA, and school connectedness. The accuracy of the random forest was 83%, which was achieved within 3 trees. Growing additional trees did not add much to the overall accuracy of the model, suggesting a small number of trees is sufficient for identifying the important explanatory variables.
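As a side note, the top-five ranking can also be pulled directly from the importance matrix rather than read off the plot. A minimal sketch, assuming the rf_model object fitted above:

```r
# importance() returns a matrix with one row per predictor;
# sort the mean decrease in Gini and keep the five largest
imp <- importance(rf_model)
top5 <- head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 5)
print(top5)
```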
predictandprescribe · 7 days ago
Assignment: Running a Classification Tree
As an R user, I chose to translate the Python code into R to run my analyses. Below is my R code for the assignment:
Load packages
library(ggplot2)
library(rpart)
library(rpart.plot)
library(caret)
Load and clean dataset
AH_data <- read.csv("tree_addhealth.csv", stringsAsFactors = TRUE)
Remove rows with missing values
data_clean <- na.omit(AH_data)
Check data structure
str(data_clean)
summary(data_clean)
Modelling and prediction
Define predictors and target
predictors <- data_clean[, c('BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                             'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1',
                             'cigavail','DEP1','ESTEEM1','VIOL1','PASSIST','DEVIANT1',
                             'SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES')]
target <- data_clean$TREG1
Split into training and testing sets (60/40)
set.seed(123)  # for reproducibility
train_index <- createDataPartition(target, p = 0.6, list = FALSE)
pred_train <- predictors[train_index, ]
pred_test <- predictors[-train_index, ]
tar_train <- target[train_index]
tar_test <- target[-train_index]
Train decision tree
classifier <- rpart(tar_train ~ ., data = data.frame(pred_train, tar_train), method = "class")
Predict on test data
predictions <- predict(classifier, newdata = pred_test, type = "class")
Confusion matrix and accuracy
# Ensure the target and predictions are factors with the same levels
tar_test <- factor(tar_test)
predictions <- factor(predictions, levels = levels(tar_test))
# Now compute the confusion matrix
confusionMatrix(predictions, tar_test)
Visualize the tree
rpart.plot(classifier, type = 3, extra = 101, fallen.leaves = TRUE)
Below is the output:
[Image: plot of the fitted classification tree]
This decision tree explores the relationships between regular tobacco use and the predictive variables. The results demonstrate that the majority of people are not regular smokers. The first branch is whether someone has ever used marijuana. The highest risk group for regular smoking were people who had used marijuana, were white, had a family connectedness score (FAMCONCT) below 23, had a GPA greater than or equal to 2.7, and had a violence score greater than or equal to 4. The people who are regular smokers have multiple concurrent risk factors.
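The relative weight of these risk factors can also be read from the tree object itself rather than the plot. A minimal sketch, assuming the classifier object fitted above:

```r
# rpart stores an importance score for every variable involved in splits
# (including surrogate splits); larger values indicate larger contributions
round(classifier$variable.importance, 2)
```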