predictandprescribe
2 posts
predictandprescribe · 7 days ago
Assignment: Running a Random Forest
I am an R user, so I conducted the assignment in R instead of SAS or Python.
Load packages
library(randomForest)
library(caret)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
Load the dataset
AH_data <- read_csv("tree_addhealth.csv")
data_clean <- AH_data %>% drop_na()
Examine data
str(data_clean)
summary(data_clean)
Define predictors and target
predictors <- data_clean %>%
  select(BIO_SEX, HISPANIC, WHITE, BLACK, NAMERICAN, ASIAN,
         age, ALCEVR1, ALCPROBS1, marever1, cocever1, inhever1,
         cigavail, DEP1, ESTEEM1, VIOL1, PASSIST, DEVIANT1,
         SCHCONN1, GPA1, EXPEL1, FAMCONCT, PARACTV, PARPRES)
target <- data_clean$TREG1
Split into training and testing sets
set.seed(123)
split <- createDataPartition(target, p = 0.6, list = FALSE)
pred_train <- predictors[split, ]
pred_test <- predictors[-split, ]
tar_train <- target[split]
tar_test <- target[-split]
Train random forest model
set.seed(123)
rf_model <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = 25)
rf_pred <- predict(rf_model, pred_test)
Confusion matrix and accuracy
conf_matrix <- confusionMatrix(rf_pred, as.factor(tar_test))
print(conf_matrix)
Feature importance
importance(rf_model)
varImpPlot(rf_model)
Accuracy for different number of trees
trees <- 1:25
accuracy <- numeric(length(trees))
for (i in trees) {
  rf_temp <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = i)
  pred_temp <- predict(rf_temp, pred_test)
  accuracy[i] <- mean(pred_temp == tar_test)
}
Plot accuracy vs number of trees
accuracy_df <- data.frame(trees = trees, accuracy = accuracy)
ggplot(accuracy_df, aes(x = trees, y = accuracy)) +
  geom_line(color = "blue") +
  labs(title = "Accuracy vs. Number of Trees",
       x = "Number of Trees",
       y = "Accuracy") +
  theme_minimal()

I conducted a random forest analysis to evaluate the importance of a variety of categorical and continuous explanatory variables on a categorical outcome variable: being a regular smoker. The five explanatory variables with the highest importance in predicting regular smoking were: ever having used marijuana, age, deviant behaviour, GPA, and school connectedness. The accuracy of the random forest was 83%, which was achieved within 3 trees. Growing additional trees did not add much to the overall accuracy of the model, suggesting a small number of trees is sufficient for identifying the important explanatory variables.
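As a side note, the top-five ranking can also be pulled directly from the importance matrix rather than read off the plot. A minimal sketch, assuming the rf_model object fitted above:

```r
# importance() returns a matrix with one row per predictor;
# sort the mean decrease in Gini and keep the five largest
imp <- importance(rf_model)
top5 <- head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 5)
print(top5)
```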
predictandprescribe · 7 days ago
Assignment: Running a Classification Tree
As an R user, I chose to translate the Python code into R to run my analyses. Below is my R code for the assignment:
Load packages
library(ggplot2)
library(rpart)
library(rpart.plot)
library(caret)
Load and clean dataset
AH_data <- read.csv("tree_addhealth.csv", stringsAsFactors = TRUE)
Remove rows with missing values
data_clean <- na.omit(AH_data)
Check data structure
str(data_clean)
summary(data_clean)
Modelling and prediction
Define predictors and target
predictors <- data_clean[, c('BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
                             'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1',
                             'cigavail','DEP1','ESTEEM1','VIOL1','PASSIST','DEVIANT1',
                             'SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES')]
target <- data_clean$TREG1
Split into training and testing sets (60/40)
set.seed(123)  # for reproducibility
train_index <- createDataPartition(target, p = 0.6, list = FALSE)
pred_train <- predictors[train_index, ]
pred_test <- predictors[-train_index, ]
tar_train <- target[train_index]
tar_test <- target[-train_index]
Train decision tree
classifier <- rpart(tar_train ~ ., data = data.frame(pred_train, tar_train), method = "class")
Predict on test data
predictions <- predict(classifier, newdata = pred_test, type = "class")
Confusion matrix and accuracy
# Ensure the target and predictions are factors with the same levels
tar_test <- factor(tar_test)
predictions <- factor(predictions, levels = levels(tar_test))
# Now compute the confusion matrix
confusionMatrix(predictions, tar_test)
Visualize the tree
rpart.plot(classifier, type = 3, extra = 101, fallen.leaves = TRUE)
Below is the output:
[Image: plot of the fitted classification tree]
This decision tree explores the relationships between regular tobacco use and the predictive variables. The results demonstrate that the majority of people are not regular smokers. The first branch is whether someone has ever used marijuana. The highest risk group for regular smoking were people who had used marijuana, were white, had a family connectedness score (FAMCONCT) below 23, had a GPA greater than or equal to 2.7, and had a violence score greater than or equal to 4. The people who are regular smokers have multiple concurrent risk factors.
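The relative weight of these risk factors can also be read from the tree object itself rather than the plot. A minimal sketch, assuming the classifier object fitted above:

```r
# rpart stores an importance score for every variable involved in splits
# (including surrogate splits); larger values indicate larger contributions
round(classifier$variable.importance, 2)
```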