#ntree
Text
Assignment: Running a Random Forest
I am an R user, so I conducted the assignment in R instead of SAS or Python.
Load packages
library(randomForest)
library(caret)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
Load the dataset
AH_data <- read_csv("tree_addhealth.csv")
data_clean <- AH_data %>% drop_na()
Examine data
str(data_clean)
summary(data_clean)
Define predictors and target
predictors <- data_clean %>%
  select(BIO_SEX, HISPANIC, WHITE, BLACK, NAMERICAN, ASIAN, age, ALCEVR1,
         ALCPROBS1, marever1, cocever1, inhever1, cigavail, DEP1, ESTEEM1,
         VIOL1, PASSIST, DEVIANT1, SCHCONN1, GPA1, EXPEL1, FAMCONCT,
         PARACTV, PARPRES)
target <- data_clean$TREG1
Split into training and testing sets
set.seed(123)
split <- createDataPartition(target, p = 0.6, list = FALSE)
pred_train <- predictors[split, ]
pred_test <- predictors[-split, ]
tar_train <- target[split]
tar_test <- target[-split]
Train random forest model
set.seed(123)
rf_model <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = 25)
rf_pred <- predict(rf_model, pred_test)
Confusion matrix and accuracy
conf_matrix <- confusionMatrix(rf_pred, as.factor(tar_test))
print(conf_matrix)
Feature importance
importance(rf_model)
varImpPlot(rf_model)
Accuracy for different number of trees
trees <- 1:25
accuracy <- numeric(length(trees))
for (i in trees) {
  rf_temp <- randomForest(x = pred_train, y = as.factor(tar_train), ntree = i)
  pred_temp <- predict(rf_temp, pred_test)
  accuracy[i] <- mean(pred_temp == tar_test)
}
Plot accuracy vs number of trees
accuracy_df <- data.frame(trees = trees, accuracy = accuracy)
ggplot(accuracy_df, aes(x = trees, y = accuracy)) +
  geom_line(color = "blue") +
  labs(title = "Accuracy vs. Number of Trees", x = "Number of Trees", y = "Accuracy") +
  theme_minimal()

I conducted a random forest analysis to evaluate the importance of a variety of categorical and continuous explanatory variables on a categorical outcome variable: being a regular smoker. The five explanatory variables with the highest importance in predicting regular smoking were: ever having used marijuana, age, deviant behaviour, GPA, and school connectedness. The accuracy of the random forest was 83%, which was achieved within 3 trees. Growing additional trees did not add much to the overall accuracy of the model, suggesting a small number of trees is sufficient for identifying the important explanatory variables.
0 notes
Text
If you were called Ree, your name would not only mean wild, fierce, outrageous, overexcited, frenzied, delirious and crazy
But your nickname could be c*ntree
Like omg
To any of the Rees out there, I now love you simply for your name and I think you should take this new nickname
Thank you
Yours faithfully,
wedjenowif
0 notes
Text
Random Forest Classification using Lending Club data

1. What is Lending Club?
2. Decision trees and Random Forest Classification
3. Data
4. Exploratory data analysis
5. Setting up the data for the model
6. Building the model and evaluating performance
1. What is Lending Club?
Lending Club is a peer-to-peer lending company headquartered in San Francisco. The company connects people who need money (borrowers) with people who have money (investors) through its online marketplace. Investors, who are looking for a solid return on their investment, purchase Notes, which are fractions of loans. Borrowers, who need loans for various reasons such as consolidating debt, improving a home or making a major purchase, can apply for a loan by creating an account on Lendingclub.com and submitting a loan application that states the amount requested. Lending Club screens the borrowers, facilitates the transaction and services the loan. Borrowers repay the loan by making monthly payments to Lending Club.
2. Decision trees and Random Forest Classification
Decision trees and random forest classifiers help us classify our data into categories: for example, whether a customer has made a purchase (yes or no), or a person's gender (male or female). In this project we try to predict whether or not the borrower repays the loan. The goal of a decision tree model is to predict the value of the target variable based on several input variables. Intuitively, the decision tree algorithm asks a question about one attribute of the input; each answer leads to a follow-up question, until the tree reaches a conclusion about the observation.
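To make the question-asking idea concrete, here is a minimal hand-written sketch of a tiny decision tree for a borrower. The attribute names follow the data dictionary later in this post, but the thresholds and answers are made up for illustration; a real tree learns them from the training data.

# A hand-written stand-in for a fitted decision tree.
# The thresholds below are hypothetical, purely for illustration.
def tiny_tree(borrower):
    # First question: is the FICO credit score low?
    if borrower["fico"] < 660:
        # Follow-up question: is the debt-to-income ratio high?
        if borrower["dti"] > 20:
            return "not fully paid"
        return "fully paid"
    # Otherwise, ask about recent credit inquiries instead.
    if borrower["inq.last.6mths"] > 3:
        return "not fully paid"
    return "fully paid"

print(tiny_tree({"fico": 640, "dti": 25, "inq.last.6mths": 1}))  # "not fully paid"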
Random forest classification is an ensemble algorithm: rather than fitting a single decision tree, it combines many of them. We start by drawing a random sample of rows from the training data and build a decision tree on that sample, considering only a random subset of the variables at each split. We select the number of trees (Ntrees) we want to build and repeat the previous steps. For any new observation, we let the Ntrees vote on the category to which the observation belongs and assign it to the category with the majority of votes.
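As a rough sketch of that procedure in Python (scikit-learn and NumPy on synthetic data; the variable names and the choice of 25 trees are illustrative, and in practice scikit-learn's RandomForestClassifier packages all of this up for you):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

ntrees = 25
rng = np.random.default_rng(0)
forest = []
for _ in range(ntrees):
    # Bootstrap-sample the training rows and grow a tree on them;
    # max_features="sqrt" restricts each split to a random subset of variables.
    rows = rng.integers(0, len(X), size=len(X))
    forest.append(DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows]))

# For new observations, let all ntrees vote and take the majority class.
votes = np.array([tree.predict(X[:5]) for tree in forest])
print((votes.mean(axis=0) > 0.5).astype(int))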
3. Data
We use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. The data is publicly available on lendingclub.com. Here is what the columns represent:
a. credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
b. purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
c. int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
d. installment: The monthly installments owed by the borrower if the loan is funded.
e. log.annual.inc: The natural log of the self-reported annual income of the borrower.
f. dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
g. fico: The FICO credit score of the borrower.
h. days.with.cr.line: The number of days the borrower has had a credit line.
i. revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
j. revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
k. inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
l. delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
m. pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
4. Exploratory data analysis
After performing some exploratory data analysis, we observed that:
a. People with low FICO scores tend not to meet the credit underwriting criteria of Lending Club.
b. The majority of borrowers are still in the process of repaying their loans.
c. Debt consolidation is a popular reason for taking out a loan.
d. As FICO score increases, credit quality improves and the interest rate on the loan decreases.
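A minimal sketch of how such plots might be produced with pandas and matplotlib (the DataFrame name loans and the file name loan_data.csv are assumptions; the column names follow the data dictionary above):

import pandas as pd
import matplotlib.pyplot as plt

loans = pd.read_csv("loan_data.csv")  # hypothetical file name

# FICO distribution, split by whether the borrower meets the credit policy.
loans[loans["credit.policy"] == 1]["fico"].hist(bins=35, alpha=0.6, label="meets policy")
loans[loans["credit.policy"] == 0]["fico"].hist(bins=35, alpha=0.6, label="does not")
plt.xlabel("FICO score")
plt.legend()
plt.show()

# How common each loan purpose is.
loans["purpose"].value_counts().plot(kind="bar")
plt.show()

# Higher FICO scores should pair with lower interest rates.
loans.plot.scatter(x="fico", y="int.rate", alpha=0.3)
plt.show()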
5. Setting up the data for the model
As there are some categorical features in the data, we use pandas' ability to create dummy variables so that scikit-learn can understand them. We then use scikit-learn to split the data into training and test sets: we build the model on the training set and evaluate its performance on the test set. Usually such a split is 70:30 in ratio.
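A sketch of that setup step (the target column name not.fully.paid and the file name are assumptions; purpose is the categorical column from the data dictionary above):

import pandas as pd
from sklearn.model_selection import train_test_split

loans = pd.read_csv("loan_data.csv")  # hypothetical file name

# purpose is categorical, so expand it into 0/1 dummy columns.
final_data = pd.get_dummies(loans, columns=["purpose"], drop_first=True)

X = final_data.drop("not.fully.paid", axis=1)  # assumed target column
y = final_data["not.fully.paid"]

# The usual 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)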
6. Building the model and evaluating performance
We first build the decision tree model by importing DecisionTreeClassifier from sklearn.tree, fit the classifier on the training data, and predict the results for the test data. Since ensemble methods combine several machine learning models to deliver better predictions, we also try a random forest classifier to get better prediction accuracy: as we have seen above, random forest classification uses multiple decision trees to improve prediction accuracy.
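A sketch of both models, continuing from the split above (the n_estimators value is an illustrative choice):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# A single decision tree first.
dtree = DecisionTreeClassifier().fit(X_train, y_train)
tree_preds = dtree.predict(X_test)
print(confusion_matrix(y_test, tree_preds))
print(classification_report(y_test, tree_preds))

# Then a random forest, which typically improves on the single tree.
rfc = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)
print(confusion_matrix(y_test, rfc_preds))
print(classification_report(y_test, rfc_preds))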
About Rang Technologies: Headquartered in New Jersey, Rang Technologies has spent over a decade delivering innovative solutions and top talent to help businesses get the most out of the latest technologies in their digital transformation journey.
0 notes
Photo

Roscoe Riley
Roscoe the raccoon, just taking a walk through the forest, paying attention to every step he takes, all the sounds, the scent of the mud, trees and grass. Enjoying some time alone, deep in his own thoughts. Pic for Backlash's Patreon reward (thank you for the support!). Digital, Photoshop.
116 notes
Text
Fans call RRR actor Olivia Morris, who played Jr NTR's lady-love Jennifer, a surprise package. Read her thank-you note
You may have been floored by the hook step of "Naatu Naatu" (Dance Dance) from RRR, but if there is one actor who could come close to the nearly impossible footwork of Jr NTR and Ram Charan, it is Olivia Morris. Olivia played Jennifer (Jenny), Jr NTR's on-screen love, in the SS Rajamouli magnum opus. So, when Jr NTR's Bheem recalls the lines "Don't call me memsaab. It's just Jenny", it…
#Who is RRR's Jennifer?#SS Rajamouli#RRR Jennifer#RRR#RRR cast#RRR Jenny#RRR British actor#Olivia RRR#Olivia Morris#Olivia Morris Jennifer RRR#comb#Junior NTR#Junior NTREE#Jennifer RRR#RRR heroine#Ram Charan
0 notes
Photo

"NTREE GARDEN VILLA & VILLA PLOTS"
Entrance arch with security cabin
Streetlights
Drainage
Underground EB line
Water connection to each garden villa plot
Green atmosphere
Amenities
Security
Planned avenue plantation
30-feet blacktopped road
Overhead tank
Solar system
Compound wall
Premium gated community villas between Sriperumbudur and Thiruvallur, starting from Rs. 17 lakhs. The site is located close to NH4, surrounded by well-known MNCs, and well connected by government buses and shuttle services. Features of the site: grand entrance arch, 30-feet blacktopped road, street lights, landscaping with avenue plantations, potable water facility, kids' play area, jogging track and convenience store. For more details contact: S.Rajkumar +91 79043 50682. #investmentproperty #realestate #investment #realtor #realestateagent #property #realestateinvesting #investing #propertyinvestment #realestateinvestor #investor #investments #househunting #invest #propertymanagement #forsale #luxuryrealestate #home #rentalproperty #dreamhome #realty #realtorlife #properties #business #realestateinvestment #luxuryhomes #investmentproperties #investors #broker #bhfyp4
1 note
Text
Running a Random Forest
TITLE 'Import credit.csv data';
FILENAME CSV "/home/debinqiu0/Practice/credit.csv" TERMSTR = CRLF;
PROC IMPORT DATAFILE = CSV OUT = credit DBMS = CSV REPLACE;
RUN;
PROC PRINT DATA = credit(OBS = 10);
RUN;
TITLE 'Create training and testing data by randomly shuffling the rows';
PROC SQL;
CREATE TABLE credit AS
SELECT * FROM credit
ORDER BY ranuni(0);
QUIT;
TITLE 'Training data with 700 observations';
DATA credit_train;
SET credit;
IF _N_ <= 700 THEN OUTPUT;
RUN;
TITLE 'Testing data with 300 observations';
DATA credit_test;
SET credit;
IF _N_ > 700 THEN OUTPUT;
RUN;
ODS GRAPHICS ON;
PROC HPFOREST DATA = credit_train;
TITLE 'Random forest for credit training data';
TARGET default / LEVEL = BINARY;
INPUT checking_balance credit_history purpose savings_balance employment_duration other_credit housing job / LEVEL = NOMINAL;
INPUT phone / LEVEL = BINARY;
INPUT months_loan_duration amount percent_of_income years_at_residence age existing_loans_count dependents / LEVEL = INTERVAL;
SAVE FILE = '/home/debinqiu0/Practice/rf_credit.sas';
RUN;
PROC HP4SCORE DATA = credit_test;
TITLE 'Predictions on credit testing data';
ID default;
SCORE FILE = '/home/debinqiu0/Practice/rf_credit.sas' OUT = rfscore;
RUN;
TITLE "Confusion matrix for testing data"; PROC FREQ DATA = rfscore; TABLES default*I_default /norow nocol nopct; RUN;
credit <- read.table("credit.txt",header = TRUE, sep = "\t")
Split into training and testing sets
set.seed(123)
train_sample <- sample(1000, 700)
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
X_train <- credit_train[-c(which(colnames(credit) %in% 'default'))]
X_test <- credit_test[-c(which(colnames(credit) %in% 'default'))]
Build model on training data
library(randomForest)
credit_rf <- randomForest(default ~ ., data = credit_train)
Make predictions on testing data
credit_rf_pred <- predict(credit_rf,X_test)
Confusion matrix and accuracy
(conf_matrix <- table(credit_test$default, credit_rf_pred))
     credit_rf_pred
       no yes
  no  197  12
  yes  58  33
(sum(diag(conf_matrix))/sum(conf_matrix))
[1] 0.7666667
Importance of explanatory variables
importance(credit_rf)
                     MeanDecreaseGini
checking_balance            35.270599
months_loan_duration        29.007441
credit_history              19.788039
purpose                     18.535077
amount                      43.913243
savings_balance             16.717235
employment_duration         19.956369
percent_of_income           14.372510
years_at_residence          13.564876
age                         33.709076
other_credit                 8.541563
housing                      9.169049
existing_loans_count         7.006525
job                         10.429602
dependents                   4.279386
phone                        5.359734
varImpPlot(credit_rf)
Run random forests with different numbers of trees and see the effect on prediction accuracy
ntree <- seq(50, 1000, by = 100)
accuracy <- numeric(length(ntree))
set.seed(123)
for (i in 1:length(ntree)) {
  credit_rf <- randomForest(default ~ ., data = credit_train, ntree = ntree[i])
  credit_rf_pred <- predict(credit_rf, X_test)
  conf_matrix <- table(credit_test$default, credit_rf_pred)
  accuracy[i] <- sum(diag(conf_matrix))/sum(conf_matrix)
}
accuracy
[1] 0.7400000 0.7633333 0.7500000 0.7600000 0.7566667 0.7600000 0.7566667 0.7666667 0.7600000 0.7566667
max(accuracy)
[1] 0.7666667
ntree[which.max(accuracy)]
[1] 750
plot(ntree, accuracy, type = 'l', main = 'accuracy vs. ntree')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
import sklearn.metrics
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
credit = pd.read_csv("credit.txt",sep = "\t")
credit = credit.dropna()
targets = LabelEncoder().fit_transform(credit['default'])
predictors = credit.loc[:, credit.columns != 'default']
Recode categorical variables as numeric variables
predictors.dtypes
for i in range(0, len(predictors.dtypes)):
    if predictors.dtypes[i] != 'int64':
        predictors[predictors.columns[i]] = LabelEncoder().fit_transform(predictors[predictors.columns[i]])
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3)
Build model on training data
classifier = RandomForestClassifier(n_estimators = 25)
classifier = classifier.fit(pred_train, tar_train)
Make predictions on testing data
predictions = classifier.predict(pred_test)
Calculate accuracy
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
Fit an extra trees model to the training data
model = ExtraTreesClassifier().fit(pred_train, tar_train)
Display the relative importance of each attribute
print(pd.Series(model.feature_importances_, index = predictors.columns).sort_values(ascending = False))
""" Running a different number of trees and see the effect of that on the accuracy of the prediction """
ntree = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950, 1000]
accuracy = []
for idx in range(len(ntree)):
    classifier = RandomForestClassifier(n_estimators = ntree[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy.append(sklearn.metrics.accuracy_score(tar_test, predictions))
pd.Series(accuracy, index = ntree).sort_values(ascending = False)
plt.plot(ntree, accuracy)
plt.show()
Calculate accuracy
sklearn.metrics.confusion_matrix(tar_test, predictions)
Out[39]:
array([[189,  32],
       [ 46,  33]])
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[40]: 0.73999999999999999
Display the relative importance of each attribute
print(pd.Series(model.feature_importances_, index = predictors.columns).sort_values(ascending = False))
checking_balance        0.133015
amount                  0.109541
months_loan_duration    0.096196
age                     0.086818
employment_duration     0.064515
credit_history          0.064045
percent_of_income       0.063428
purpose                 0.063158
savings_balance         0.055704
years_at_residence      0.052617
job                     0.045315
existing_loans_count    0.039384
other_credit            0.038604
housing                 0.035119
phone                   0.030843
dependents              0.021698
dtype: float64
pd.Series(accuracy, index = ntree).sort_values(ascending = False)
Out[43]:
850     0.760000
250     0.760000
1000    0.756667
950     0.756667
750     0.756667
650     0.756667
450     0.756667
350     0.756667
550     0.743333
50      0.740000
150     0.736667
dtype: float64
0 notes
Text
A new study highlights the push for precious and industrial metals
A new study highlights the push for precious and industrial metals #rise #stockmarket #preciousmetals
New research among 150 European pension funds with a combined AUM of $213 billion reveals that they expect the price of precious and industrial metals to rise. The study was conducted by NTree International Ltd, a company specializing in marketing, distribution and investor education. NTree represents the metal ETC range of the Global Palladium…

0 notes
Text
CSE505 Problem1-polymorphic tree Solved
4a. The ML type definition below is for a polymorphic tree, called ntree, where each internal node has a list of zero or more subtrees and each leaf node holds a single value:
datatype 'a ntree = leaf of 'a | node of 'a ntree list;
Using the map(f, l) higher-order function, define a function subst(tr, v1, v2) which returns a new ntree in which all occurrences of v1 in the input ntree tr are…

0 notes
Text
Machine Learning for Data Analysis 2
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
import sklearn.metrics
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

credit = pd.read_csv("credit.txt", sep = "\t")
credit = credit.dropna()
targets = LabelEncoder().fit_transform(credit['default'])
predictors = credit.loc[:, credit.columns != 'default']

# Recode categorical variables as numeric variables
for i in range(0, len(predictors.dtypes)):
    if predictors.dtypes[i] != 'int64':
        predictors[predictors.columns[i]] = LabelEncoder().fit_transform(predictors[predictors.columns[i]])

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size = .3)

# Build model on training data
classifier = RandomForestClassifier(n_estimators = 25)
classifier = classifier.fit(pred_train, tar_train)

# Make predictions on testing data
predictions = classifier.predict(pred_test)

# Calculate accuracy
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# Fit an extra trees model to the training data
model = ExtraTreesClassifier().fit(pred_train, tar_train)

# Display the relative importance of each attribute
print(pd.Series(model.feature_importances_, index = predictors.columns).sort_values(ascending = False))

"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
ntree = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950, 1000]
accuracy = []
for idx in range(len(ntree)):
    classifier = RandomForestClassifier(n_estimators = ntree[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy.append(sklearn.metrics.accuracy_score(tar_test, predictions))

pd.Series(accuracy, index = ntree).sort_values(ascending = False)
plt.plot(ntree, accuracy)
plt.show()
Explanation:
In the above procedure, we first build a random forest with 25 decision trees. This gives us 74% accuracy on the testing data.
We also explore the importance of the 16 explanatory variables. The three most important explanatory variables are checking_balance, amount, and months_loan_duration, which are slightly different from those obtained in R.
Finally, we run the random forest with different numbers of decision trees. The results show that we obtain the highest accuracy, 76%, when the number of trees is 850 or 250. We would choose 250, due to its lower computation time.
Output:
array([[189,  32],
       [ 46,  33]])
0.73999999999999999
checking_balance        0.133015
amount                  0.109541
months_loan_duration    0.096196
age                     0.086818
employment_duration     0.064515
credit_history          0.064045
percent_of_income       0.063428
purpose                 0.063158
savings_balance         0.055704
years_at_residence      0.052617
job                     0.045315
existing_loans_count    0.039384
other_credit            0.038604
housing                 0.035119
phone                   0.030843
dependents              0.021698
dtype: float64
850     0.760000
250     0.760000
1000    0.756667
950     0.756667
750     0.756667
650     0.756667
450     0.756667
350     0.756667
550     0.743333
50      0.740000
150     0.736667
dtype: float64
0 notes
Note
Where in Mongolia do you buy clothes? Fittroom, Dream, Ntrees or somewhere else?
Bumbgur is the best hahahah
5 notes
·
View notes
Photo

\Oniwasan update 📸/ [Kawaguchi, Saitama] Jihoin Temple "Sangyo no Ishizue" (Jihoin Temple Tricolour Garden, Kawaguchi, Saitama): photos and article updated. A contemporary karesansui (dry landscape) garden created by Takeshi Nagasaki of N-tree at a temple counted among the "Eight Views of Angyo" in Totsuka-Angyo, a town of plant nurseries 🌲. Jihoin is a temple of the Chisan branch of Shingon Buddhism, founded in the Azuchi-Momoyama period and revived in the early Edo period, and it is one of the stops on the historic tour of the Angyo area 📷 The contemporary worship hall was designed by Yukio Asari of Love Architecture, and the stone garden surrounding it is by Takeshi Nagasaki of N-tree. Totsuka-Angyo in Kawaguchi has been a "village of garden plants" since the Edo period; a short walk from the station brings you to a number of large nurseries and landscaping companies (some with open gardens). Kawaguchi Municipal Green Center 🌳 near the next station, Araijuku, has been selected as one of Japan's 100 best urban parks, and it too (presumably) was opened by drawing on the character of this town. Jihoin stands on the outskirts of Angyo. I had seen this stone garden featured in the magazine NIWA, thought it looked fantastic, and had wanted to visit whenever I rode the Saitama Rapid Railway; I finally stopped by in September 2019 on the way to Saitama Stadium 🏟. Created in 2017, the garden is named "Sangyo no Ishizue" (Tricolour Garden): true to the tricolour 🇫🇷 name, it is laid out with colourful rocks in three colours, including fine blue stones and white stones. All of the stones were sourced from the stock of local Angyo landscaping companies. The contemporary architecture and the "contemporary karesansui" design match each other well. For better photos and a detailed account of the design intent, see Love Architecture's official website! 🔗 Oniwasan article URL: https://oniwa.garden/jihoin-temple-kawaguchi-%e6%8c%81%e5%ae%9d%e9%99%a2%e5%ba%ad%e5%9c%92/ (Jihoin) https://www.instagram.com/p/B-7aXffAyu9/?igshid=t1qzt81kkcdx
#eight views of Angyo#Yukio Asari#Takeshi Nagasaki#garden#Japanese garden#garden#japanesegarden#japanesegardens#Kawaguchi#Kawaguchi City#kawaguchi#Saitama#Saitama Prefecture#saitama#saitamarailway#Totsuka-Angyo#Angyo#totsukaangyo#Saitama sightseeing#Saitama temples#temples and shrines#saitamatemple#karesansui garden#karesansui#karesansui#ntree#rock garden#rockgarden#oniwasan
0 notes
Text
Functional Programming Solution
Problem 3: In Lecture 12, we discussed game-trees, which are in general infinite, but we consider a finite version in this question. The following ML datatype defines an n-ary tree in which the branching factor can vary from one node to another:
datatype 'a ntree = leaf of 'a | node of 'a ntree list
Assume that you are given an n-ary tree in which the leaf nodes all contain an integer representing…
0 notes
Video
On July 25th, 2012, ViViD released the special single 무서운 이야기 (Horror Stories).
ViViD (비비드) was a group active from 2012 to 2016. The group was formed by Ntree Ent. and consisted of S2, Seed (씨드), Showking (쇼킹), Shin Ah-reum (신아름), Jeong A Yeong (정아영) and Park Sung-hee (박성희). This lineup released 3 singles together: "딱 걸렸어", "Breathless" and "Save It". "무서운 이야기" was a single made for the OST of the film Horror Stories. In 2015 they came back as a 4-member group; S2 and Seed had left the group. A year later a sub-unit called Thank You (땡큐) was formed, made up of Sung-hee, Ah-reum and A Yeong.
Honorable Mention: 2013 - Tahiti (타히티) - Five Beats of Hearts, their first mini-album, with the single "Love Sick".
#kpop of the day#vivid#비비드#k-pop#kpop#korean music#2012#2010s#gg#short lived#땡큐#thank you#타히티#tahiti#july
0 notes
Text
Assignment #2 Functional Programming Solution
Problem 3: In Lecture 12, we discussed game-trees, which are in general infinite, but we consider a finite version in this question. The following ML datatype defines an n-ary tree in which the branching factor can vary from one node to another:
datatype 'a ntree = leaf of 'a | node of 'a ntree list
Assume that you are given an n-ary tree in which the leaf nodes all contain an integer…
0 notes