samaranand - Tumblr blog

samaranand · 3 years ago


assignment-2

Data

The dataset is related to red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
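One way to probe which inputs are relevant is to look at tree-ensemble feature importances. The snippet below is only a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the wine table (the column count, class weights and seeds are made up for illustration), and `ExtraTreesClassifier` is just one of several possible feature-ranking approaches.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the wine data (NOT the real dataset):
# 6 inputs, only the first 2 carry signal, classes imbalanced (~85% / 15%).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    n_repeated=0, n_clusters_per_class=1, weights=[0.85, 0.15],
    class_sep=2.0, shuffle=False, random_state=0,
)

# Fit a forest of extremely randomized trees and read its importances.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # one value per input, sums to 1
ranking = np.argsort(importances)[::-1]     # most important feature first

for idx in ranking:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

Inputs with importance close to zero are candidates for removal, although correlated inputs can share importance between them, so the ranking is a hint rather than a verdict.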

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import accuracy_score
import seaborn as sns
%matplotlib inline

rnd_state = 4536

#colors = ['#9b59b6', '#3498db', '#95a5a6', '#e74c3c', '#34495e', '#2ecc71']
plt.figure(figsize=(12, 4))
plt.subplot(121)
plt.scatter(representation[:, 0], representa


samaranand · 3 years ago


Decision Tree

Decision trees are predictive models that explore nonlinear relationships and interactions among explanatory variables. When the response variable is categorical, the model is called a classification tree. Decision trees create segmentations by repeatedly applying a series of rules to choose the variable sets that best predict the response variable.

My data set does not have categorical response or explanatory variables, so I created some for this exercise. High CO2 emissions are defined as 30E9 metric tons or more.

The generated decision tree can be found below:

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.
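To make the "all cut points are tested" idea concrete, here is a minimal sketch of an exhaustive cut-point search for one numeric variable and a binary response. It scores each candidate split with Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names (`gini`, `best_cut_point`) and the toy GPA values are hypothetical and are not taken from the Add Health data.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary (0/1) label array."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)            # share of class 1
    return 2.0 * p * (1.0 - p)     # equals 1 - p^2 - (1 - p)^2

def best_cut_point(values, labels):
    """Try every midpoint between sorted unique values; return the best one."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(values)
    candidates = (uniq[:-1] + uniq[1:]) / 2.0
    best_cut, best_impurity = None, np.inf
    for cut in candidates:
        left = labels[values <= cut]
        right = labels[values > cut]
        # Impurity of the split, weighted by the size of each side.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_cut, best_impurity = cut, weighted
    return best_cut, best_impurity

# Toy example (made-up values): low GPA rows are smokers, high GPA rows are not.
gpa = [2.0, 2.4, 2.8, 3.2, 3.6, 4.0]
smoker = [1, 1, 1, 0, 0, 0]
cut, impurity = best_cut_point(gpa, smoker)
print(cut, impurity)
```

A real tree repeats this search over every variable at every node and keeps the split with the lowest weighted impurity.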

This decision tree uses the following variables to predict the output variable (TREG1), which indicates whether a person is a regular smoker or not:

BIO_SEX – categorical – gender

GPA1 – numeric – current GPA

ALCEVR1 – binary – alcohol use

WHITE – binary – whether participant is white

BLACK – binary – whether participant is black

To train the decision tree, I split the given dataset into training and test sets in a 70/30 proportion.

From the decision tree we can observe:

Participants who used alcohol were more likely to be smokers (up to 5 times more smokers among participants who used alcohol).

Most smokers are white.

People with a lower GPA are more likely to be regular smokers.
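Observations like these can be cross-checked directly against the data with a group-wise mean. The snippet below is a sketch only: the column names `ALCEVR1` and `TREG1` come from the post, but the ten rows are made-up toy values, so the resulting ratio says nothing about the real dataset.

```python
import pandas as pd

# Toy rows (invented for illustration): 1 = yes, 0 = no.
toy = pd.DataFrame({
    "ALCEVR1": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # ever used alcohol
    "TREG1":   [0, 0, 0, 0, 1, 1, 1, 0, 1, 0],   # regular smoker
})

# Share of regular smokers within each alcohol group.
rate_by_alcohol = toy.groupby("ALCEVR1")["TREG1"].mean()

# How many times higher the smoking rate is among alcohol users.
ratio = rate_by_alcohol.loc[1] / rate_by_alcohol.loc[0]

print(rate_by_alcohol)
print(f"ratio: {ratio:.2f}")
```

Running the same two lines on `data_clean` would give the actual rates behind the tree's alcohol split.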

Source code

import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus

RND_STATE = 55324

# Load the data and drop rows with missing values
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()

# Explanatory variables and the binary response (regular smoker or not)
predictors = data_clean[['BIO_SEX', 'GPA1', 'ALCEVR1', 'WHITE', 'BLACK']]

targets = data_clean.TREG1

# 70/30 train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=0.3)

# Fit the tree on the training set and predict on the test set
classifier = DecisionTreeClassifier(random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(tar_test, predictions))
print("Accuracy: ", sklearn.metrics.accuracy_score(tar_test, predictions))

# Export the first four levels of the tree to Graphviz and render it as a PNG
out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=["sex", "gpa", "alcohol", "white", "black"], proportion=True, filled=True, max_depth=4)
graph = pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
img

# Save the rendered image to disk
with open("output.png", "wb") as f:
    f.write(img.data)
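If Graphviz or pydotplus is not available, the tree rules can also be printed as plain text with `sklearn.tree.export_text`. The sketch below is self-contained and therefore fits a small tree on randomly generated data. The feature names match the ones used above, but the values, the 1/2 coding for sex and the random labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200

# Random stand-in data (NOT the Add Health dataset).
X = np.column_stack([
    rng.integers(1, 3, n),      # sex, assumed to be coded 1/2
    rng.uniform(1.0, 4.0, n),   # gpa
    rng.integers(0, 2, n),      # alcohol
    rng.integers(0, 2, n),      # white
    rng.integers(0, 2, n),      # black
])
y = rng.integers(0, 2, n)       # random smoker / non-smoker labels

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per node: split rules are indented, leaves show the predicted class.
report = export_text(clf, feature_names=["sex", "gpa", "alcohol", "white", "black"])
print(report)
```

With the real model, passing `classifier` and the same feature names would print the rules behind the PNG.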

#assignment-1
