assignment-2
Data
The dataset is related to red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import accuracy_score
import seaborn as sns
%matplotlib inline
rnd_state = 4536
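The feature selection idea mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (two informative columns and three noise columns standing in for the physicochemical inputs), and ranking by the impurity-based importances of an ExtraTreesClassifier is just one of several possible selection methods, not necessarily the one used in the assignment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the wine table (assumption, not the real data):
# columns 0 and 1 drive the label, columns 2-4 are pure noise.
rng = np.random.default_rng(0)
n_samples = 600
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; a higher value means the feature
# was more useful for splitting across the ensemble.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

Features near the bottom of such a ranking are candidates for removal, although impurity-based importances can be biased, so it is worth confirming any selection with cross-validation.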
Decision Tree
Decision trees are predictive models that explore nonlinear relationships and interactions among explanatory variables. When the response variable is categorical, the model is called a classification tree. Decision trees create segmentations by repeatedly applying a series of rules to choose the variable sets that best predict the response variable.
My data set does not have categorical response or explanatory variables, so I created some for this exercise. High CO2 emissions are defined as 30E9 metric tons or more.
The generated decision tree can be found below:
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested.
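To make the "all cut points are tested" step concrete, here is a minimal sketch of an exhaustive threshold search for a single quantitative variable. It assumes Gini impurity as the split criterion (the default in scikit-learn's DecisionTreeClassifier); the function names and the tiny GPA/smoker arrays are made up for illustration and are not taken from the Add Health data.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def best_cut_point(x, y):
    """Try every midpoint between consecutive distinct values of x and
    return (threshold, weighted_impurity) for the lowest-impurity split."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no cut point between equal values
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y_sorted)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy values (hypothetical): GPA and a 0/1 smoker flag.
gpa = np.array([1.8, 2.0, 2.4, 2.9, 3.1, 3.5, 3.8, 4.0])
smoker = np.array([1, 1, 1, 1, 0, 1, 0, 0])
threshold, score = best_cut_point(gpa, smoker)
print(threshold, score)
```

A full tree repeats this search over every variable at every node and keeps the single best split, then recurses into the two resulting groups.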
This decision tree uses the following variables to predict the output variable TREG1 (whether the person is a regular smoker or not):
BIO_SEX – categorical – gender
GPA1 – numeric – current GPA
ALCEVR1 – binary – alcohol use
WHITE – binary – whether participant is white
BLACK – binary – whether participant is black
To train the decision tree, I split the given dataset into train and test sets in a 70/30 proportion.
From the decision tree we can observe:
Participants who had used alcohol were more likely to be smokers (up to 5 times more smokers among alcohol users).
Most smokers are white.
People with a lower GPA are more likely to be regular smokers.
Source code
import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus
RND_STATE = 55324
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['BIO_SEX', 'GPA1', 'ALCEVR1', 'WHITE', 'BLACK']]
targets = data_clean.TREG1
# 70/30 train/test split; the fixed seed makes the split reproducible
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=0.3, random_state=RND_STATE)
classifier = DecisionTreeClassifier(random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print("Confusion matrix:\n", sklearn.metrics.confusion_matrix(tar_test, predictions))
print("Accuracy: ", sklearn.metrics.accuracy_score(tar_test, predictions))
# Export the first four levels of the tree to Graphviz and render it as a PNG
out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=["sex", "gpa", "alcohol", "white", "black"], proportion=True, filled=True, max_depth=4)
graph = pydotplus.graph_from_dot_data(out.getvalue())
img = Image(data=graph.create_png())
img
with open("output" + ".png", "wb") as f:
    f.write(img.data)
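If Graphviz or pydotplus is not available, the learned rules can also be inspected as plain text. The sketch below uses scikit-learn's export_text on a small synthetic dataset; the data, the two column names and the depth limit are assumptions chosen for illustration, not the Add Health data used above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary predictors (assumption): the label simply copies the
# first column, so the tree should split on it.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = X[:, 0].astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text returns the split rules as an indented string
rules = export_text(clf, feature_names=["alcohol", "white"])
print(rules)
```

The same call should work on the classifier trained above by passing its own feature names.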