financedata - Tumblr blog

financedata · 6 years ago


#I used sklearn's Boston dataset to cluster Boston housing data. First, I standardized the inputs as x and set the target (price) as y. The Elbow Method shows that the bend is most pronounced at k=3. The scatter plot shows that the clusters are mostly distinct but overlap a bit. The clustering variable means show that each cluster has quite different mean values for each of the inputs. However, OLS analysis shows that both the R-squared and the p-values are weak (except for the intercept). Therefore, predicting price from the generated clusters may not be the best approach.

#I already had a limited understanding of K-Means, but here I learned that K-Means is an iterative algorithm that minimizes the distances between the observations and their cluster centroids, measured across all variables; the first step (choosing the starting centroids) is random, and observations can change clusters from one iteration to the next.
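#To make the iteration concrete, here is a minimal NumPy sketch of the assign-then-update loop. It is not sklearn's KMeans implementation; the function name kmeans_sketch, its parameters, and the six toy points are all made up for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Random first step: pick k distinct observations as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid
        # (Euclidean distance over all variables).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[new_labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # Stop once no observation changed cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

#Which group ends up labelled 0 or 1 depends on the random start, but the grouping itself should settle on the two obvious clusters.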

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

boston = load_boston()
x = pd.DataFrame(data=preprocessing.scale(boston.data.astype('float64')), columns=boston.feature_names)  #Standardize inputs
y = boston.target  #Continuous target: median home value (price)
x_train, x_test = train_test_split(x, test_size=.3, random_state=1)

clusters=range(1,10)
meandist=[]

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(x_train)
    clusassign=model.predict(x_train)
    meandist.append(sum(np.min(cdist(x_train, model.cluster_centers_, 'euclidean'), axis=1)) / x_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()  #Bending looks most pronounced at k=3

model3 = KMeans(n_clusters=3)
model3.fit(x_train)
clusassign=model3.predict(x_train)

pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(x_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

x_train.reset_index(level=0, inplace=True)
cluslist=list(x_train['index'])
labels=list(model3.labels_)
newlist=dict(zip(cluslist, labels))
newclus=pd.DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']

newclus.reset_index(level=0, inplace=True)
merged_train=pd.merge(x_train, newclus, on='index')

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)

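# Look up each training row's target by its original row index so the targets stay
# aligned with x_train. (Splitting y separately with a different random_state and then
# renumbering it from 0 pairs each observation with the wrong price.)
y_train1=pd.DataFrame({'index': x_train['index'], 'target': y[x_train['index'].to_numpy()]})

merged_train_all=pd.merge(y_train1, merged_train, on='index')
sub1 = merged_train_all[['target', 'cluster']].dropna()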

boston_mod = smf.ols(formula='target ~ C(cluster)', data=sub1).fit()
print(boston_mod.summary())

print ('means for Boston by cluster')
m1= sub1.groupby('cluster').mean()
print(m1)

print ('standard deviations for Boston by cluster')
m2= sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['target'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())


financedata · 6 years ago


#The Python code below runs a lasso regression on sklearn's Boston Housing dataset. The target is the price. The lasso shrank the estimate for AGE to 0. (The other zeroed predictor was INDUS, not shown here.) Different runs sometimes produced different results, because train_test_split is seeded with a random value from randrange, so the train/test split changes from run to run. I learned how to run lasso regressions to shrink coefficients and eliminate variables.
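#As a rough illustration of why the lasso can set a coefficient to exactly 0, here is a sketch of the soft-thresholding rule. It assumes orthonormal predictors and the objective 0.5*||y - Xb||^2 + alpha*||b||_1, in which case the lasso estimate is the least-squares estimate pulled toward 0 by alpha and clipped at 0. This is not what LassoLarsCV computes internally; the function name and the coefficient values are made up for illustration.

```python
import numpy as np

def soft_threshold(ols_coefs, alpha):
    """Lasso estimate under the orthonormal-design assumption stated above."""
    ols_coefs = np.asarray(ols_coefs, dtype=float)
    # Shrink every coefficient toward 0 by alpha; anything smaller than
    # alpha in absolute value becomes exactly 0.
    return np.sign(ols_coefs) * np.maximum(np.abs(ols_coefs) - alpha, 0.0)

# Hypothetical least-squares coefficients for three standardized predictors.
ols_coefs = np.array([3.0, -0.4, 0.1])
for alpha in (0.0, 0.5, 1.0):
    # With alpha = 0.5 the two small coefficients are zeroed out.
    print(alpha, soft_threshold(ols_coefs, alpha))
```

#Larger alpha values zero out more coefficients, which is the pattern the coefficient progression plot traces as alpha changes.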

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.datasets import load_boston
from random import randrange

boston = load_boston()
x = pd.DataFrame(boston.data, columns=boston.feature_names)
x_strd = pd.DataFrame(data=preprocessing.scale(x.astype('float64')), columns=x.columns)
y = boston.target
x_train, x_test, y_train, y_test = train_test_split(x_strd, y, test_size=.3, random_state=randrange(1,100))
model = LassoLarsCV(cv=10, precompute=False).fit(x_train, y_train)

dict(zip(x.columns, model.coef_))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(y_train, model.predict(x_train))
test_error = mean_squared_error(y_test, model.predict(x_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(x_train,y_train)
rsquared_test=model.score(x_test,y_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)


financedata · 6 years ago


#The script below loads the wine dataset, splits it into train/test sets and predictors/targets, and runs both the Random Forest Classifier and the Extra Trees Classifier. It loops n_estimators (the number of trees in the forest) from 1 to 25 and prints only the most accurate result's accuracy, cross-validation results, feature importance list, and confusion matrix. It then plots the accuracies from the loop.

#In the run pictured here, both models had the same accuracy score, but not every run did. That's because the training and test sets are randomly selected. Both models show that flavanoids and proline are the most important predictor variables.

#I learned how to construct an ML loop for multiple models and see various metrics for each, as well as how to get the names alongside the feature importance values.

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets, model_selection
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

wine = load_wine()
wineDF = pd.DataFrame(wine.data, columns=wine.feature_names)
predictors = wineDF[['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']]
targets = wine.target

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

def forest_builder(model, n_estimators, kfold, pred_train, tar_train, pred_test, tar_test, accuracy_list):
    # Fit the model with 1..n_estimators trees and keep the metrics of the most accurate fit
    accuracy_list[:] = 0  # reset, because the same array is reused for every model
    for idx in range(n_estimators):
        model.n_estimators = idx+1
        model.fit(pred_train,tar_train)
        predictions = model.predict(pred_test)

        accuracy = sklearn.metrics.accuracy_score(tar_test, predictions)
        if accuracy >= max(accuracy_list):
            # Best accuracy so far: remember this fit's metrics
            best_n = idx+1
            cv_results = model_selection.cross_val_score(model, pred_train, tar_train, cv = kfold, scoring = 'accuracy')
            cm = sklearn.metrics.confusion_matrix(tar_test,predictions)
            named_features = pd.DataFrame(model.feature_importances_, index = pred_train.columns, columns=['importance']).sort_values('importance', ascending=False)

        accuracy_list[idx] = accuracy

    # name is the global loop variable set by the calling for-loop
    print(name, 'n_estimators = ' + str(best_n)
          ,'\n Accuracy = ' + str(max(accuracy_list))
          ,'\n Cross Validation Results: \n' + str(cv_results)
          ,'\n Feature Importance: \n', named_features
          ,'\n Confusion Matrix: \n', cm
          ,'\n')

    plt.cla()
    plt.plot(range(1, n_estimators+1), accuracy_list)
    plt.show()

models = [['Random Forest', RandomForestClassifier()], ['Extra Trees', ExtraTreesClassifier()]]
n_estimators = 25
accuracy_list = np.zeros(n_estimators)
kfold=3

for name, model in models:
    forest_builder(model, n_estimators, kfold, pred_train, tar_train, pred_test, tar_test, accuracy_list)


financedata · 6 years ago


#The Python code below imports the wine dataset from sklearn, which has three classification types (0, 1, and 2) that are used as targets. All the other data attributes are loaded as predictors. A Decision Tree is fitted on them, and it predicts the 40% test set with almost 96% accuracy, at least on the seeded 60/40 partition selected. Each terminal node at the bottom of the tree classifies into only one value (the counts for the other classes are 0).

#Update on Confusion Matrix: The confusion matrix shows the number of correct and incorrect classifications for every combination of actual and predicted class. It's a 3x3 grid because there were 3 possible outputs, and each prediction could either be correct or be one of the 2 wrong answers. Rows are the actual classes and columns are the predicted classes. The diagonal from the top-left to the bottom-right holds the correct predictions; everything else is an error. The top-left position is correct predictions for classification 1, and the bottom-right is correct predictions for classification 3. The top-middle shows observations that were actually classification 1 but that the model incorrectly predicted as classification 2.
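#A small sketch of how such a grid is counted, using the same layout as sklearn.metrics.confusion_matrix (rows are actual classes, columns are predicted classes). The helper name and the labels below are made up for illustration; they are not results from the wine model.

```python
import numpy as np

def confusion_matrix_sketch(actual, predicted, n_classes):
    """Count (actual, predicted) pairs: rows = actual, columns = predicted."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1
    return matrix

# Hypothetical labels for eight observations across three classes.
actual    = [0, 0, 0, 1, 1, 1, 2, 2]
predicted = [0, 0, 1, 1, 1, 0, 2, 2]

cm = confusion_matrix_sketch(actual, predicted, n_classes=3)
# Correct predictions sit on the diagonal, so accuracy is the diagonal
# total divided by the total number of observations.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

#Here cm[0, 1] (top-middle) counts the one observation that was actually class 0 but was predicted as class 1.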

#Update on Learnings: I had a partial understanding of decision trees beforehand. I learned why decision trees are sometimes better than regressions: certain predictor variables interact, which means that some inputs may be irrelevant for one part of the sample/population, while other observations have a different set of relevant and irrelevant variables. Therefore, assigning a one-size-fits-all coefficient value in a regression isn't always appropriate. I also learned how to produce the visualizations that show the tree output, which makes the model more explainable.

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import sklearn.metrics
from io import StringIO
from IPython.display import Image
import pydotplus

# Load the data: all attributes as predictors, the three wine classes as targets
wine = load_wine()
predictors = pd.DataFrame(wine.data, columns=wine.feature_names)
targets = wine.target

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

dtree = DecisionTreeClassifier().fit(pred_train,tar_train)
predictions = dtree.predict(pred_test)

print('Confusion Matrix:\n',sklearn.metrics.confusion_matrix(tar_test,predictions))
print('Accuracy Score:\n',sklearn.metrics.accuracy_score(tar_test, predictions))

out = StringIO()
tree.export_graphviz(dtree, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
