Lasso Regression Model
What is Lasso Regression?
Lasso regression is a type of linear regression that uses shrinkage: the coefficient estimates are shrunk towards a central point, typically zero. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This type of regression is well suited to models showing high levels of multicollinearity, or when you want to automate parts of model selection, such as variable selection/parameter elimination.
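Concretely, the lasso minimizes the residual sum of squares plus a penalty equal to alpha times the sum of the absolute values of the coefficients. To make the shrinkage idea concrete, here is a minimal sketch (not part of the original analysis below) using scikit-learn's Lasso on synthetic data; the data, variable shapes, and alpha values are illustrative assumptions. As the penalty strength alpha grows, the weaker predictors are driven exactly to zero.

# Minimal illustration of lasso shrinkage on made-up data (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))                              # 5 standardized predictors
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)   # only 2 predictors actually matter

for alpha in [0.01, 0.1, 1.0]:                             # increasing penalty strength
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))                 # weak predictors shrink to exactly 0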
Intro: This week I will try to explain the average daily volume of ethanol consumed per person in the past year with a LASSO regression model.
Dataset

National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
CSV file
File description

Variables

Response:
ETOTLCA2 -> ETHANOL: average daily volume of ethanol consumed in the past year (ounces).

Explanatory:
AGE -> AGE: age (years).
S1Q24LB -> WEIGHT: weight (pounds).
NUMPERS -> HOUSE_PEOPLE: number of persons in household.
S1Q4A -> MARRIAGE: age at first marriage (years).
S1Q8D -> WORK: age when first worked full time, 30+ hours a week (years).
S1Q12A -> INCOME: total household income in last 12 months (dollars).
SEX -> MALE: gender (2 groups).
S10Q1A63 -> CHANGE_MIND: change mind about things depending on the people you're with or what you read or saw on TV (2 groups).

All variables used are quantitative; the two 2-group variables (MALE, CHANGE_MIND) are recoded below as 0/1 indicators.
In [16]:
%pylab inline

import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
import sklearn.metrics
from sklearn.linear_model import LassoLarsCV

# Visualization
import matplotlib.pylab as plt
import seaborn as sns

pylab.rcParams['figure.figsize'] = (15, 8)

Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['plt']
`%matplotlib` prevents importing * from pylab and numpy
Data

In [2]:
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv',
                   usecols=['ETOTLCA2','AGE','S1Q24LB','NUMPERS','S1Q4A','S1Q8D','S1Q12A','SEX','S10Q1A63'])

In [17]:
# Custom dataframe
df = pd.DataFrame()

# Response variable
df['ETHANOL'] = data['ETOTLCA2'].replace(' ', np.NaN).astype(float)

# Explanatory variables
df['AGE'] = data['AGE'].replace(' ', np.NaN).replace('98', np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ', np.NaN).replace('999', np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ', np.NaN).astype(float)
df['MARRIAGE'] = data['S1Q4A'].replace(' ', np.NaN).replace('99', np.NaN).astype(float)
df['WORK'] = data['S1Q8D'].replace(' ', np.NaN).replace('99', np.NaN).replace('0', np.NaN).astype(float)
df['INCOME'] = data['S1Q12A'].replace(' ', np.NaN).astype(float)
df['MALE'] = data['SEX'].replace(' ', np.NaN).replace('2', '0').astype(float)
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ', np.NaN).replace('9', np.NaN).replace('2', '0').astype(float)
df = df.dropna()
df.describe()

Out[17]:
            ETHANOL           AGE        WEIGHT  HOUSE_PEOPLE      MARRIAGE          WORK          INCOME          MALE   CHANGE_MIND
count  15307.000000  15307.000000  15307.000000  15307.000000  15307.000000  15307.000000    15307.000000  15307.000000  15307.000000
mean       0.490314     45.146338    174.669628      2.736199     23.537532     19.021297    62375.538642      0.503234      0.148559
std        1.211545     13.486408     40.473162      1.422247      5.078884      4.496707    70392.218489      0.500006      0.355665
min        0.000300     18.000000     78.000000      1.000000     14.000000      5.000000       24.000000      0.000000      0.000000
25%        0.016800     35.000000    145.000000      2.000000     20.000000     17.000000    29999.000000      0.000000      0.000000
50%        0.103700     43.000000    170.000000      2.000000     23.000000     18.000000    50000.000000      1.000000      0.000000
75%        0.475450     54.000000    198.500000      4.000000     26.000000     21.000000    76000.000000      1.000000      0.000000
max       29.676500     94.000000    450.000000     13.000000     63.000000     71.000000  3000000.000000      1.000000      1.000000

In [4]:
TARGET = 'ETHANOL'
PREDICTORS = list(df.columns)
PREDICTORS.remove(TARGET)
df_target = df[TARGET]
df_predictors = pd.DataFrame()

Standardize predictors (mean 0, standard deviation 1)

In [5]:
for predictor in PREDICTORS:
    df_predictors[predictor] = (df[predictor] - df[predictor].mean()) / df[predictor].std()

In [6]:
df_predictors.describe()

Out[6]:
                AGE        WEIGHT  HOUSE_PEOPLE      MARRIAGE          WORK        INCOME          MALE   CHANGE_MIND
count  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04
mean  -1.290461e-16 -5.756014e-17 -6.127369e-17  1.559694e-16 -8.541181e-17 -1.114067e-17  7.519953e-17  4.270591e-17
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00
min   -2.012866e+00 -2.388487e+00 -1.220744e+00 -1.877880e+00 -3.118126e+00 -8.857732e-01 -1.006456e+00 -4.176946e-01
25%   -7.523381e-01 -7.330692e-01 -5.176310e-01 -6.965176e-01 -4.495061e-01 -4.599449e-01 -1.006456e+00 -4.176946e-01
50%   -1.591483e-01 -1.153759e-01 -5.176310e-01 -1.058366e-01 -2.271212e-01 -1.758083e-01  9.935207e-01 -4.176946e-01
75%    6.564878e-01  5.887944e-01  8.885945e-01  4.848444e-01  4.400337e-01  1.935507e-01  9.935207e-01 -4.176946e-01
max    3.622437e+00  6.802789e+00  7.216609e+00  7.769910e+00  1.155928e+01  4.173223e+01  9.935207e-01  2.393937e+00

Split: train, test

In [7]:
train_target, test_target, train_predictors, test_predictors = train_test_split(df_target, df_predictors, test_size=0.3, random_state=42)
print('Samples train: {0}'.format(len(train_target)))
print('Samples test: {0}'.format(len(test_target)))

Samples train: 10714
Samples test: 4593
Model

In [8]:
model1 = LassoLarsCV(cv=10, precompute=False)
model1.fit(train_predictors, train_target)

Out[8]:
LassoLarsCV(copy_X=True, cv=10, eps=2.2204460492503131e-16,
      fit_intercept=True, max_iter=500, max_n_alphas=1000, n_jobs=1,
      normalize=True, positive=False, precompute=False, verbose=False)

In [9]:
print('Alpha parameter: {0}'.format(model1.alpha_))

Alpha parameter: 0.0

Cross-validation selected an alpha of 0, i.e. essentially no shrinkage, so none of the coefficients are forced exactly to zero.
Regression coefficients

In [10]:
coefs = list(zip(df_predictors.columns, model1.coef_))
coefs.sort(key=lambda x: abs(x[1]), reverse=True)
print('\n'.join('{0}: {1}'.format(var, coef) for var, coef in coefs))

MALE: 0.25510029979
HOUSE_PEOPLE: -0.0664057742071
CHANGE_MIND: 0.0554624412918
WEIGHT: -0.0550868686148
WORK: -0.0437114356411
AGE: -0.0349338129742
MARRIAGE: -0.0195949134257
INCOME: -0.0138293582608
Plots

In [11]:
# plot coefficient progression
m_log_alphas = -np.log10(model1.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model1.coef_path_.T)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

Out[11]:
<matplotlib.text.Text at 0x7fc86aa34550>
In [12]:
# plot mean squared error for each fold
m_log_alphascv = -np.log10(model1.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model1.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model1.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

Out[12]:
<matplotlib.text.Text at 0x7fc86a8669d0>
Metrics

In [13]:
# MSE from training and test data
print('MSE training: {0}'.format(sklearn.metrics.mean_squared_error(train_target, model1.predict(train_predictors))))
print('MSE testing: {0}'.format(sklearn.metrics.mean_squared_error(test_target, model1.predict(test_predictors))))

MSE training: 1.51194970787
MSE testing: 1.15947090575
In [14]:
# R-square from training and test data
print('R-square training: {0}'.format(model1.score(train_predictors, train_target)))
print('R-square testing: {0}'.format(model1.score(test_predictors, test_target)))

R-square training: 0.0396962849816
R-square testing: 0.048380464543

The R-square values are low (about 0.04 on training data and 0.05 on test data), so the selected predictors explain only a small fraction of the variability in daily ethanol consumption.
K-means Cluster analysis
August 25, 2020
Cluster analysis:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
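As a quick illustration of the idea (not part of the original post, and written in Python even though the tutorial below uses R), the sketch here runs k-means on a few made-up 2-D points with scikit-learn; the point values and the choice of two clusters are assumptions, chosen only to show how observations get grouped by similarity.

# Toy k-means example: group 2-D points into clusters of mutually similar points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one tight group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])  # another tight group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two cluster centroids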
K-Means Clustering:
In this section, you will work with the Uber dataset, which contains data generated by Uber for the city of New York. Uber Technologies Inc. is a peer-to-peer ride-sharing platform. Don't worry if you don't know much about Uber; all you need to know is that the Uber platform connects you with (cab) drivers who can drive you to your destination. The data is freely available on Kaggle. The dataset contains raw data on Uber pickups with information such as the date and time of the trip along with the longitude-latitude information.
New York City has five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. At the end of this mini-project, you will apply k-means clustering to the dataset to explore it better and identify the different boroughs within New York. Along the way, you will also learn the various steps that you should take when working on a data science project in general.
Problem Understanding

There is a lot of information stored in the traffic flow of any city. This data, when mined over location, can provide information about the major attractions of the city and can help us understand its various zones, such as residential areas, office/school zones, highways, etc. This can help governments and other institutions plan the city better and enforce suitable rules and regulations accordingly; for example, a different speed limit in school and residential zones compared to highway zones.
The data, when monitored over time, can help us identify rush hours, holiday seasons, the impact of weather, and so on. This knowledge can be applied to better planning and traffic management. On a large scale, this can improve the efficiency of the city and can also help avoid disasters, or at least allow faster redirection of traffic flow after accidents.
However, this is all looking at the bigger problem. This tutorial will only concentrate on the problem of identifying the five boroughs of New York City using the k-means algorithm, so as to get a better understanding of the algorithm while learning to tackle a data science problem.
Understanding The Data

You only need to use the Uber data from 2014. You will find the following .csv files in the Kaggle link mentioned above:
uber-raw-data-apr14.csv
uber-raw-data-may14.csv
uber-raw-data-jun14.csv
uber-raw-data-jul14.csv
uber-raw-data-aug14.csv
uber-raw-data-sep14.csv

This tutorial makes use of various libraries. Remember that when you work locally, you might have to install them. You can easily do so using install.packages().
Let's now load up the data:
# Load the .csv files
apr14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-apr14.csv")
may14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-may14.csv")
jun14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jun14.csv")
jul14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jul14.csv")
aug14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-aug14.csv")
sep14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-sep14.csv")

Let's bind all the data files into one. For this, you can use the bind_rows() function under the dplyr library in R.
library(dplyr)
data14 <- bind_rows(apr14, may14, jun14, jul14, aug14, sep14)

So far, so good! Let's get a summary of the data to get an idea of what you are dealing with.
summary(data14)

  Date.Time              Lat             Lon              Base
 Length:4534327      Min.   :39.66   Min.   :-74.93   B02512: 205673
 Class :character    1st Qu.:40.72   1st Qu.:-74.00   B02598:1393113
 Mode  :character    Median :40.74   Median :-73.98   B02617:1458853
                     Mean   :40.74   Mean   :-73.97   B02682:1212789
                     3rd Qu.:40.76   3rd Qu.:-73.97   B02764: 263899
                     Max.   :42.12   Max.   :-72.07

The dataset contains the following columns:
Date.Time: the date and time of the Uber pickup;
Lat: the latitude of the Uber pickup;
Lon: the longitude of the Uber pickup;
Base: the TLC base company code affiliated with the Uber pickup.

Data Preparation

This step consists of cleaning and rearranging your data so that you can work on it more easily. It's a good idea to first think of the sparsity of the dataset and check the amount of missing data.
# VIM library for using 'aggr'
library(VIM)

# 'aggr' plots the amount of missing/imputed values in each column
aggr(data14)
As you can see, the dataset has no missing values. However, this might not always be the case with real datasets, and you will have to decide how you want to deal with such values. Popular methods include deleting the affected rows/columns or replacing the missing entries with the column mean.
You can see that the first column is Date.Time. To be able to use these values, you need to separate them. You can use the lubridate library for this: lubridate makes it simple to identify the order in which the year, month, and day appear in your dates and to manipulate them.
library(lubridate)
# Separate or mutate the Date/Time columns
data14$Date.Time <- mdy_hms(data14$Date.Time)
data14$Year <- factor(year(data14$Date.Time))
data14$Month <- factor(month(data14$Date.Time))
data14$Day <- factor(day(data14$Date.Time))
data14$Weekday <- factor(wday(data14$Date.Time))
data14$Hour <- factor(hour(data14$Date.Time))
data14$Minute <- factor(minute(data14$Date.Time))
data14$Second <- factor(second(data14$Date.Time))

Let's check out the first few rows to see what our data looks like now...
head(data14, n=10)

Date.Time            Lat      Lon       Base    Year  Month  Day  Weekday  Hour  Minute  Second
2014-04-01 00:11:00  40.7690  -73.9549  B02512  2014  4      1    3        0     11      0
2014-04-01 00:17:00  40.7267  -74.0345  B02512  2014  4      1    3        0     17      0
2014-04-01 00:21:00  40.7316  -73.9873  B02512  2014  4      1    3        0     21      0
2014-04-01 00:28:00  40.7588  -73.9776  B02512  2014  4      1    3        0     28      0
2014-04-01 00:33:00  40.7594  -73.9722  B02512  2014  4      1    3        0     33      0
2014-04-01 00:33:00  40.7383  -74.0403  B02512  2014  4      1    3        0     33      0
2014-04-01 00:39:00  40.7223  -73.9887  B02512  2014  4      1    3        0     39      0
2014-04-01 00:45:00  40.7620  -73.9790  B02512  2014  4      1    3        0     45      0
2014-04-01 00:55:00  40.7524  -73.9960  B02512  2014  4      1    3        0     55      0
2014-04-01 01:01:00  40.7575  -73.9846  B02512  2014  4      1    3        1     1       0
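The post breaks off before the clustering step itself. As a rough sketch of what that step could look like (not from the original tutorial, which is written in R; this illustration uses Python, and the single-month file, the Lat/Lon column names, and k = 5 are assumptions), one might run k-means on the pickup coordinates and treat the five resulting clusters as candidate borough centers.

# Hypothetical sketch of the clustering step on the 2014 Uber pickups.
import pandas as pd
from sklearn.cluster import KMeans

data14 = pd.read_csv("uber-raw-data-apr14.csv")   # one month shown here for brevity
coords = data14[["Lat", "Lon"]]                   # cluster on the pickup coordinates

# Five clusters, one per borough we hope to recover.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(coords)
data14["Borough"] = kmeans.labels_

print(kmeans.cluster_centers_)            # candidate borough centers (lat, lon)
print(data14["Borough"].value_counts())   # pickups assigned to each cluster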
RANDOM FOREST ANALYSIS
Code:
import xlrd
import numpy as np
import seaborn
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from matplotlib.lines import Line2D
from scipy.stats import pearsonr

# set seed to make results reproducible
RF_SEED = 30

def load_input(excel_file):
    y_prediction = []
    data = []
    feature_names = []

    loc = (excel_file)
    wb = xlrd.open_workbook(loc)
    sheet = wb.sheet_by_index(0)
    sheet.cell_value(0, 0)

    for index_row in range(0, 415):
        row = sheet.row_values(index_row)
        row = row[1:]

        if index_row == 0:
            feature_names = row
        else:
            row[0] = str(row[0]).split(".")[0]
            data.append([float(x) for x in row[:-1]])
            y_prediction.append(float(row[-1]))

    return y_prediction, data, feature_names[:-1]

def split_data_train_model(labels, data):
    # 20% examples in test data
    train, test, train_labels, test_labels = train_test_split(data,
                                                              labels,
                                                              test_size=0.2,
                                                              random_state=RF_SEED)

    # fit the regressor on the training split
    regressor = RandomForestRegressor(n_estimators=1000, random_state=RF_SEED)
    regressor.fit(train, train_labels)

    return test, test_labels, regressor

y_data, x_data, feature_names = load_input("regression_dataset.xlsx")
x_test, x_test_labels, regressor = split_data_train_model(y_data, x_data)
predictions = regressor.predict(x_test)

# find the correlation between real answer and prediction
correlation = round(pearsonr(predictions, x_test_labels)[0], 5)

output_filename = "rf_regression.png"
title_name = "Random Forest Regression - Real House Price vs Predicted House Price - correlation ({})".format(correlation)
x_axis_label = "Real House Price"
y_axis_label = "Predicted House Price"

# plot data
simple_scatter_plot(x_test_labels, predictions, output_filename, title_name, x_axis_label, y_axis_label)
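The script calls simple_scatter_plot(), but the post never shows its definition. Below is a minimal stand-in sketch of what such a helper might look like (my own illustration, not the author's original function): it scatters real vs. predicted prices, labels the axes, and saves the figure to the given filename.

# Minimal stand-in for the undefined simple_scatter_plot helper.
import matplotlib.pyplot as plt

def simple_scatter_plot(x_data, y_data, output_filename, title_name, x_axis_label, y_axis_label):
    """Scatter real vs. predicted values and save the figure to output_filename."""
    plt.figure(figsize=(10, 8))
    plt.scatter(x_data, y_data, alpha=0.5)
    plt.title(title_name)
    plt.xlabel(x_axis_label)
    plt.ylabel(y_axis_label)
    plt.savefig(output_filename)
    plt.close()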
Running a classification tree
Running on Sublime Text 3
Ctrl+B to run
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\TREES")
""" Data Engineering and Analysis """ #Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
""" Modeling and Prediction """ #Split into training and testing sets
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN', 'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1', 'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV', 'PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO
from io import StringIO
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())