Lasso Regression Model
What is Lasso Regression?
Lasso regression is a type of linear regression that uses shrinkage: the coefficient estimates are shrunk towards a central point, typically zero. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This type of regression is well suited to models showing high levels of multicollinearity, or when you want to automate parts of model selection, such as variable selection/parameter elimination.
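Concretely, the lasso minimizes the residual sum of squares plus a penalty equal to alpha times the sum of the absolute values of the coefficients. To make the shrinkage idea concrete, here is a minimal sketch (not part of the original analysis below) using scikit-learn's Lasso on synthetic data; the data, variable shapes, and alpha values are illustrative assumptions. As the penalty strength alpha grows, the weaker predictors are driven exactly to zero.

# Minimal illustration of lasso shrinkage on made-up data (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))                              # 5 standardized predictors
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)   # only 2 predictors actually matter

for alpha in [0.01, 0.1, 1.0]:                             # increasing penalty strength
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))                 # weak predictors shrink to exactly 0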
Intro: This week I will try to explain the average daily volume of ethanol consumed per person in the past year with a LASSO regression model.
Dataset

National Epidemiological Survey on Alcohol and Related Conditions (NESARC)
CSV file
File description

Variables

Response:
ETOTLCA2 -> ETHANOL: average daily volume of ethanol consumed in the past year (ounces).

Explanatory:
AGE -> AGE: age (years).
S1Q24LB -> WEIGHT: weight (pounds).
NUMPERS -> HOUSE_PEOPLE: number of persons in household.
S1Q4A -> MARRIAGE: age at first marriage (years).
S1Q8D -> WORK: age when first worked full time, 30+ hours a week (years).
S1Q12A -> INCOME: total household income in last 12 months (dollars).
SEX -> MALE: gender (2 groups).
S10Q1A63 -> CHANGE_MIND: change mind about things depending on the people you're with or what you read or saw on TV (2 groups).

All variables used are quantitative; the two 2-group variables (MALE, CHANGE_MIND) are recoded below as 0/1 indicators.
In [16]:
%pylab inline

import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
import sklearn.metrics
from sklearn.linear_model import LassoLarsCV

# Visualization
import matplotlib.pylab as plt
import seaborn as sns

pylab.rcParams['figure.figsize'] = (15, 8)

Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['plt']
`%matplotlib` prevents importing * from pylab and numpy
Data

In [2]:
# Load data
data = pd.read_csv('../datasets/NESARC/nesarc_pds.csv',
                   usecols=['ETOTLCA2','AGE','S1Q24LB','NUMPERS','S1Q4A','S1Q8D','S1Q12A','SEX','S10Q1A63'])

In [17]:
# Custom dataframe
df = pd.DataFrame()

# Response variable
df['ETHANOL'] = data['ETOTLCA2'].replace(' ', np.NaN).astype(float)

# Explanatory variables
df['AGE'] = data['AGE'].replace(' ', np.NaN).replace('98', np.NaN).astype(float)
df['WEIGHT'] = data['S1Q24LB'].replace(' ', np.NaN).replace('999', np.NaN).astype(float)
df['HOUSE_PEOPLE'] = data['NUMPERS'].replace(' ', np.NaN).astype(float)
df['MARRIAGE'] = data['S1Q4A'].replace(' ', np.NaN).replace('99', np.NaN).astype(float)
df['WORK'] = data['S1Q8D'].replace(' ', np.NaN).replace('99', np.NaN).replace('0', np.NaN).astype(float)
df['INCOME'] = data['S1Q12A'].replace(' ', np.NaN).astype(float)
df['MALE'] = data['SEX'].replace(' ', np.NaN).replace('2', '0').astype(float)
df['CHANGE_MIND'] = data['S10Q1A63'].replace(' ', np.NaN).replace('9', np.NaN).replace('2', '0').astype(float)
df = df.dropna()
df.describe()

Out[17]:
            ETHANOL           AGE        WEIGHT  HOUSE_PEOPLE      MARRIAGE          WORK          INCOME          MALE   CHANGE_MIND
count  15307.000000  15307.000000  15307.000000  15307.000000  15307.000000  15307.000000    15307.000000  15307.000000  15307.000000
mean       0.490314     45.146338    174.669628      2.736199     23.537532     19.021297    62375.538642      0.503234      0.148559
std        1.211545     13.486408     40.473162      1.422247      5.078884      4.496707    70392.218489      0.500006      0.355665
min        0.000300     18.000000     78.000000      1.000000     14.000000      5.000000       24.000000      0.000000      0.000000
25%        0.016800     35.000000    145.000000      2.000000     20.000000     17.000000    29999.000000      0.000000      0.000000
50%        0.103700     43.000000    170.000000      2.000000     23.000000     18.000000    50000.000000      1.000000      0.000000
75%        0.475450     54.000000    198.500000      4.000000     26.000000     21.000000    76000.000000      1.000000      0.000000
max       29.676500     94.000000    450.000000     13.000000     63.000000     71.000000  3000000.000000      1.000000      1.000000

In [4]:
TARGET = 'ETHANOL'
PREDICTORS = list(df.columns)
PREDICTORS.remove(TARGET)
df_target = df[TARGET]
df_predictors = pd.DataFrame()

Standardize predictors (mean 0, standard deviation 1)

In [5]:
for predictor in PREDICTORS:
    df_predictors[predictor] = (df[predictor] - df[predictor].mean()) / df[predictor].std()

In [6]:
df_predictors.describe()

Out[6]:
                AGE        WEIGHT  HOUSE_PEOPLE      MARRIAGE          WORK        INCOME          MALE   CHANGE_MIND
count  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04  1.530700e+04
mean  -1.290461e-16 -5.756014e-17 -6.127369e-17  1.559694e-16 -8.541181e-17 -1.114067e-17  7.519953e-17  4.270591e-17
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00
min   -2.012866e+00 -2.388487e+00 -1.220744e+00 -1.877880e+00 -3.118126e+00 -8.857732e-01 -1.006456e+00 -4.176946e-01
25%   -7.523381e-01 -7.330692e-01 -5.176310e-01 -6.965176e-01 -4.495061e-01 -4.599449e-01 -1.006456e+00 -4.176946e-01
50%   -1.591483e-01 -1.153759e-01 -5.176310e-01 -1.058366e-01 -2.271212e-01 -1.758083e-01  9.935207e-01 -4.176946e-01
75%    6.564878e-01  5.887944e-01  8.885945e-01  4.848444e-01  4.400337e-01  1.935507e-01  9.935207e-01 -4.176946e-01
max    3.622437e+00  6.802789e+00  7.216609e+00  7.769910e+00  1.155928e+01  4.173223e+01  9.935207e-01  2.393937e+00

Split: train, test

In [7]:
train_target, test_target, train_predictors, test_predictors = train_test_split(df_target, df_predictors, test_size=0.3, random_state=42)
print('Samples train: {0}'.format(len(train_target)))
print('Samples test: {0}'.format(len(test_target)))

Samples train: 10714
Samples test: 4593
Model

In [8]:
model1 = LassoLarsCV(cv=10, precompute=False)
model1.fit(train_predictors, train_target)

Out[8]:
LassoLarsCV(copy_X=True, cv=10, eps=2.2204460492503131e-16,
      fit_intercept=True, max_iter=500, max_n_alphas=1000, n_jobs=1,
      normalize=True, positive=False, precompute=False, verbose=False)

In [9]:
print('Alpha parameter: {0}'.format(model1.alpha_))

Alpha parameter: 0.0

Cross-validation selected an alpha of 0, i.e. essentially no shrinkage, so none of the coefficients are forced exactly to zero.
Regression coefficients

In [10]:
coefs = list(zip(df_predictors.columns, model1.coef_))
coefs.sort(key=lambda x: abs(x[1]), reverse=True)
print('\n'.join('{0}: {1}'.format(var, coef) for var, coef in coefs))

MALE: 0.25510029979
HOUSE_PEOPLE: -0.0664057742071
CHANGE_MIND: 0.0554624412918
WEIGHT: -0.0550868686148
WORK: -0.0437114356411
AGE: -0.0349338129742
MARRIAGE: -0.0195949134257
INCOME: -0.0138293582608
Plots

In [11]:
# plot coefficient progression
m_log_alphas = -np.log10(model1.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model1.coef_path_.T)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

Out[11]:
<matplotlib.text.Text at 0x7fc86aa34550>
In [12]:
# plot mean squared error for each fold
m_log_alphascv = -np.log10(model1.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model1.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model1.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model1.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

Out[12]:
<matplotlib.text.Text at 0x7fc86a8669d0>
Metrics

In [13]:
# MSE from training and test data
print('MSE training: {0}'.format(sklearn.metrics.mean_squared_error(train_target, model1.predict(train_predictors))))
print('MSE testing: {0}'.format(sklearn.metrics.mean_squared_error(test_target, model1.predict(test_predictors))))

MSE training: 1.51194970787
MSE testing: 1.15947090575
In [14]:
# R-square from training and test data
print('R-square training: {0}'.format(model1.score(train_predictors, train_target)))
print('R-square testing: {0}'.format(model1.score(test_predictors, test_target)))

R-square training: 0.0396962849816
R-square testing: 0.048380464543

The R-square values are low (about 0.04 on training data and 0.05 on test data), so the selected predictors explain only a small fraction of the variability in daily ethanol consumption.
K-means Cluster analysis
August 25, 2020
Cluster analysis:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
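As a quick illustration of the idea (not part of the original post, and written in Python even though the tutorial below uses R), the sketch here runs k-means on a few made-up 2-D points with scikit-learn; the point values and the choice of two clusters are assumptions, chosen only to show how observations get grouped by similarity.

# Toy k-means example: group 2-D points into clusters of mutually similar points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one tight group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])  # another tight group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two cluster centroids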
K-Means Clustering:
In this section, you will work with the Uber dataset, which contains data generated by Uber for the city of New York. Uber Technologies Inc. is a peer-to-peer ride-sharing platform. Don't worry if you don't know much about Uber; all you need to know is that the Uber platform connects you with (cab) drivers who can drive you to your destination. The data is freely available on Kaggle. The dataset contains raw data on Uber pickups with information such as the date and time of the trip along with the longitude-latitude information.
New York City has five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. At the end of this mini-project, you will apply k-means clustering to the dataset to explore it better and identify the different boroughs within New York. Along the way, you will also learn the various steps that you should take when working on a data science project in general.
Problem Understanding

There is a lot of information stored in the traffic flow of any city. This data, when mined over location, can provide information about the major attractions of the city and can help us understand its various zones, such as residential areas, office/school zones, highways, etc. This can help governments and other institutions plan the city better and enforce suitable rules and regulations accordingly; for example, a different speed limit in school and residential zones compared to highway zones.
The data, when monitored over time, can help us identify rush hours, holiday seasons, the impact of weather, and so on. This knowledge can be applied to better planning and traffic management. On a large scale, this can improve the efficiency of the city and can also help avoid disasters, or at least allow faster redirection of traffic flow after accidents.
However, this is all looking at the bigger problem. This tutorial will only concentrate on the problem of identifying the five boroughs of New York City using the k-means algorithm, so as to get a better understanding of the algorithm while learning to tackle a data science problem.
Understanding The Data

You only need to use the Uber data from 2014. You will find the following .csv files in the Kaggle link mentioned above:
uber-raw-data-apr14.csv
uber-raw-data-may14.csv
uber-raw-data-jun14.csv
uber-raw-data-jul14.csv
uber-raw-data-aug14.csv
uber-raw-data-sep14.csv

This tutorial makes use of various libraries. Remember that when you work locally, you might have to install them. You can easily do so using install.packages().
Let's now load up the data:
# Load the .csv files
apr14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-apr14.csv")
may14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-may14.csv")
jun14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jun14.csv")
jul14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jul14.csv")
aug14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-aug14.csv")
sep14 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-sep14.csv")

Let's bind all the data files into one. For this, you can use the bind_rows() function under the dplyr library in R.
library(dplyr)
data14 <- bind_rows(apr14, may14, jun14, jul14, aug14, sep14)

So far, so good! Let's get a summary of the data to get an idea of what you are dealing with.
summary(data14)

  Date.Time              Lat             Lon              Base
 Length:4534327      Min.   :39.66   Min.   :-74.93   B02512: 205673
 Class :character    1st Qu.:40.72   1st Qu.:-74.00   B02598:1393113
 Mode  :character    Median :40.74   Median :-73.98   B02617:1458853
                     Mean   :40.74   Mean   :-73.97   B02682:1212789
                     3rd Qu.:40.76   3rd Qu.:-73.97   B02764: 263899
                     Max.   :42.12   Max.   :-72.07

The dataset contains the following columns:
Date.Time: the date and time of the Uber pickup;
Lat: the latitude of the Uber pickup;
Lon: the longitude of the Uber pickup;
Base: the TLC base company code affiliated with the Uber pickup.

Data Preparation

This step consists of cleaning and rearranging your data so that you can work on it more easily. It's a good idea to first think of the sparsity of the dataset and check the amount of missing data.
# VIM library for using 'aggr'
library(VIM)

# 'aggr' plots the amount of missing/imputed values in each column
aggr(data14)
As you can see, the dataset has no missing values. However, this might not always be the case with real datasets, and you will have to decide how you want to deal with such values. Popular methods include deleting the affected rows/columns or replacing the missing entries with the column mean.
You can see that the first column is Date.Time. To be able to use these values, you need to separate them. You can use the lubridate library for this: lubridate makes it simple to identify the order in which the year, month, and day appear in your dates and to manipulate them.
library(lubridate)
# Separate or mutate the Date/Time columns
data14$Date.Time <- mdy_hms(data14$Date.Time)
data14$Year <- factor(year(data14$Date.Time))
data14$Month <- factor(month(data14$Date.Time))
data14$Day <- factor(day(data14$Date.Time))
data14$Weekday <- factor(wday(data14$Date.Time))
data14$Hour <- factor(hour(data14$Date.Time))
data14$Minute <- factor(minute(data14$Date.Time))
data14$Second <- factor(second(data14$Date.Time))

Let's check out the first few rows to see what our data looks like now...
head(data14, n=10)

Date.Time            Lat      Lon       Base    Year  Month  Day  Weekday  Hour  Minute  Second
2014-04-01 00:11:00  40.7690  -73.9549  B02512  2014  4      1    3        0     11      0
2014-04-01 00:17:00  40.7267  -74.0345  B02512  2014  4      1    3        0     17      0
2014-04-01 00:21:00  40.7316  -73.9873  B02512  2014  4      1    3        0     21      0
2014-04-01 00:28:00  40.7588  -73.9776  B02512  2014  4      1    3        0     28      0
2014-04-01 00:33:00  40.7594  -73.9722  B02512  2014  4      1    3        0     33      0
2014-04-01 00:33:00  40.7383  -74.0403  B02512  2014  4      1    3        0     33      0
2014-04-01 00:39:00  40.7223  -73.9887  B02512  2014  4      1    3        0     39      0
2014-04-01 00:45:00  40.7620  -73.9790  B02512  2014  4      1    3        0     45      0
2014-04-01 00:55:00  40.7524  -73.9960  B02512  2014  4      1    3        0     55      0
2014-04-01 01:01:00  40.7575  -73.9846  B02512  2014  4      1    3        1     1       0
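The post breaks off before the clustering step itself. As a rough sketch of what that step could look like (not from the original tutorial, which is written in R; this illustration uses Python, and the single-month file, the Lat/Lon column names, and k = 5 are assumptions), one might run k-means on the pickup coordinates and treat the five resulting clusters as candidate borough centers.

# Hypothetical sketch of the clustering step on the 2014 Uber pickups.
import pandas as pd
from sklearn.cluster import KMeans

data14 = pd.read_csv("uber-raw-data-apr14.csv")   # one month shown here for brevity
coords = data14[["Lat", "Lon"]]                   # cluster on the pickup coordinates

# Five clusters, one per borough we hope to recover.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(coords)
data14["Borough"] = kmeans.labels_

print(kmeans.cluster_centers_)            # candidate borough centers (lat, lon)
print(data14["Borough"].value_counts())   # pickups assigned to each cluster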
RANDOM FOREST ANALYSIS
Code:
import xlrd
import numpy as np
import seaborn
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from matplotlib.lines import Line2D
from scipy.stats import pearsonr

# set seed to make results reproducible
RF_SEED = 30

def load_input(excel_file):
    y_prediction = []
    data = []
    feature_names = []

    loc = (excel_file)
    wb = xlrd.open_workbook(loc)
    sheet = wb.sheet_by_index(0)
    sheet.cell_value(0, 0)

    for index_row in range(0, 415):
        row = sheet.row_values(index_row)
        row = row[1:]

        if index_row == 0:
            feature_names = row
        else:
            row[0] = str(row[0]).split(".")[0]
            data.append([float(x) for x in row[:-1]])
            y_prediction.append(float(row[-1]))

    return y_prediction, data, feature_names[:-1]

def split_data_train_model(labels, data):
    # 20% examples in test data
    train, test, train_labels, test_labels = train_test_split(data,
                                                              labels,
                                                              test_size=0.2,
                                                              random_state=RF_SEED)

    # fit the regressor on the training split
    regressor = RandomForestRegressor(n_estimators=1000, random_state=RF_SEED)
    regressor.fit(train, train_labels)

    return test, test_labels, regressor

y_data, x_data, feature_names = load_input("regression_dataset.xlsx")
x_test, x_test_labels, regressor = split_data_train_model(y_data, x_data)
predictions = regressor.predict(x_test)

# find the correlation between real answer and prediction
correlation = round(pearsonr(predictions, x_test_labels)[0], 5)

output_filename = "rf_regression.png"
title_name = "Random Forest Regression - Real House Price vs Predicted House Price - correlation ({})".format(correlation)
x_axis_label = "Real House Price"
y_axis_label = "Predicted House Price"

# plot data
simple_scatter_plot(x_test_labels, predictions, output_filename, title_name, x_axis_label, y_axis_label)
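The script calls simple_scatter_plot(), but the post never shows its definition. Below is a minimal stand-in sketch of what such a helper might look like (my own illustration, not the author's original function): it scatters real vs. predicted prices, labels the axes, and saves the figure to the given filename.

# Minimal stand-in for the undefined simple_scatter_plot helper.
import matplotlib.pyplot as plt

def simple_scatter_plot(x_data, y_data, output_filename, title_name, x_axis_label, y_axis_label):
    """Scatter real vs. predicted values and save the figure to output_filename."""
    plt.figure(figsize=(10, 8))
    plt.scatter(x_data, y_data, alpha=0.5)
    plt.title(title_name)
    plt.xlabel(x_axis_label)
    plt.ylabel(y_axis_label)
    plt.savefig(output_filename)
    plt.close()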
Running a classification tree
Running on Sublime Text 3
Ctrl+B to run
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\TREES")
""" Data Engineering and Analysis """ #Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
""" Modeling and Prediction """ #Split into training and testing sets
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN', 'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1', 'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV', 'PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO
from io import StringIO
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())