K-Means Cluster Example
We create a set of 200 samples using isotropic Gaussian blobs for clustering. After standardizing the features, we fit scikit-learn's KMeans estimator to find the centroids and other parameters. These are the results:
The number of iterations required to converge is: 2
The locations of the centroids are: [[-0.25813925 1.05589975]
[-0.91941183 -1.18551732]
[ 1.19539276 0.13158148]]
The lowest SSE value is: 74.57960106819851
This is the code:
"""
Name: K-Means Cluster Example
Author: jcgomez
Date: 05/12/2022
"""
#Import
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
#Creating the Data Set
features, true_labels = make_blobs(
n_samples=200,
centers=3,
cluster_std=2.75,
random_state=42
)
#Standardization: mean 0, standard deviation 1
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
#Initiate K-Means
kmeans = KMeans(
init="random",
n_clusters=3,
n_init=10,
max_iter=300,
random_state=42
)
#Execute
kmeans.fit(scaled_features)
#Print Results:
print("The number of iterations required to converge is: ", kmeans.n_iter_)
print("The locations of the centroids are: ", kmeans.cluster_centers_)
print("The lowest SSE value is: ", kmeans.inertia_)
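To make the SSE value more concrete, here is a minimal sketch that recomputes it by hand: the sum of squared distances between each sample and the centroid of the cluster it was assigned to. It repeats the setup from the listing above; the names assigned_centroids and manual_sse are introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same data and model settings as in the listing above
features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=42)
scaled_features = StandardScaler().fit_transform(features)
kmeans = KMeans(init="random", n_clusters=3, n_init=10, max_iter=300, random_state=42)
kmeans.fit(scaled_features)

# Look up the centroid assigned to each sample (illustrative name)
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# SSE: sum of squared distances from each sample to its own centroid
manual_sse = np.sum((scaled_features - assigned_centroids) ** 2)

print("SSE computed by hand: ", manual_sse)
print("SSE reported by KMeans: ", kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding, since inertia_ is the SSE of the best of the n_init runs.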
Lasso Regression on Boston Housing Data Set
The housing dataset comprises 506 rows of data with 13 numerical input variables and a numerical target variable. This is the description of the variables (14th is the target):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000's
Using 10-fold cross-validation with three repeats (30 fits in total), the model achieves a mean absolute error (MAE) of about 3.711, averaged over all folds.
This is the code:
"""
Name: Lasso Regression on Boston Housing Data Set
Author: jcgomez
Date: 05/12/2022
"""
#Import
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
#load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
#define model
model = Lasso(alpha=1.0)
#define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
#evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
#force scores to be positive
scores = absolute(scores)
#result
print('Mean MAE: ', mean(scores))
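The evaluation scheme above can be sketched on its own to show what "10-fold cross-validation with three repeats" produces. This is only an illustration: it uses synthetic data from make_regression as a stand-in (an assumption, so no download is needed), and the names X_demo, y_demo, n_fits and mae_scores are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data with 13 features (assumption; not the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=1
)

# 10 folds repeated 3 times gives 30 train/test splits
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_fits = cv.get_n_splits()

# scikit-learn reports errors as negative scores so that "higher is better"
scores = cross_val_score(
    Lasso(alpha=1.0), X_demo, y_demo,
    scoring="neg_mean_absolute_error", cv=cv
)

# Flip the sign to get ordinary (positive) MAE values, one per split
mae_scores = np.absolute(scores)

print("Number of fits: ", n_fits)
print("Mean MAE: ", mae_scores.mean())
print("Std of MAE across splits: ", mae_scores.std())
```

Reporting the standard deviation next to the mean gives a rough idea of how much the MAE varies from one split to another.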
Random Forest Example using California housing information
The data set contains socio-economic variables for California block groups from the 1990 census. These are the input variables:
MedInc median income in block group
HouseAge median house age in block group
AveRooms average number of rooms per household
AveBedrms average number of bedrooms per household
Population block group population
AveOccup average number of household members
Latitude block group latitude
Longitude block group longitude
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
First, we train the model on a training split of the data. Then we predict on the held-out test split and compute the root mean squared error (RMSE) of those predictions.
The error (rmse) is: 0.5183576795210798
This is the code:
"""
Title: Random Forest Example
Author: jcgomez
Date: 05/12/2022
"""
#Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance
import multiprocessing
datos = np.column_stack((california.data, california.target))
datos = pd.DataFrame(datos,columns = np.append(california.feature_names, "MEDV"))
#Training and Test data sets
X_train, X_test, y_train, y_test = train_test_split(
datos.drop(columns = "MEDV"),
datos['MEDV'],
random_state = 123
)
#Model
model = RandomForestRegressor(
n_estimators = 10,
criterion = 'squared_error',
max_depth = None,
max_features = 1.0,
oob_score = False,
n_jobs = -1,
random_state = 123
)
#Training the model
model.fit(X_train, y_train)
#Test Error
predict = model.predict(X = X_test)
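#RMSE as the square root of the MSE (the squared=False argument is deprecated in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(
y_true = y_test,
y_pred = predict
))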
print("The error (rmse) is: ", rmse)
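As a small illustration of what the RMSE measures, the sketch below computes it by hand and compares it with the value derived from mean_squared_error. To stay self-contained it uses synthetic data from make_regression instead of the California data (an assumption), and the names X_demo, y_demo, rmse_manual and rmse_sklearn are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features (assumption; not the census data)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=5.0, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=123
)

# Small forest with the same number of trees as the listing above
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# RMSE by hand: square the errors, average them, take the square root
errors = y_test - predictions
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity using scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_test, predictions))

print("RMSE by hand: ", rmse_manual)
print("RMSE via scikit-learn: ", rmse_sklearn)
```

Because the RMSE is in the same units as the target, a value of about 0.52 on the California data corresponds to a typical error of roughly $52,000.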
Classification Tree based on Fisher's Iris data set
We have 50 samples for each of three Iris species (Setosa, Virginica and Versicolor), each with measurements of the length and width of the sepals and petals. Given a new set of measurements, we want to determine which species it belongs to.
This is the classification tree obtained:
And this is the Python code used:
"""
Testing Classification Trees
author: jcgomez
date: 05/12/2022
"""
from sklearn.datasets import load_iris
from sklearn import tree
from matplotlib import pyplot as plt
iris = load_iris()
X,Y = iris.data, iris.target
classf = tree.DecisionTreeClassifier()
output = classf.fit(X,Y)
fig = plt.figure(figsize=(25,20))
tree.plot_tree(output,feature_names=iris.feature_names,
class_names=iris.target_names)
fig.savefig("decision_tree.png")
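The listing above only draws the tree. As a minimal sketch of the classification step itself, the following code fits the same kind of tree and predicts the species for one new flower. The measurement values are made up for illustration, and the names clf, new_measurement and predicted_name are hypothetical; random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fit a classification tree on all 150 samples
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# Hypothetical new flower: sepal length, sepal width,
# petal length, petal width (all in cm)
new_measurement = np.array([[5.0, 3.4, 1.5, 0.2]])

# predict returns the class index; map it to the species name
predicted_class = clf.predict(new_measurement)[0]
predicted_name = iris.target_names[predicted_class]

print("Predicted species: ", predicted_name)
```

Petals this small are typical of Setosa, which the tree separates from the other two species at its first split, so this example should print setosa.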