jccourseraworks - Tumblr blog

jccourseraworks · 2 years

K-Means Cluster Example

We create a set of 200 samples using isotropic Gaussian blobs for clustering. Then we fit scikit-learn's KMeans estimator to find the centroids and other parameters. These are the results:

The number of iterations required to converge is: 2
The locations of the centroids are:
[[-0.25813925  1.05589975]
 [-0.91941183 -1.18551732]
 [ 1.19539276  0.13158148]]
The lowest SSE value is: 74.57960106819851
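The SSE reported here is the sum of squared distances from each sample to its nearest centroid (scikit-learn exposes it as `inertia_`). A minimal NumPy sketch of that calculation, using made-up toy points and centroids rather than the 200 blobs from this post:

```python
import numpy as np

# Toy data (hypothetical, not the data set used in the post).
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
centroids = np.array([[0.0, 0.5], [4.5, 4.0]])

# Squared Euclidean distance from every point to every centroid.
diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# Each point is assigned to its nearest centroid.
labels = sq_dists.argmin(axis=1)

# SSE: sum of each point's squared distance to its assigned centroid.
sse = sq_dists[np.arange(len(points)), labels].sum()

print("labels:", labels)
print("SSE:", sse)
```

K-means tries to pick the centroids that make this number as small as possible, which is why a lower SSE indicates tighter clusters for a fixed number of clusters.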

This is the code:

"""
Name: K-Means Cluster Example
Author: jcgomez
Date: 05/12/2022
"""

#Import

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

#Creating the Data Set

features, true_labels = make_blobs(
    n_samples=200,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

#Standardization: mean 0, standard deviation 1

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

#Initiate K-Means

kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

#Execute

kmeans.fit(scaled_features)

#Print Results:

print("The number of iterations required to converge is: ", kmeans.n_iter_)
print("The locations of the centroids are: ", kmeans.cluster_centers_)
print("The lowest SSE value is: ", kmeans.inertia_)
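The script imports `silhouette_score` but never calls it. One possible use, sketched here as an assumption about what it was meant for, is to compare several values of k on the same scaled data. The candidate range 2 to 6 is chosen only for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same synthetic data and scaling as in the post.
features, _ = make_blobs(n_samples=200, centers=3, cluster_std=2.75, random_state=42)
scaled_features = StandardScaler().fit_transform(features)

# Fit k-means for several candidate k values (the range 2..6 is an assumption).
silhouettes = {}
for k in range(2, 7):
    km = KMeans(init="random", n_clusters=k, n_init=10, max_iter=300, random_state=42)
    labels = km.fit_predict(scaled_features)
    # Silhouette ranges from -1 to 1; higher means better-separated clusters.
    silhouettes[k] = silhouette_score(scaled_features, labels)

for k, score in silhouettes.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Unlike the SSE, which keeps shrinking as k grows, the silhouette score can go down when clusters are split too finely, so it is a reasonable second opinion on the choice of k.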

Lasso Regression on Boston Housing Data Set

The housing dataset comprises 506 rows of data with 13 numerical input variables and a numerical target variable. This is the description of the variables (14th is the target):

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's

Using 10-fold cross-validation with three repeats, the model achieves a mean absolute error (MAE) of about 3.711. Since MEDV is expressed in $1000's, that is an average error of roughly $3,700.
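To make the evaluation scheme concrete, here is a small sketch of what 10-fold cross-validation with three repeats produces. It uses a placeholder array with 506 rows instead of the real data, so only the splitting is shown:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Placeholder array with the same number of rows as the housing data (506).
X_placeholder = np.zeros((506, 1))

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X_placeholder))

# 10 folds x 3 repeats = 30 train/test splits, hence 30 MAE scores.
n_splits_total = len(splits)
test_sizes = sorted({len(test_idx) for _, test_idx in splits})

print("number of splits:", n_splits_total)
print("test fold sizes:", test_sizes)
```

The reported MAE is the mean over all of these splits, which makes it less sensitive to one lucky or unlucky partition of the data.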

This is the code:

"""
Name: Lasso Regression on Boston Housing Data Set
Author: jcgomez
Date: 05/12/2022
"""

#Import

from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso

#load the dataset

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

#define model

model = Lasso(alpha=1.0)

#define model evaluation method

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

#evaluate model

scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

#force scores to be positive

scores = absolute(scores)

#result

print('Mean MAE: ', mean(scores))
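The model above uses a fixed `alpha=1.0`. A possible next step, sketched here under assumptions, is to search over a few alpha values with the same scoring and cross-validation scheme. The data is a synthetic stand-in generated with `make_regression` (so the sketch runs without a download), and the alpha grid is illustrative rather than tuned for the Boston data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Synthetic stand-in for the housing data (assumption: 13 features, as in the post).
X, y = make_regression(n_samples=200, n_features=13, n_informative=5,
                       noise=10.0, random_state=1)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Candidate alpha values are illustrative, not tuned for the Boston data.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(Lasso(max_iter=10000), param_grid,
                      scoring="neg_mean_absolute_error", cv=cv)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print("Best alpha:", best_alpha)
print("Mean MAE at best alpha:", best_mae)
```

Larger alpha values shrink more coefficients to exactly zero, so the search trades off a simpler model against a lower error.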

Random Forest Example using California housing information

We have eight variables with socio-economic information from the 1990 California census, one row per block group. These are the variables:

MedInc: median income in block group
HouseAge: median house age in block group
AveRooms: average number of rooms per household
AveBedrms: average number of bedrooms per household
Population: block group population
AveOccup: average number of household members
Latitude: block group latitude
Longitude: block group longitude

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

First, we train the model on the training split. Then we predict on the held-out test split and calculate the root mean squared error (RMSE) of those predictions.

The error (rmse) is: 0.5183576795210798
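As a reminder of what this number measures, here is a minimal worked example of the RMSE on three made-up values (they are hypothetical, not taken from the California data):

```python
import numpy as np

# Hypothetical true and predicted values, in units of $100,000.
y_true = np.array([2.0, 1.5, 3.0])
y_pred = np.array([2.5, 1.0, 3.0])

# RMSE: square root of the mean of the squared errors.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", rmse)

# The RMSE is in the target's units, so multiply by 100,000 to get dollars.
print("RMSE in dollars:", rmse * 100_000)
```

By the same conversion, the 0.518 reported for the random forest corresponds to an error of roughly $51,800 in median house value.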

This is the code:

"""
Title: Random Forest Example
Author: jcgomez
Date: 05/12/2022
"""

#Import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance
import multiprocessing

#Load the data

california = fetch_california_housing()

datos = np.column_stack((california.data, california.target))
datos = pd.DataFrame(datos, columns = np.append(california.feature_names, "MEDV"))

#Training and Test data sets

X_train, X_test, y_train, y_test = train_test_split(
    datos.drop(columns = "MEDV"),
    datos['MEDV'],
    random_state = 123
)

#Model

model = RandomForestRegressor(
    n_estimators = 10,
    criterion = 'squared_error',
    max_depth = None,
    max_features = 1.0,
    oob_score = False,
    n_jobs = -1,
    random_state = 123
)

#Training the model

model.fit(X_train, y_train)

#Test Error

predict = model.predict(X = X_test)

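#RMSE as the square root of the MSE (the squared=False argument is deprecated in newer scikit-learn releases)

rmse = np.sqrt(
    mean_squared_error(
        y_true = y_test,
        y_pred = predict
    )
)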

print("The error (rmse) is: ", rmse)
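The script imports `permutation_importance` without using it. A possible follow-up, sketched under assumptions, is to rank the features by how much the test score drops when each one is shuffled. To keep the sketch runnable without downloading the census data, it uses a synthetic stand-in with 8 features from `make_regression`, so the feature indices below do not correspond to the California variables:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features (assumption: avoids downloading the real data).
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X_train, y_train)

# Shuffle each feature in turn on the test set and measure the drop in score.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=5, random_state=123)

# Rank features from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

With the real data, passing the `X_test` DataFrame and `y_test` from the script in place of the synthetic arrays should give a ranking over MedInc, HouseAge and the other variables.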

Classification Tree based on Fisher's Iris data set

We have 50 samples per species of the length and width of the sepals and petals for three species of Iris (Setosa, Virginica and Versicolor), 150 samples in total. Given a new set of measurements, we want to determine which species it belongs to.

This is the classification tree obtained:

And this is the Python code used:

"""
Testing Classification Trees
author: jcgomez
date: 05/12/2022
"""

from sklearn.datasets import load_iris
from sklearn import tree
from matplotlib import pyplot as plt

iris = load_iris()
X,Y = iris.data, iris.target
classf = tree.DecisionTreeClassifier()
output = classf.fit(X,Y)
fig = plt.figure(figsize=(25,20))
tree.plot_tree(output,feature_names=iris.feature_names, class_names=iris.target_names)
fig.savefig("decistion_tree.png")
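The code above fits and plots the tree but stops before classifying anything. A minimal sketch of that last step follows. The `new_sample` measurements are a hypothetical flower, and `random_state=0` is an addition that makes the fitted tree reproducible:

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
X, Y = iris.data, iris.target

# random_state is an addition here, to make the fitted tree reproducible.
classf = tree.DecisionTreeClassifier(random_state=0)
classf.fit(X, Y)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_sample = [[5.1, 3.5, 1.4, 0.2]]

predicted_class = classf.predict(new_sample)[0]
species = iris.target_names[predicted_class]
print("Predicted species:", species)
```

`predict` returns the numeric class label, so it is mapped back through `iris.target_names` to get a readable species name.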
