K-Means Cluster Example
We create a set of 200 samples using isotropic Gaussian blobs and cluster them with scikit-learn's KMeans, which gives us the number of iterations needed to converge, the locations of the centroids, and the lowest SSE value (the inertia). These are the results:
The number of iterations required to converge is: 2
The locations of the centroids are:
[[-0.25813925  1.05589975]
 [-0.91941183 -1.18551732]
 [ 1.19539276  0.13158148]]
The lowest SSE value is: 74.57960106819851
This is the code:
""" Name: K-Mean Cluster Example Author: jcgomez Date: 05/12/2022 """
#Import
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
#Creating the Data Set
features, true_labels = make_blobs(
    n_samples=200,
    centers=3,
    cluster_std=2.75,
    random_state=42
)
#Standardization: mean 0, standard deviation 1
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
#Instantiate K-Means
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)
#Execute
kmeans.fit(scaled_features)
#Print Results:
print("The number of iterations required to converge is: ", kmeans.n_iter_) print("The locations of the centroids are: ", kmeans.cluster_centers_) print("The lowest SSE value is: ", kmeans.inertia_)
Lasso Regression on Boston Housing Data Set
The housing dataset comprises 506 rows of data with 13 numerical input variables and one numerical target variable. This is the description of the variables (the 14th is the target):
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's
Using 10-fold cross-validation with three repeats, the model achieves a mean absolute error (MAE) of 3.711.
This is the code:
""" Name: Lasso Regresion on Boston Housing Data Set Author: jcgomez Date: 05/12/2022 """
#Import
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
#load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
#define model
model = Lasso(alpha=1.0)
#define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
#evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
#force scores to be positive
scores = absolute(scores)
#result
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
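The model above uses the default penalty alpha=1.0. In practice this hyperparameter is worth tuning; the sketch below (not part of the original run) searches a grid of alpha values with GridSearchCV, reusing the X, y, and cv objects defined above:

#Tuning alpha with a grid search (sketch; reuses X, y, and cv from above)
from numpy import arange
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    Lasso(),
    param_grid={'alpha': arange(0.01, 1.01, 0.01)},
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)
grid.fit(X, y)
print('Best alpha: ', grid.best_params_['alpha'])
print('MAE at best alpha: ', absolute(grid.best_score_))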
Random Forest Example using California housing information
Basically, we have variables representing socio-economic information for California block groups, taken from the 1990 census. These are the variables:
MedInc: median income in block group
HouseAge: median house age in block group
AveRooms: average number of rooms per household
AveBedrms: average number of bedrooms per household
Population: block group population
AveOccup: average number of household members
Latitude: block group latitude
Longitude: block group longitude
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
First, we train the model on a training split. Then, on the held-out test set, we compute the root mean squared error (RMSE) of the predictions.
The error (rmse) is: 0.5183576795210798
This is the code:
""" Title: Random Forest Example Autohr: jcgomez Date: 05/12/2022 """
#Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import multiprocessing
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.inspection import permutation_importance

california = fetch_california_housing()
datos = np.column_stack((california.data, california.target))
datos = pd.DataFrame(datos, columns=np.append(california.feature_names, "MEDV"))
#Training and Test data sets
X_train, X_test, y_train, y_test = train_test_split(
    datos.drop(columns="MEDV"),
    datos['MEDV'],
    random_state=123
)
#Model
model = RandomForestRegressor(
    n_estimators=10,
    criterion='squared_error',
    max_depth=None,
    max_features=1.0,
    oob_score=False,
    n_jobs=-1,
    random_state=123
)
#Training the model
model.fit(X_train, y_train)
#Test Error
predict = model.predict(X = X_test)
rmse = mean_squared_error(y_true=y_test, y_pred=predict, squared=False)
print("The error (rmse) is: ", rmse)
Classification Tree based on Fisher's Iris data set
Basically, we have 50 samples per species of the lengths and widths of the sepals and petals of three species of Iris (Setosa, Virginica, and Versicolor), and given a new set of measures we want to determine which species it belongs to (see the prediction sketch after the code).
This is the classification tree obtained:
[Figure: the fitted classification tree, as plotted with tree.plot_tree in the code below]
And this is the Python code used:
""" Testing Classification Trees author: jcgomez date: 05/12/2022 """
from sklearn.datasets import load_iris
from sklearn import tree
from matplotlib import pyplot as plt

iris = load_iris()
X, Y = iris.data, iris.target

classf = tree.DecisionTreeClassifier()
output = classf.fit(X, Y)

fig = plt.figure(figsize=(25, 20))
tree.plot_tree(output, feature_names=iris.feature_names, class_names=iris.target_names)
fig.savefig("decision_tree.png")
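Finally, to classify a new set of measures as described at the start, the fitted tree can be queried directly. A minimal sketch with a made-up sample (sepal length, sepal width, petal length, petal width, in cm):

#Classifying a new sample (sketch; the measurements are made up)
new_sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = classf.predict(new_sample)
print("Predicted species: ", iris.target_names[prediction[0]])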