optimisticjellyfishheart-blog
A trip to Data Science
Let's get this party started ;)
DESTINATION : Purpose of pickling a model
ROUTE : When we work with data held in dictionaries, DataFrames, or any other Python object, we often want to save it to a file so that we can reuse it later or send it to someone else. This is what Python's pickle module is for: it serializes objects so that they can be saved to a file and loaded back into a program later on. A trained model is just another Python object, so it can be pickled once training is done and reloaded later without retraining.
Pickling - Pickle is used for serializing and de-serializing Python object structures, a process also called marshalling or flattening. Serialization is the process of converting an object in memory into a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be read back and de-serialized into a Python object. Pickling is not to be confused with compression! The former converts an object from one representation (data in Random Access Memory (RAM)) to another (bytes on disk), while the latter encodes data with fewer bits in order to save disk space.
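To make the save-and-load round trip concrete, here is a minimal sketch using only the standard library. The dictionary standing in for a trained model and the file name model.pkl are assumptions made for illustration; any picklable Python object would work the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: any picklable object works.
model = {"weights": [0.5, -1.2], "bias": 0.1}

# Assumed file location, chosen only for this example.
path = Path(tempfile.gettempdir()) / "model.pkl"

# Serialize: the object in memory becomes a byte stream on disk.
# Pickle files are binary, hence the "wb" mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

# De-serialize: read the byte stream back into a new Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # same contents
print(restored is model)  # but a separate object in memory
```

One caution worth keeping in mind: loading a pickle can execute arbitrary code, so only unpickle files that come from a source you trust.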
DESTINATION : Line Plot
ROUTE : Line plots are commonly used to plot the relationship between two numeric variables. A line plot, as the name suggests, draws a line that shows how the values on the y-axis rise or fall as the values on the x-axis increase. For instance, a line plot can show monthly or yearly stock prices, or temperature changes over a certain period.
Line Plot with Real World Dataset - Let’s now see how we can draw line plots with the Seaborn library using a real-world dataset. We will use the tips dataset, which contains records of bills paid by different customers at a restaurant. The dataset is available through Seaborn's load_dataset() function, which downloads it on first use. The following script loads the tips dataset and displays its first five rows.
import seaborn as sns

dataset = sns.load_dataset("tips")
dataset.head()
Output: [image: first five rows of the tips dataset]
You can see that the dataset contains seven columns: total_bill, tip, sex, smoker, day, time and size. Let’s now draw a line plot that shows the relationship between the size of the group (number of people) and the total bill paid.
sns.lineplot(x='size', y='total_bill', data=dataset, ci=None)
Note : ci stands for confidence interval. When several rows share the same x value, seaborn plots the mean of y and, by default, draws a semi-transparent band around the line showing the 95% confidence interval of that mean. You can remove the band by setting the ci parameter of the lineplot() function to None. From seaborn 0.12 onwards, ci is deprecated in favor of the errorbar parameter, so the equivalent call there uses errorbar=None.
Output: [image: line plot of total_bill against size]
DESTINATION : Scatter Plot
ROUTE : Scatter plots (also called scatter graphs, scatter charts, scatter diagrams and scattergrams) are similar to line graphs. A line graph uses a line on an X-Y axis to plot a continuous function, while a scatter plot uses dots to represent individual pieces of data. In statistics, these plots are useful for seeing whether two variables are related to each other. For example, a scatter chart can suggest a linear relationship (i.e. the points fall roughly along a straight line).
[Figure: scatter plot suggesting a linear relationship]
Correlation in Scatter Plots - The relationship between variables is called correlation. Correlation is just another word for “relationship.” For example, how much you weigh is related (correlated) to how much you eat. There are two types of correlation: positive correlation and negative correlation. If the data points rise from low x and y values to high x and y values, they are positively correlated, like in the above graph. If the points start at high y-values and fall to low y-values as x increases, they are negatively correlated. You can think of positive correlation as two quantities moving in the same direction. For example, the more you exercise, the better your cardiovascular health. “Positive” doesn’t necessarily mean “good”! More smoking goes with a higher chance of cancer, and the more you drive, the more likely you are to be in a car accident.
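The direction of a correlation can also be read off a single number. The sketch below computes the Pearson correlation coefficient by hand; the function name pearson_r and the two small data sets are made up for illustration and are not taken from any real study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Hypothetical data: y rises with x in the first pair, falls in the second.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [9, 7, 6, 4, 1]

print(f"rising:  {pearson_r(x, rising):+.2f}")   # positive value
print(f"falling: {pearson_r(x, falling):+.2f}")  # negative value
```

A value near zero would mean the dots show no clear straight-line trend in either direction.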
DESTINATION : Regression Metrics
ROUTE : Mean Absolute Error (MAE) - MAE is the average of the absolute differences between the original values and the predicted values (these differences are known as residuals). It shows how far the predictions are from the actual output on average. A small MAE suggests the model predicts well, while a large MAE suggests that the model may have trouble in certain areas. Because the absolute value discards the sign, MAE doesn’t indicate the direction of the error, i.e. whether the model is under-predicting or over-predicting.
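As a rough illustration of the definition, MAE can be computed in a few lines of plain Python. The helper name mae and the sample numbers are assumptions made for this sketch; in practice scikit-learn offers mean_absolute_error in sklearn.metrics for the same purpose.

```python
def mae(y_true, y_pred):
    """Average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Absolute residuals are 0.5, 0.0, 1.5 and 1.0, so the average is 0.75.
print(f"Mean Absolute Error : {mae(y_true, y_pred):0.2f}")

# Swapping the two lists flips every residual's sign but leaves MAE unchanged,
# which is why MAE cannot tell under-prediction from over-prediction.
print(mae(y_true, y_pred) == mae(y_pred, y_true))
```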
Mean Squared Error (MSE) - MSE takes the average of the squares of the differences between the original values and the predicted values (the residuals). Because the squared error is differentiable everywhere, its gradient is easy to compute. As we square the errors, large errors have a more pronounced effect than small ones.
Example :
# Evaluate model using MSE metric
from sklearn.metrics import mean_squared_error

# model, X_test and y_test are assumed to come from an earlier training step
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error : {mse:0.2f}")
R Squared (R2 Score) - R Squared is also known as the coefficient of determination, written R2 or r2. It measures the proportion of the variance in the dependent variable that can be predicted from the independent variables, and so indicates how close the data are to the fitted regression line. R-squared = (Variation of mean - Variation from fit line) / Variation of mean, where "Variation of mean" is the sum of squared differences between y and its mean, and "Variation from fit line" is the sum of squared residuals. More specifically, R-squared gives you the percentage of the variation in y explained by the x-variables. On the data a least-squares line was fitted to, the range is 0 to 1 (i.e. 0% to 100% of the variation in y can be explained by the x-variables). On unseen data the score can drop below 0 when the model predicts worse than simply using the mean of y. When it is 0.91, it means that 91% of the variation in the dependent variable is explained by the independent variables.
Example :
# Evaluate model using R Squared
from sklearn.metrics import r2_score

# model, X_test and y_test are assumed to come from an earlier training step
y_pred = model.predict(X_test)
r2score = r2_score(y_test, y_pred)
print(f"R2 Score: {r2score:0.2f}")
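To see where the number comes from, the R-squared formula given above, (Variation of mean - Variation from fit line) / Variation of mean, can be worked through by hand. This is only a sketch: the function name r_squared and the sample values are assumptions, and the variations are taken as sums of squared differences.

```python
def r_squared(y_true, y_pred):
    """(variation around the mean - variation around the fit) / variation around the mean."""
    mean_y = sum(y_true) / len(y_true)
    variation_of_mean = sum((t - mean_y) ** 2 for t in y_true)
    variation_from_fit = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return (variation_of_mean - variation_from_fit) / variation_of_mean

# Hypothetical actual values and three sets of predictions.
y_true = [1, 2, 3, 4, 5]
close_fit = [1.1, 1.9, 3.2, 3.9, 5.1]  # predictions near the actual values
mean_only = [3, 3, 3, 3, 3]            # always predicting the mean of y
reversed_fit = [5, 4, 3, 2, 1]         # predictions that go the wrong way

print(f"close fit    : {r_squared(y_true, close_fit):0.2f}")     # near 1
print(f"mean only    : {r_squared(y_true, mean_only):0.2f}")     # exactly 0
print(f"reversed fit : {r_squared(y_true, reversed_fit):0.2f}")  # below 0
```

The last case shows why a score below 0 is possible: the predictions are further from the actual values than the plain mean would be.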
DESTINATION : Training, Validation and Test Datasets
ROUTE : Training Dataset - The sample of data used to fit the model. The model sees and learns from this data.
Validation Dataset - The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
The validation set is used to evaluate a given model frequently during development. We as machine learning engineers use this data to fine-tune the model hyperparameters. Hence the model occasionally sees this data, but it never “learns” from it. We use the validation set results to update higher-level hyperparameters. So the validation set does affect the model, but only indirectly.
Test Dataset - The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
The test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained (using the train and validation sets). The test set is generally what is used to evaluate competing models. For example, in many Kaggle competitions the validation set is released initially along with the training set, the actual test set is only released when the competition is about to close, and it is the result of the model on the test set that decides the winner. Often the validation set is used as the test set, but this is not good practice. The test set is generally well curated: it contains carefully sampled data that spans the various classes the model would face when used in the real world.
[Figure: a visualisation of the splits]
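One possible way to produce the three sets is sketched below with the standard library only. The function name, the 60/20/20 proportions and the fixed seed are all assumptions chosen for illustration; with scikit-learn, calling train_test_split from sklearn.model_selection twice gives a similar result.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the shuffled rows into three non-overlapping parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 row ids.
rows = list(range(100))
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))  # sizes of the three sets
```

The model would be fitted on train, tuned against val, and scored on test only once at the very end.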
DESTINATION : Linear Regression
ROUTE :  Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
For more information, see: https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86
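For the simple case of one explanatory variable, the fitted line can be worked out by hand with the usual least-squares formulas. The sketch below assumes a tiny made-up data set and a hypothetical helper name; for multiple explanatory variables, scikit-learn's LinearRegression in sklearn.linear_model is the more practical route.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"y = {slope:0.2f} * x + {intercept:0.2f}")

# Use the fitted line to predict the response for a new x value.
prediction = slope * 6 + intercept
print(f"prediction for x = 6 : {prediction:0.2f}")
```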
DESTINATION : Regression Plot
ROUTE : A regression plot, as the name suggests, draws a regression line between two variables and helps to visualize their linear relationship.
DESTINATION: Box Plot
ROUTE : A box plot is a visual representation of groups of numerical data through their quartiles.
A box plot is also used for detecting outliers in a data set. It captures the summary of the data efficiently with a simple box and whiskers, and allows us to compare easily across groups.
A box plot summarizes sample data using the 25th, 50th and 75th percentiles. These percentiles are also known as the lower quartile, median and upper quartile. A box plot consists of 5 things: Minimum, First Quartile (25%), Median (Second Quartile, 50%), Third Quartile (75%), Maximum.
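The five numbers behind a box plot can be computed directly, as in the sketch below. The sample bills are made up, the helper names are hypothetical, and the 1.5 x IQR rule used to flag outliers is a common convention (it is the default whisker length in matplotlib) rather than something the text above prescribes.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Points further than k * IQR below Q1 or above Q3 (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical bill amounts with one unusually large value.
bills = [2, 4, 4, 5, 6, 7, 8, 9, 30]

summary = five_number_summary(bills)
outliers = iqr_outliers(bills)
print(summary)   # (2, 4.0, 6.0, 8.0, 30)
print(outliers)  # [30]
```

Plotting libraries commonly draw the whiskers to the most extreme points that are not outliers and show the outliers as separate dots.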
DESTINATION : Histogram
ROUTE : A histogram is a graphical representation of the distribution of numerical data, and an excellent tool for visualizing and understanding it. It also works for image data, since images are nothing but a combination of picture elements (pixels) with values ranging from 0 to 255 (for 8-bit images). It is intuitively understood by almost everyone.
The x-axis of the histogram is divided into bins (ranges of values), while the y-axis represents the frequency, i.e. how many data points fall into each bin. The number of bins is a parameter which can be varied based on how you want to visualize the distribution of your data.
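The binning step can be sketched without any plotting library. The function name and the sample values below are assumptions for illustration; the code uses equal-width bins and assumes the values are not all identical.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The largest value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges

# Hypothetical measurements.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

counts, edges = histogram(values, num_bins=4)
print(edges)   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)  # [3, 5, 1, 1]

# Fewer bins give a coarser picture of the same data.
coarse_counts, _ = histogram(values, num_bins=2)
print(coarse_counts)  # [8, 2]
```

The counts are the bar heights on the y-axis, and each pair of neighbouring edges marks one bin on the x-axis.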