Wonbin Data Science
Machine Learning, Statistics and Programming
Clustering
K-means
https://scikit-learn.org/stable/modules/clustering.html#k-means
This algorithm requires the number of clusters to be specified.
The K-means algorithm aims to choose centroids (mean values) that minimise the inertia, or within-cluster sum-of-squares criterion:
inertia = Σᵢ min_{µj ∈ C} ‖xi − µj‖²  (summing over all n samples, where C is the set of centroids)
Note that centroids are not, in general, points from X, although they live in the same space.
Inertia can be recognized as a measure of how internally coherent clusters are.
Inertia suffers from various drawbacks:
Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.
Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
The algorithm has three steps.
The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X.
After initialization, K-means consists of looping between the two other steps. The first step assigns each sample to its nearest centroid.
The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.
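As a concrete illustration of these three steps, here is a minimal NumPy sketch of the k-means loop (the function name and the tolerance value are my own choices for illustration, not scikit-learn's implementation):
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k samples from the dataset X as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples.
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move significantly.
        if np.linalg.norm(centroids_new - centroids) < tol:
            break
        centroids = centroids_new
    return labels, centroids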
Advantages and disadvantages
https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages
Advantages
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages
Choosing k manually.
Being dependent on initial values.
Clustering data of varying sizes and density.
Clustering outliers.
Scaling with number of dimensions.
Evaluation
https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
Elbow method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids.
(The graph below shows that k=2 is not a bad choice.)
[Figure: SSE versus k; the curve bends at the “elbow” around k=2.]
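A common way to produce this kind of plot with scikit-learn is to fit KMeans for a range of k and record the inertia_ attribute (the SSE); a rough sketch on synthetic data:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
ks = range(1, 10)
sse = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE (inertia)')
plt.show()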
DBSCAN
https://scikit-learn.org/stable/modules/clustering.html#dbscan
Clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.
There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
While the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), the parameter eps is crucial to choose appropriately for the data set and distance function, and usually cannot be left at the default value.
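A minimal usage sketch with scikit-learn (the eps and min_samples values here are illustrative, not recommendations):
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # noise points get the label -1, e.g. [ 0  0  0  1  1 -1]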
AdaBoost vs Gradient boosting
https://www.quora.com/What-is-the-difference-between-gradient-boosting-and-adaboost
Both are boosting algorithms which means that they convert a set of weak learners into a single strong learner. They both initialize a strong learner (usually a decision tree) and iteratively create a weak learner that is added to the strong learner. They differ on how they create the weak learners during the iterative process.
At each iteration, adaptive boosting changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases those of the correctly predicted instances. The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (the so-called alpha weight). The better it performs, the more it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training on a newly weighted sample distribution, the weak learner trains on the remaining errors (the so-called pseudo-residuals) of the strong learner. It is another way to give more importance to the difficult instances. At each iteration, the pseudo-residuals are computed and a weak learner is fitted to them. Then, the contribution of the weak learner (the so-called multiplier) to the strong one isn’t computed according to its performance on a re-weighted sample, but through a gradient descent optimization process: the computed contribution is the one minimizing the overall error of the strong learner.
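Both are available in scikit-learn; a minimal sketch comparing them on the same synthetic data (the hyperparameters are arbitrary, for illustration only):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for Model in (AdaBoostClassifier, GradientBoostingClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print(Model.__name__, clf.score(X_te, y_te))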
DBMS(Database Management System) Terms
What are Keys?
https://www.guru99.com/dbms-keys.html
A DBMS key is an attribute or set of attributes that helps you identify a row (tuple) in a relation (table).
What is an Entity?
https://www.tutorialspoint.com/dbms/er_model_basic_concepts.htm
An entity can be a real-world object, either animate or inanimate, that can be easily identifiable. For example, in a school database, students, teachers, classes, and courses offered can be considered as entities. All these entities have some attributes or properties that give them their identity.
What is ER(Entity Relationship) model?
https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
An ER model is an abstract data model that defines a data or information structure which can be implemented in a database, typically a relational database.
What is ER Diagram(ERD)?
https://www.lucidchart.com/pages/er-diagrams
An Entity Relationship (ER) Diagram is a type of flowchart that illustrates how “entities” such as people, objects or concepts relate to each other within a system.
ER Diagrams are most often used to design or debug relational databases in the fields of software engineering, business information systems, education and research.
Principal Component Analysis (PCA)
https://ko.wikipedia.org/wiki/%EC%A3%BC%EC%84%B1%EB%B6%84_%EB%B6%84%EC%84%9D
Principal component analysis (PCA) is a technique that reduces high-dimensional data to lower-dimensional data.
It uses an orthogonal transformation to convert samples from a high-dimensional space, where the variables may be correlated, into samples in a lower-dimensional space (the principal components) that are linearly uncorrelated. The dimension of the principal components is less than or equal to that of the original samples.
PCA linearly transforms the data into a new coordinate system such that the axis along which the projected data has the greatest variance becomes the first principal component, and the axis with the second-greatest variance becomes the second principal component.
Decomposing the data into the components that best explain its variation in this way enables many applications. The transformation is defined so that the first principal component has the largest possible variance, and each succeeding component in turn has the largest possible variance under the constraint that it is orthogonal to the preceding components. The principal components end up orthogonal because they are eigenvectors of the covariance matrix.
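A minimal scikit-learn sketch of this idea (the data and the number of components are arbitrary choices):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated 5-D data

pca = PCA(n_components=2)
Z = pca.fit_transform(X)  # project onto the first two principal components
print(pca.explained_variance_ratio_)  # fraction of variance each component explains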
Variance
What is Variance?
https://en.wikipedia.org/wiki/Variance
Informally, it measures how far a set of (random) numbers is spread out from its average value.
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. 
Bias
What is Bias?
https://en.wikipedia.org/wiki/Bias_of_an_estimator
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated.
Random Forest
Process
https://scikit-learn.org/stable/modules/ensemble.html#forest
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a random sample drawn with replacement from the training set.
Why random sampling with replacement?
https://stats.stackexchange.com/questions/447630/why-do-we-use-random-sample-with-replacement-while-implementing-random-forest
There is a theoretical foundation showing that sampling with replacement and then building an ensemble reduces the variance of the forest without increasing the bias. The same theoretical property does not hold if you sample without replacement, because sampling without replacement would lead to fairly high variance.
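A quick sketch of what a bootstrap sample looks like, together with the corresponding scikit-learn switch (bootstrap=True is the default for random forests):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
idx = rng.integers(0, 10, size=10)  # draw 10 row indices with replacement
print(sorted(idx))  # duplicates appear; some rows are left out ("out-of-bag")

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)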
Pandas Tutorial
10 minutes to pandas
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
import
import numpy as np
import pandas as pd
Object creation
s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
Viewing data
df.head()
df.tail(3)
df.index
df.columns
df.to_numpy()
df.describe()
df.T
Transposing your data
df.sort_index(axis=1, ascending=False)
Sorting by an axis
df.sort_values(by='B')
Selection
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
We recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
Not recommended
Selecting columns
df['A']
df[['A', 'B']]
Selecting rows 
df[:3]
df['20130102':'20130104']
Recommended
Quick intro
df.loc[row_indexer]
df.loc[row_indexer, column_indexer]
df.iloc[row_position]
df.iloc[row_position, column_position]
Selection by label
pandas.DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
e.g.
df.loc[dates[0]]
df.loc[:, ['A', 'B']]
df3.loc['20200606':'20200608', 'B':'C']
pandas.DataFrame.at
Access a single value for a row/column label pair.
Similar to loc, but faster
e.g.
df.at['20200606', 'A']
Selection by position
pandas.DataFrame.iloc
Purely integer-location based indexing for selection by position.
e.g.
df.iloc[3]
df.iloc[3:5, 0:2]
pandas.DataFrame.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, but faster
e.g.
df.iat[3,2]
Boolean indexing
What does indexing mean?
https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
df[df > 0]
df[df['A'] > 0]
df2[df2['E'].isin(['two', 'four'])]
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses
Setting
Setting a new column
It automatically aligns the data by the indexes.
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=dates)
df['F'] = s1
Setting values by label
df3.at['20200605','A'] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
Setting values by position
df.iat[0, 1] = 0
Setting values with where operation
df2[df2 > 0] = -df2
Missing data
pandas primarily uses the value np.nan to represent missing data.
To drop any rows that have missing data
df1.dropna(how='any')
Filling missing data
df1.fillna(value=5)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df1.fillna(value=values)
To get the boolean mask where values are nan.
pd.isna(df1)
Operations
Stats
df.mean()
df.mean(1)
Same operation on the other axis
Apply
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
String Methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
Merge
Concat
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging
Concatenating pandas objects together with concat() 
pd.concat([df1, df2, df3])
<=> df1.append([df2, df3])
A useful shortcut to concat() are the append() instance methods on Series and DataFrame.
Join
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join
SQL style merges
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
pd.merge(left=df1, right=df2, how='left', on='key')
DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
df1.merge(right=df2, how='inner', on='key')
Grouping
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby
“group by” involves one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
df.groupby('A').sum()
df.groupby(['A', 'B']).sum()
Reshaping
Stack
Pivot tables
Time series
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
years = pd.period_range('2010-01-01', '2015-01-01', freq='A')
years.asfreq('M', how='S')
Categoricals
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']}) df["grade"] = df["raw_grade"].astype("category") df["grade"].cat.categories = ["very good", "good", "very bad"]
Series.cat()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.html
Accessor object for categorical properties of the Series values.
s.cat.categories
s.cat.categories = list('abc')
s.cat.rename_categories(list('cba'))
s.cat.rename_categories({'a': 'A', 'b': 'B', 'c': 'C'})
s.cat.rename_categories(lambda x: x.upper())
and so on
Plotting
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.cumsum().plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
plt.figure()
df.cumsum().plot()
plt.legend(loc='best')
Getting data in/out
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
Gotchas
Intro to data structures
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dsintro
Series
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Series is a one-dimensional labeled array (ndarray) capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
numpy.ndarray
An array object represents a multidimensional, homogeneous array of fixed-size items.
DataFrame
pandas.DataFrame(data=None, index: Optional[Collection] = None, columns: Optional[Collection] = None, dtype: Union[str, numpy.dtype, ExtensionDtype, None] = None, copy: bool = False)
Parameters
data
ndarray (structured or homogeneous), Iterable, dict, or DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
MySQL Workbench Shortcuts on Mac
https://dev.mysql.com/doc/workbench/en/wb-keys.html
Execute statements
Modifier+Shift+Return
Comment/Uncomment lines of SQL
Modifier+/
Beautify Query
Modifier+B
New Tab
Modifier+T
Close Tab
Modifier+W
Switch the tabs
Ctrl+Tab
Python itertools
import itertools
https://www.geeksforgeeks.org/python-itertools/
is a module that provides various functions that work on iterators to produce complex iterators
Combinatoric iterators
Quick examples
https://docs.python.org/3/library/itertools.html
product('ABCD', repeat=2) → AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD
permutations('ABCD', 2) → AB AC AD BA BC BD CA CB CD DA DB DC
combinations('ABCD', 2) → AB AC AD BC BD CD
combinations_with_replacement('ABCD', 2) → AA AB AC AD BB BC BD CC CD DD
.product(*iterables, repeat=1)
https://www.geeksforgeeks.org/python-itertools-product/
This method computes the cartesian product of input iterables.
Cartesian Product of two sets is defined as the set of all ordered pairs (a, b) where a belongs to A and b belongs to B.
Arguments
product(arr, repeat=3) means the same as product(arr, arr, arr).
You can also pass several distinct iterables, e.g. product(arr1, arr2, arr3).
e.g.)
from itertools import product
print(list(product(['C', 'B', 'A'], ['2', '1'])))
# output: [('C', '2'), ('C', '1'), ('B', '2'), ('B', '1'), ('A', '2'), ('A', '1')]
.permutations(iterable, r=None)
https://www.geeksforgeeks.org/python-itertools-permutations/
This method generates all possible permutations of an iterable.
All elements are treated as unique based on their position, not their value. As the word “permutation” suggests, it generates all the possible orderings in which a set or string can be arranged.
Arguments
r
length of permutation needed
e.g.)
from itertools import permutations
print(list(permutations(['C', 'B', '1'], r=2)))
# output: [('C', 'B'), ('C', '1'), ('B', 'C'), ('B', '1'), ('1', 'C'), ('1', 'B')]
.combinations(iterable, r)
https://www.geeksforgeeks.org/python-itertools/
This method generates all the possible combinations (without replacement) of the iterable's elements.
Arguments
r
length of combination needed
e.g.)
from itertools import combinations
print(list(combinations(['C', 'B', '1'], r=2)))
# output: [('C', 'B'), ('C', '1'), ('B', '1')]
.combinations_with_replacement(iterable, r)
https://www.geeksforgeeks.org/python-itertools-combinations_with_replacement/
e.g.)
from itertools import combinations_with_replacement
print(list(combinations_with_replacement(['C', 'B', '1'], r=2)))
# output: [('C', 'C'), ('C', 'B'), ('C', '1'), ('B', 'B'), ('B', '1'), ('1', '1')]
!= operation vs “is not” in Python
Rationale: Two objects have the exact same data, but are not identical. (They are not the same object in memory.) Example: Strings
>>> greeting = "It's a beautiful day in the neighbourhood."
>>> a = unicode(greeting)  # Python 2; in Python 3, any two equal but distinct objects behave the same way
>>> b = unicode(greeting)
>>> a is b
False
>>> a == b
True
https://stackoverflow.com/questions/2209755/python-operation-vs-is-not
Function vs Method in Python
A method is a type of function. The simplest function is a free function: it is not attached to a class. A method is a member function.
Functions
A function is a block of code that is called by its name (it is independent).
The function can have different parameters or may not have any at all. If any data (parameters) are passed, they are passed explicitly.
It may or may not return any data.
Function does not deal with Class and its instance concept.
Method
A method is also called by its name, but it is associated with an object (it is dependent).
A method is implicitly passed the object on which it is invoked.
It may or may not return any data.
A method can operate on the data (instance variables) that is contained by the corresponding class.
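A tiny sketch of the distinction (the names here are illustrative):
def greet(name):  # free function: called by its name, data passed explicitly
    return f"Hello, {name}"

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):  # method: implicitly receives the instance it is invoked on
        return f"Hello, {self.name}"

print(greet("Wonbin"))            # function call
print(Greeter("Wonbin").greet())  # method call on an object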
https://www.geeksforgeeks.org/difference-method-function-python/ https://www.quora.com/Whats-the-difference-between-a-method-and-function-in-Python https://data-flair.training/blogs/python-method-and-function/
Python Code Formatter and Beautifier
https://codebeautify.org/python-formatter-beautifier#
https://github.com/psf/black
How to Convert ipynb file to py file
Method 1
Converting an ipynb file to a py file with the ‘nbconvert’ Python module
Code
!jupyter nbconvert --to script my_julia_notebook.ipynb
You can execute terminal commands in notebook cells by prepending an exclamation point/bang (!) to the command.
The ‘nbconvert’ Python module is for converting notebooks to other formats.
This converts an ipynb file to a py file with the same name in the current working directory.
Method 2 (Preferred)
Writing the contents of the cell to a file with an IPython magic function
Magic functions
IPython has a set of predefined ‘magic functions’ that you can call with a command line style syntax.
http://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions
Code
%%writefile filename.py
write this at the beginning of the cell and run it.
https://stackoverflow.com/questions/37797709/convert-json-ipython-notebook-ipynb-to-py-file https://nbconvert.readthedocs.io/en/latest/usage.html#convert-script https://nbconvert.readthedocs.io/en/latest/index.html https://blueriver97.tistory.com/45 http://ipython.org/ipython-doc/dev/interactive/magics.html#cellmagic-writefile
Probability Distributions
Discrete probability distributions
Poisson distribution
The Poisson distribution is popular for modeling the number of times an event occurs in an interval of time or space.
e.g.)
The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
The number of meteorites greater than 1 meter diameter that strike Earth in a year
The number of patients arriving in an emergency room between 10 and 11 pm
The number of laser photons hitting a detector in a particular time interval
Probability mass function (PMF)
P(X = k) = λᵏ e^(−λ) / k!
k is the number of times an event occurs in an interval and k can take values 0, 1, 2, ....
λ is the average number of events per interval
Expectation and variance
E[X] = λ, Var(X) = λ
Assumptions
The events occur with a known constant mean rate,
and independently of the time since the last event.
Continuous probability distributions
Exponential distribution
The exponential distribution is popular for modeling the time between events in a process in which events occur continuously and independently at a constant average rate (<=> a Poisson point process).
e.g.)
The time until a radioactive particle decays, or the time between clicks of a Geiger counter
The time it takes before your next telephone call
The time until default (on payment to company debt holders) in reduced form credit risk modeling
Probability density function (PDF)
f(x; λ) = λ e^(−λx) for x ≥ 0 (and 0 for x < 0)
x is the time between events
λ is the average number of events per interval
Expectation and variance
E[X] = 1/λ, Var(X) = 1/λ²
Memoryless property
If T is conditioned on a failure to observe the event over some initial period of time s, the distribution of the remaining waiting time is the same as the original unconditional distribution.
For example, if an event has not occurred after 30 seconds, the conditional probability that occurrence will take at least 10 more seconds is equal to the unconditional probability of observing the event more than 10 seconds after the initial time.
Poisson point process
is a sequence of events occurring over time where these time intervals are independent random variables having exponential distributions with parameter λ.
The expected waiting time between two events in a Poisson process is 1/λ (= the mean of an exponential distribution)
The expected number of events occurring within a fixed time interval of length t is λt (= the mean of the Poisson distribution)
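A quick NumPy sketch of this relationship (the rate and time horizon are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.0, 10_000.0  # rate λ (events per unit time) and time horizon

# Simulate a Poisson process as a running sum of exponential inter-arrival times.
gaps = rng.exponential(scale=1 / lam, size=int(lam * t * 2))
arrivals = np.cumsum(gaps)

print(gaps.mean())               # ≈ 1/λ = 0.5, the mean waiting time
print((arrivals < t).sum() / t)  # ≈ λ = 2, events per unit time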
https://en.wikipedia.org/wiki/Poisson_distribution https://en.wikipedia.org/wiki/Exponential_distribution
Confusion Matrix
Confusion Matrix
A table that represents the performance of an algorithm in binary classification.
                  Predicted positive    Predicted negative
Actual positive   True Positive (TP)    False Negative (FN)
Actual negative   False Positive (FP)   True Negative (TN)
Accuracy
One simple way of measuring performance is Accuracy: the proportion of individuals who were correctly classified, i.e. the True Positives and True Negatives divided by the total.
Sensitivity and Specificity
Why sensitivity and specificity? (issues of accuracy)
Accuracy is helpful for sure, but sometimes it matters more whether we get the Positives or the Negatives right. In fraud detection, for example, it may be worth annoying a few customers to make sure no thieves get away.
Another issue is we can generally increase one simply by decreasing the other. This may have important implications but the overall Accuracy rate won’t change.
Or worse, we could improve overall Accuracy just by making the test more able to find the more common category.
So a better approach is to look at the accuracy for Positives and Negatives separately. These two values are called Sensitivity and Specificity.
Sensitivity, recall, hit rate, or true positive rate (TPR)
= TP/P
‘P’: actual positive (=TP+FN)
Specificity, selectivity or true negative rate (TNR)
= TN/N
‘N’: actual negative (=TN+FP)
Precision or positive predictive value (PPV)
= TP/(TP+FP)
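These quantities are easy to compute with scikit-learn; a minimal sketch with made-up labels:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 3 1 1 3
print(recall_score(y_true, y_pred))     # sensitivity = TP/P = 3/4
print(tn / (tn + fp))                   # specificity = TN/N = 3/4
print(precision_score(y_true, y_pred))  # precision = TP/(TP+FP) = 3/4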
https://www.edureka.co/blog/interview-questions/data-science-interview-questions/ https://en.wikipedia.org/wiki/Confusion_matrix https://www.theanalysisfactor.com/sensitivity-and-specificity/ https://en.wikipedia.org/wiki/Precision_and_recall
Data Scientist Interview Questions
BASIC DATA SCIENCE
Q1. What is Data Science? List the differences between supervised and unsupervised learning.
Data Science is a multi-disciplinary field that uses statistics, data analysis and machine learning to discover hidden patterns and extract knowledge and insights from raw data.
https://en.wikipedia.org/wiki/Data_science
Supervised Learning
Input data is labelled.
Used for prediction (Classification and Regression)
Unsupervised Learning
Input data is unlabelled.
Used for analysis (Clustering, Density Estimation, and Dimensionality Reduction)
Q3. What is bias-variance trade-off? Q17. What are the differences between over-fitting and under-fitting?
bias-variance trade-off
Predictive models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
Bias
is an error from erroneous assumptions in the learning algorithm. (e.g. too simple algorithm)
It can lead to underfitting.
Variance
is an error from sensitivity to small fluctuations in the training set. (e.g. too complex algorithm)
It can lead to overfitting.
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff https://wonbinmachinelearning.tumblr.com/search/overfitting
[Figure: error versus model complexity — bias falls and variance rises as complexity grows; total error is minimised at the trade-off point.]
Q18. How to combat Overfitting and Underfitting?
Underfitting
Try more complex algorithms
Overfitting
Regularization
Q4. What is a confusion matrix?
A table that represents the performance of an algorithm in binary classification.
https://en.wikipedia.org/wiki/Confusion_matrix https://www.theanalysisfactor.com/sensitivity-and-specificity/
                  Predicted positive    Predicted negative
Actual positive   True Positive (TP)    False Negative (FN)
Actual negative   False Positive (FP)   True Negative (TN)
Sensitivity, recall, hit rate, or true positive rate (TPR)
= TP/P
‘P’: actual positive (=TP+FN)
Specificity, selectivity or true negative rate (TNR)
= TN/N
‘N’: actual negative (=TN+FP)
Precision or positive predictive value (PPV)
= TP/(TP+FP)
STATISTICS
Q6. What do you understand by the term Normal Distribution?
Bell-shaped
Symmetrical
Unimodal(one mode)
Mean, Mode, and Median are all located in the center
Asymptotic
Q7. What is correlation and covariance in statistics?
Covariance
is the joint variability of two random variables.
If X tends to increase as Y increases, then Cov(X,Y) > 0.
If X tends to decrease as Y increases, then Cov(X,Y) < 0.
Correlation
is any statistical relationship between two random variables.
If X tends to increase as Y increases, then Corr(X,Y) > 0.
If X tends to decrease as Y increases, then Corr(X,Y) < 0.
−1 ≤ Corr(X,Y) ≤ 1
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)
https://en.wikipedia.org/wiki/Covariance https://en.wikipedia.org/wiki/Correlation_and_dependence
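A quick NumPy check of the two quantities (the data is random, for illustration only):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y tends to increase with x

print(np.cov(x, y)[0, 1])       # covariance > 0
print(np.corrcoef(x, y)[0, 1])  # correlation in (0, 1], here around 0.9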
Q8. What is the difference between Point Estimates and Confidence Interval?
Point Estimation gives us a particular value as an estimate of a population parameter.
Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
A confidence interval gives us a range of values which is likely to contain the population parameter.
Q9. What is the goal of A/B Testing?
A/B testing is a randomized experiment with two variants, A and B. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective.
https://en.wikipedia.org/wiki/A/B_testing
Q10. What is p-value?
In statistical hypothesis testing, the p-value is a probability that helps to determine whether the null hypothesis should be rejected.
If the p-value is lower than a predetermined significance level (alpha), then reject the null hypothesis.
https://en.wikipedia.org/wiki/P-value https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8
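A minimal SciPy sketch (the sample data and the significance level are illustrative):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=100)

# H0: the population mean is 0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
print(p_value < 0.05)  # if True, reject H0 at the 5% significance level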
Q11. In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the proba­bility that you see at least one shooting star in the period of an hour?
Q13. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
Q14. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?
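For reference, here is one standard way to work these three out (my own worked solutions, not from the original post):
Q11: P(no star in 15 minutes) = 0.8, so P(no star in an hour) = 0.8⁴ = 0.4096, and P(at least one star) = 1 − 0.4096 = 0.5904.
Q13: The equally likely two-child orderings with at least one girl are GG, GB and BG, so P(two girls) = 1/3.
Q14: P(double-headed | 10 heads) = (1/1000) / ((1/1000) + (999/1000)(1/2)¹⁰) = 1024/2023, so P(next toss is a head) = (1024/2023)·1 + (999/2023)·(1/2) = 3047/4046 ≈ 0.75.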
Q19. What is regularisation? Why is it useful?
Regularization is a technique used to avoid the overfitting problem (see the sketch after the list of methods below).
https://datanice.github.io/machine-learning-101-what-is-regularization-interactive.html
Methods
Weight decay
Dropout
Early stopping
Batch normalization in NN
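As a concrete instance, a rough sketch of L2 regularization (weight decay) using scikit-learn's Ridge regression (the data and alpha are arbitrary):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=30, noise=10, random_state=0)

# alpha controls the strength of the L2 penalty on the coefficients.
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(np.linalg.norm(plain.coef_), np.linalg.norm(ridge.coef_))  # ridge shrinks the weights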
Q20. What Is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times: the sample mean, the sample variance and the sample standard deviation converge to the quantities they are estimating.
Q21.  What Are Confounding Variables?
In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
https://en.wikipedia.org/wiki/Confounding
Tumblr media
Q25. Explain how a ROC curve works?
A ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold is varied; the area under the curve (AUC) summarises how well the classifier separates the two classes.
Central limit theorem
Let {X1, …, Xn} be a random sample of size n, that is, a sequence of independent and identically distributed (i.i.d.) random variables with mean µ and variance σ².
For large enough n (> 30), the distribution of the sample mean Sn is close to the normal distribution with mean µ and variance σ²/n.
Sn ≈ N(µ, σ²/n); equivalently, √n (Sn − µ) converges in distribution to N(0, σ²).
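A quick NumPy simulation of the theorem (the exponential population and the sample sizes are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 10_000

# Population: exponential with mean 1 and variance 1 (clearly non-normal).
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(means.mean())  # ≈ µ = 1
print(means.var())   # ≈ σ²/n = 1/50 = 0.02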
DATA ANALYSIS
Q
MACHINE LEARNING
Q
DEEP LEARNING
Q
https://www.edureka.co/blog/interview-questions/data-science-interview-questions/