Wonbin Data Science
Machine Learning, Statistics and Programming
Clustering
K-means
https://scikit-learn.org/stable/modules/clustering.html#k-means
This algorithm requires the number of clusters to be specified.
The K-means algorithm aims to choose centroids (mean values) that minimise the inertia, or within-cluster sum-of-squares criterion:
inertia = Σᵢ min_{µj ∈ C} ‖xi − µj‖²  (summing over all n samples, where C is the set of centroids)
Note that centroids are not, in general, points from X, although they live in the same space.
Inertia can be recognized as a measure of how internally coherent clusters are.
Inertia suffers from various drawbacks:
Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.
Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
The algorithm has three steps.
The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X.
After initialization, K-means consists of looping between the two other steps. The first step assigns each sample to its nearest centroid.
The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.
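As a concrete illustration of these three steps, here is a minimal NumPy sketch of the k-means loop (the function name and the tolerance value are my own choices for illustration, not scikit-learn's implementation):
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k samples from the dataset X as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples.
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move significantly.
        if np.linalg.norm(centroids_new - centroids) < tol:
            break
        centroids = centroids_new
    return labels, centroids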
Advantages and disadvantages
https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages
Advantages
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages
Choosing k manually.
Being dependent on initial values.
Clustering data of varying sizes and density.
Clustering outliers.
Scaling with number of dimensions.
Evaluation
https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
Elbow method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids.
(The graph below shows that k=2 is not a bad choice.)
[Figure: SSE versus k; the curve bends at the “elbow” around k=2.]
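A common way to produce this kind of plot with scikit-learn is to fit KMeans for a range of k and record the inertia_ attribute (the SSE); a rough sketch on synthetic data:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
ks = range(1, 10)
sse = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE (inertia)')
plt.show()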
DBSCAN
https://scikit-learn.org/stable/modules/clustering.html#dbscan
Clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.
There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
While the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), the parameter eps is crucial to choose appropriately for the data set and distance function, and usually cannot be left at the default value.
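A minimal usage sketch with scikit-learn (the eps and min_samples values here are illustrative, not recommendations):
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # noise points get the label -1, e.g. [ 0  0  0  1  1 -1]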
AdaBoost vs Gradient boosting
https://www.quora.com/What-is-the-difference-between-gradient-boosting-and-adaboost
Both are boosting algorithms which means that they convert a set of weak learners into a single strong learner. They both initialize a strong learner (usually a decision tree) and iteratively create a weak learner that is added to the strong learner. They differ on how they create the weak learners during the iterative process.
At each iteration, adaptive boosting changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases those of the correctly predicted instances. The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (the so-called alpha weight). The better it performs, the more it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training on a newly weighted sample distribution, the weak learner trains on the remaining errors (the so-called pseudo-residuals) of the strong learner. It is another way to give more importance to the difficult instances. At each iteration, the pseudo-residuals are computed and a weak learner is fitted to them. Then, the contribution of the weak learner (the so-called multiplier) to the strong one isn’t computed according to its performance on a re-weighted sample, but through a gradient descent optimization process: the computed contribution is the one minimizing the overall error of the strong learner.
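Both are available in scikit-learn; a minimal sketch comparing them on the same synthetic data (the hyperparameters are arbitrary, for illustration only):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for Model in (AdaBoostClassifier, GradientBoostingClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print(Model.__name__, clf.score(X_te, y_te))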
DBMS(Database Management System) Terms
What are Keys?
https://www.guru99.com/dbms-keys.html
A DBMS key is an attribute or set of attributes that helps you identify a row (tuple) in a relation (table).
What is an Entity?
https://www.tutorialspoint.com/dbms/er_model_basic_concepts.htm
An entity can be a real-world object, either animate or inanimate, that can be easily identifiable. For example, in a school database, students, teachers, classes, and courses offered can be considered as entities. All these entities have some attributes or properties that give them their identity.
What is ER(Entity Relationship) model?
https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
An ER model is an abstract data model that defines a data or information structure which can be implemented in a database, typically a relational database.
What is ER Diagram(ERD)?
https://www.lucidchart.com/pages/er-diagrams
An Entity Relationship (ER) Diagram is a type of flowchart that illustrates how “entities” such as people, objects or concepts relate to each other within a system.
ER Diagrams are most often used to design or debug relational databases in the fields of software engineering, business information systems, education and research.
Principal Component Analysis (PCA)
https://ko.wikipedia.org/wiki/%EC%A3%BC%EC%84%B1%EB%B6%84_%EB%B6%84%EC%84%9D
Principal component analysis (PCA) is a technique that reduces high-dimensional data to lower-dimensional data.
It uses an orthogonal transformation to convert samples from a high-dimensional space, where the variables may be correlated, into samples in a lower-dimensional space (the principal components) that are linearly uncorrelated. The dimension of the principal components is less than or equal to that of the original samples.
PCA linearly transforms the data into a new coordinate system such that the axis along which the projected data has the greatest variance becomes the first principal component, and the axis with the second-greatest variance becomes the second principal component.
Decomposing the data into the components that best explain its variation in this way enables many applications. The transformation is defined so that the first principal component has the largest possible variance, and each succeeding component in turn has the largest possible variance under the constraint that it is orthogonal to the preceding components. The principal components end up orthogonal because they are eigenvectors of the covariance matrix.
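A minimal scikit-learn sketch of this idea (the data and the number of components are arbitrary choices):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated 5-D data

pca = PCA(n_components=2)
Z = pca.fit_transform(X)  # project onto the first two principal components
print(pca.explained_variance_ratio_)  # fraction of variance each component explains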
Variance
What is Variance?
https://en.wikipedia.org/wiki/Variance
Informally, it measures how far a set of (random) numbers is spread out from its average value.
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. 
Bias
What is Bias?
https://en.wikipedia.org/wiki/Bias_of_an_estimator
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated.
Random Forest
Process
https://scikit-learn.org/stable/modules/ensemble.html#forest
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a random sample drawn with replacement from the training set.
Why random sampling with replacement?
https://stats.stackexchange.com/questions/447630/why-do-we-use-random-sample-with-replacement-while-implementing-random-forest
There is a theoretical foundation showing that sampling with replacement and then building an ensemble reduces the variance of the forest without increasing the bias. The same theoretical property does not hold if you sample without replacement, because sampling without replacement would lead to fairly high variance.
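A quick sketch of what a bootstrap sample looks like, together with the corresponding scikit-learn switch (bootstrap=True is the default for random forests):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
idx = rng.integers(0, 10, size=10)  # draw 10 row indices with replacement
print(sorted(idx))  # duplicates appear; some rows are left out ("out-of-bag")

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)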
Pandas Tutorial
10 minutes to pandas
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
import
import numpy as np
import pandas as pd
Object creation
s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
Viewing data
df.head()
df.tail(3)
df.index
df.columns
df.to_numpy()
df.describe()
df.T
Transposing your data
df.sort_index(axis=1, ascending=False)
Sorting by an axis
df.sort_values(by='B')
Selection
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
We recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
Not recommended
Selecting columns
df['A']
df[['A', 'B']]
Selecting rows 
df[:3]
df['20130102':'20130104']
Recommended
Quick intro
df.loc[row_indexer]
df.loc[row_indexer, column_indexer]
df.iloc[row_position]
df.iloc[row_position, column_position]
Selection by label
pandas.DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
e.g.
df.loc[dates[0]]
df.loc[:, ['A', 'B']]
df3.loc['20200606':'20200608', 'B':'C']
pandas.DataFrame.at
Access a single value for a row/column label pair.
Similar to loc, but faster
e.g.
df.at['20200606', 'A']
Selection by position
pandas.DataFrame.iloc
Purely integer-location based indexing for selection by position.
e.g.
df.iloc[3]
df.iloc[3:5, 0:2]
pandas.DataFrame.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, but faster
e.g.
df.iat[3,2]
Boolean indexing
What does indexing mean?
https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
df[df > 0]
df[df['A'] > 0]
df2[df2['E'].isin(['two', 'four'])]
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses
Setting
Setting a new column
It automatically aligns the data by the indexes.
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=dates)
df['F'] = s1
Setting values by label
df3.at['20200605','A'] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
Setting values by position
df.iat[0, 1] = 0
Setting values with where operation
df2[df2 > 0] = -df2
Missing data
pandas primarily uses the value np.nan to represent missing data.
To drop any rows that have missing data
df1.dropna(how='any')
Filling missing data
df1.fillna(value=5)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df1.fillna(value=values)
To get the boolean mask where values are nan.
pd.isna(df1)
Operations
Stats
df.mean()
df.mean(1)
Same operation on the other axis
Apply
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
String Methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
Merge
Concat
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging
Concatenating pandas objects together with concat() 
pd.concat([df1, df2, df3])
<=> df1.append([df2, df3])
A useful shortcut to concat() are the append() instance methods on Series and DataFrame.
Join
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join
SQL style merges
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
pd.merge(left=df1, right=df2, how='left', on='key')
DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
df1.merge(right=df2, how='inner', on='key')
Grouping
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby
“group by” involves one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
df.groupby('A').sum()
df.groupby(['A', 'B']).sum()
Reshaping
Stack
Pivot tables
Time series
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
years = pd.period_range('2010-01-01', '2015-01-01', freq='A')
years.asfreq('M', how='S')
Categoricals
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']}) df["grade"] = df["raw_grade"].astype("category") df["grade"].cat.categories = ["very good", "good", "very bad"]
Series.cat()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.html
Accessor object for categorical properties of the Series values.
s.cat.categories
s.cat.categories = list('abc')
s.cat.rename_categories(list('cba'))
s.cat.rename_categories({'a': 'A', 'b': 'B', 'c': 'C'})
s.cat.rename_categories(lambda x: x.upper())
and so on
Plotting
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.cumsum().plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
plt.figure()
df.cumsum().plot()
plt.legend(loc='best')
Getting data in/out
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
Gotchas
Intro to data structures
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dsintro
Series
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Series is a one-dimensional labeled array (ndarray) capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
numpy.ndarray
An array object represents a multidimensional, homogeneous array of fixed-size items.
DataFrame
pandas.DataFrame(data=None, index: Optional[Collection] = None, columns: Optional[Collection] = None, dtype: Union[str, numpy.dtype, ExtensionDtype, None] = None, copy: bool = False)
Parameters
data
ndarray (structured or homogeneous), Iterable, dict, or DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
MySQL Workbench Shortcuts on Mac
https://dev.mysql.com/doc/workbench/en/wb-keys.html
Execute statements
Modifier+Shift+Return
Comment/Uncomment lines of SQL
Modifier+/
Beautify Query
Modifier+B
New Tab
Modifier+T
Close Tab
Modifier+W
Switch the tabs
Ctrl+Tab
Python itertools
import itertools
https://www.geeksforgeeks.org/python-itertools/
is a module that provides various functions that work on iterators to produce complex iterators
Combinatoric iterators
Quick examples
https://docs.python.org/3/library/itertools.html
product('ABCD', repeat=2) → AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD
permutations('ABCD', 2) → AB AC AD BA BC BD CA CB CD DA DB DC
combinations('ABCD', 2) → AB AC AD BC BD CD
combinations_with_replacement('ABCD', 2) → AA AB AC AD BB BC BD CC CD DD
.product(*iterables, repeat=1)
https://www.geeksforgeeks.org/python-itertools-product/
This method computes the cartesian product of input iterables.
Cartesian Product of two sets is defined as the set of all ordered pairs (a, b) where a belongs to A and b belongs to B.
Arguments
product(arr, repeat=3) means the same as product(arr, arr, arr).
You can also pass several distinct iterables, e.g. product(arr1, arr2, arr3).
e.g.)
from itertools import product
print(list(product(['C', 'B', 'A'], ['2', '1'])))
# output: [('C', '2'), ('C', '1'), ('B', '2'), ('B', '1'), ('A', '2'), ('A', '1')]
.permutations(iterable, r=None)
https://www.geeksforgeeks.org/python-itertools-permutations/
This method generates all possible permutations of an iterable.
All elements are treated as unique based on their position, not their value. As the word “permutation” suggests, it generates all the possible orderings in which a set or string can be arranged.
Arguments
r
length of permutation needed
e.g.)
from itertools import permutations
print(list(permutations(['C', 'B', '1'], r=2)))
# output: [('C', 'B'), ('C', '1'), ('B', 'C'), ('B', '1'), ('1', 'C'), ('1', 'B')]
.combinations(iterable, r)
https://www.geeksforgeeks.org/python-itertools/
This method generates all the possible combinations (without replacement) of the iterable's elements.
Arguments
r
length of combination needed
e.g.)
from itertools import combinations
print(list(combinations(['C', 'B', '1'], r=2)))
# output: [('C', 'B'), ('C', '1'), ('B', '1')]
.combinations_with_replacement(iterable, r)
https://www.geeksforgeeks.org/python-itertools-combinations_with_replacement/
e.g.)
from itertools import combinations_with_replacement
print(list(combinations_with_replacement(['C', 'B', '1'], r=2)))
# output: [('C', 'C'), ('C', 'B'), ('C', '1'), ('B', 'B'), ('B', '1'), ('1', '1')]
!= operation vs “is not” in Python
Rationale: Two objects have the exact same data, but are not identical. (They are not the same object in memory.) Example: Strings
>>> greeting = "It's a beautiful day in the neighbourhood."
>>> a = unicode(greeting)  # Python 2; in Python 3, any two equal but distinct objects behave the same way
>>> b = unicode(greeting)
>>> a is b
False
>>> a == b
True
https://stackoverflow.com/questions/2209755/python-operation-vs-is-not
Function vs Method in Python
A method is a type of function. The simplest function is a free function: it is not attached to a class. A method is a member function.
Functions
A function is a block of code that is called by its name (it is independent).
The function can have different parameters or may not have any at all. If any data (parameters) are passed, they are passed explicitly.
It may or may not return any data.
Function does not deal with Class and its instance concept.
Method
A method is also called by its name, but it is associated with an object (it is dependent).
A method is implicitly passed the object on which it is invoked.
It may or may not return any data.
A method can operate on the data (instance variables) that is contained by the corresponding class.
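A tiny sketch of the distinction (the names here are illustrative):
def greet(name):  # free function: called by its name, data passed explicitly
    return f"Hello, {name}"

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):  # method: implicitly receives the instance it is invoked on
        return f"Hello, {self.name}"

print(greet("Wonbin"))            # function call
print(Greeter("Wonbin").greet())  # method call on an object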
https://www.geeksforgeeks.org/difference-method-function-python/ https://www.quora.com/Whats-the-difference-between-a-method-and-function-in-Python https://data-flair.training/blogs/python-method-and-function/
Python Code Formatter and Beautifier
https://codebeautify.org/python-formatter-beautifier#
https://github.com/psf/black
How to Convert ipynb file to py file
Method 1
Converting an ipynb file to a py file with the ‘nbconvert’ Python module
Code
!jupyter nbconvert --to script my_julia_notebook.ipynb
You can execute terminal commands in notebook cells by prepending an exclamation point/bang (!) to the command.
The ‘nbconvert’ Python module is for converting notebooks to other formats.
This converts an ipynb file to a py file with the same name in the current working directory.
Method 2 (Preferred)
Writing the contents of the cell to a file with an IPython magic function
Magic functions
IPython has a set of predefined ‘magic functions’ that you can call with a command line style syntax.
http://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions
Code
%%writefile filename.py
write this at the beginning of the cell and run it.
https://stackoverflow.com/questions/37797709/convert-json-ipython-notebook-ipynb-to-py-file https://nbconvert.readthedocs.io/en/latest/usage.html#convert-script https://nbconvert.readthedocs.io/en/latest/index.html https://blueriver97.tistory.com/45 http://ipython.org/ipython-doc/dev/interactive/magics.html#cellmagic-writefile
Probability Distributions
Discrete probability distributions
Poisson distribution
The Poisson distribution is popular for modeling the number of times an event occurs in an interval of time or space.
e.g.)
The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
The number of meteorites greater than 1 meter diameter that strike Earth in a year
The number of patients arriving in an emergency room between 10 and 11 pm
The number of laser photons hitting a detector in a particular time interval
Probability mass function (PMF)
P(X = k) = λᵏ e^(−λ) / k!
k is the number of times an event occurs in an interval and k can take values 0, 1, 2, ....
λ is the average number of events per interval
Expectation and variance
E[X] = λ, Var(X) = λ
Assumptions
The events occur with a known constant mean rate,
and independently of the time since the last event.
Continuous probability distributions
Exponential distribution
The exponential distribution is popular for modeling the time between events in a process in which events occur continuously and independently at a constant average rate (<=> a Poisson point process).
e.g.)
The time until a radioactive particle decays, or the time between clicks of a Geiger counter
The time it takes before your next telephone call
The time until default (on payment to company debt holders) in reduced form credit risk modeling
Probability density function (PDF)
f(x; λ) = λ e^(−λx) for x ≥ 0 (and 0 for x < 0)
x is the time between events
λ is the average number of events per interval
Expectation and variance
E[X] = 1/λ, Var(X) = 1/λ²
Memoryless property
If T is conditioned on a failure to observe the event over some initial period of time s, the distribution of the remaining waiting time is the same as the original unconditional distribution.
For example, if an event has not occurred after 30 seconds, the conditional probability that occurrence will take at least 10 more seconds is equal to the unconditional probability of observing the event more than 10 seconds after the initial time.
Poisson point process
is a sequence of events occurring over time where these time intervals are independent random variables having exponential distributions with parameter λ.
The expected waiting time between two events in a Poisson process is 1/λ (= the mean of an exponential distribution)
The expected number of events occurring within a fixed time interval of length t is λt (= the mean of the Poisson distribution)
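A quick NumPy sketch of this relationship (the rate and time horizon are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.0, 10_000.0  # rate λ (events per unit time) and time horizon

# Simulate a Poisson process as a running sum of exponential inter-arrival times.
gaps = rng.exponential(scale=1 / lam, size=int(lam * t * 2))
arrivals = np.cumsum(gaps)

print(gaps.mean())               # ≈ 1/λ = 0.5, the mean waiting time
print((arrivals < t).sum() / t)  # ≈ λ = 2, events per unit time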
https://en.wikipedia.org/wiki/Poisson_distribution https://en.wikipedia.org/wiki/Exponential_distribution
Confusion Matrix
Confusion Matrix
A table that represents the performance of an algorithm in binary classification.
                  Predicted positive    Predicted negative
Actual positive   True Positive (TP)    False Negative (FN)
Actual negative   False Positive (FP)   True Negative (TN)
Accuracy
One simple way of measuring performance is Accuracy: the proportion of individuals who were correctly classified, i.e. the True Positives and True Negatives divided by the total.
Sensitivity and Specificity
Why sensitivity and specificity? (issues of accuracy)
Accuracy is helpful for sure, but sometimes it matters more whether we get the Positives or the Negatives right. In fraud detection, for example, it may be worth annoying a few customers to make sure no thieves get away.
Another issue is we can generally increase one simply by decreasing the other. This may have important implications but the overall Accuracy rate won’t change.
Or worse, we could improve overall Accuracy just by making the test more able to find the more common category.
So a better approach is to look at the accuracy for Positives and Negatives separately. These two values are called Sensitivity and Specificity.
Sensitivity, recall, hit rate, or true positive rate (TPR)
= TP/P
‘P’: actual positive (=TP+FN)
Specificity, selectivity or true negative rate (TNR)
= TN/N
‘N’: actual negative (=TN+FP)
Precision or positive predictive value (PPV)
= TP/(TP+FP)
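These quantities are easy to compute with scikit-learn; a minimal sketch with made-up labels:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 3 1 1 3
print(recall_score(y_true, y_pred))     # sensitivity = TP/P = 3/4
print(tn / (tn + fp))                   # specificity = TN/N = 3/4
print(precision_score(y_true, y_pred))  # precision = TP/(TP+FP) = 3/4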
https://www.edureka.co/blog/interview-questions/data-science-interview-questions/ https://en.wikipedia.org/wiki/Confusion_matrix https://www.theanalysisfactor.com/sensitivity-and-specificity/ https://en.wikipedia.org/wiki/Precision_and_recall
Data Scientist Interview Questions
BASIC DATA SCIENCE
Q1. What is Data Science? List the differences between supervised and unsupervised learning.
Data Science is a multi-disciplinary field that uses statistics, data analysis and machine learning to discover hidden patterns and extract knowledge and insights from raw data.
https://en.wikipedia.org/wiki/Data_science
Supervised Learning
Input data is labelled.
Used for prediction (Classification and Regression)
Unsupervised Learning
Input data is unlabelled.
Used for analysis (Clustering, Density Estimation, and Dimensionality Reduction)
Q3. What is bias-variance trade-off? Q17. What are the differences between over-fitting and under-fitting?
bias-variance trade-off
Predictive models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
Bias
is an error from erroneous assumptions in the learning algorithm. (e.g. too simple algorithm)
It can lead to underfitting.
Variance
is an error from sensitivity to small fluctuations in the training set. (e.g. too complex algorithm)
It can lead to overfitting.
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff https://wonbinmachinelearning.tumblr.com/search/overfitting
[Figure: error versus model complexity — bias falls and variance rises as complexity grows; total error is minimised at the trade-off point.]
Q18. How to combat Overfitting and Underfitting?
Underfitting
Try more complex algorithms
Overfitting
Regularization
Q4. What is a confusion matrix?
A table that represents the performance of an algorithm in binary classification.
https://en.wikipedia.org/wiki/Confusion_matrix https://www.theanalysisfactor.com/sensitivity-and-specificity/
                  Predicted positive    Predicted negative
Actual positive   True Positive (TP)    False Negative (FN)
Actual negative   False Positive (FP)   True Negative (TN)
Sensitivity, recall, hit rate, or true positive rate (TPR)
= TP/P
‘P’: actual positive (=TP+FN)
Specificity, selectivity or true negative rate (TNR)
= TN/N
‘N’: actual negative (=TN+FP)
Precision or positive predictive value (PPV)
= TP/(TP+FP)
STATISTICS
Q6. What do you understand by the term Normal Distribution?
Bell-shaped
Symmetrical
Unimodal(one mode)
Mean, Mode, and Median are all located in the center
Asymptotic
Q7. What is correlation and covariance in statistics?
Covariance
is the joint variability of two random variables.
If X tends to increase as Y increases, then Cov(X,Y) > 0.
If X tends to decrease as Y increases, then Cov(X,Y) < 0.
Correlation
is any statistical relationship between two random variables.
If X tends to increase as Y increases, then Corr(X,Y) > 0.
If X tends to decrease as Y increases, then Corr(X,Y) < 0.
−1 ≤ Corr(X,Y) ≤ 1
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)
https://en.wikipedia.org/wiki/Covariance https://en.wikipedia.org/wiki/Correlation_and_dependence
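A quick NumPy check of the two quantities (the data is random, for illustration only):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y tends to increase with x

print(np.cov(x, y)[0, 1])       # covariance > 0
print(np.corrcoef(x, y)[0, 1])  # correlation in (0, 1], here around 0.9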
Q8. What is the difference between Point Estimates and Confidence Interval?
Point Estimation gives us a particular value as an estimate of a population parameter.
Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
A confidence interval gives us a range of values which is likely to contain the population parameter.
Q9. What is the goal of A/B Testing?
A/B testing is a randomized experiment with two variants, A and B. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective.
https://en.wikipedia.org/wiki/A/B_testing
Q10. What is p-value?
In statistical hypothesis testing, the p-value is a probability that helps to determine whether the null hypothesis should be rejected.
If the p-value is lower than a predetermined significance level (alpha), then reject the null hypothesis.
https://en.wikipedia.org/wiki/P-value https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8
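A minimal SciPy sketch (the sample data and the significance level are illustrative):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=100)

# H0: the population mean is 0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
print(p_value < 0.05)  # if True, reject H0 at the 5% significance level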
Q11. In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the proba­bility that you see at least one shooting star in the period of an hour?
Q13. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
Q14. A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?
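For reference, here is one standard way to work these three out (my own worked solutions, not from the original post):
Q11: P(no star in 15 minutes) = 0.8, so P(no star in an hour) = 0.8⁴ = 0.4096, and P(at least one star) = 1 − 0.4096 = 0.5904.
Q13: The equally likely two-child orderings with at least one girl are GG, GB and BG, so P(two girls) = 1/3.
Q14: P(double-headed | 10 heads) = (1/1000) / ((1/1000) + (999/1000)(1/2)¹⁰) = 1024/2023, so P(next toss is a head) = (1024/2023)·1 + (999/2023)·(1/2) = 3047/4046 ≈ 0.75.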
Q19. What is regularisation? Why is it useful?
Regularization is a technique used to avoid the overfitting problem (see the sketch after the list of methods below).
https://datanice.github.io/machine-learning-101-what-is-regularization-interactive.html
Methods
Weight decay
Dropout
Early stopping
Batch normalization in NN
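As a concrete instance, a rough sketch of L2 regularization (weight decay) using scikit-learn's Ridge regression (the data and alpha are arbitrary):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=30, noise=10, random_state=0)

# alpha controls the strength of the L2 penalty on the coefficients.
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(np.linalg.norm(plain.coef_), np.linalg.norm(ridge.coef_))  # ridge shrinks the weights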
Q20. What Is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times: the sample mean, the sample variance and the sample standard deviation converge to the quantities they are estimating.
Q21.  What Are Confounding Variables?
In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
https://en.wikipedia.org/wiki/Confounding
Tumblr media
Q25. Explain how a ROC curve works?
A ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold is varied; the area under the curve (AUC) summarises how well the classifier separates the two classes.
Central limit theorem
Let {X1, …, Xn} be a random sample of size n, that is, a sequence of independent and identically distributed (i.i.d.) random variables with mean µ and variance σ².
For large enough n (> 30), the distribution of the sample mean Sn is close to the normal distribution with mean µ and variance σ²/n.
Sn ≈ N(µ, σ²/n); equivalently, √n (Sn − µ) converges in distribution to N(0, σ²).
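A quick NumPy simulation of the theorem (the exponential population and the sample sizes are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 10_000

# Population: exponential with mean 1 and variance 1 (clearly non-normal).
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(means.mean())  # ≈ µ = 1
print(means.var())   # ≈ σ²/n = 1/50 = 0.02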
DATA ANALYSIS
Q
MACHINE LEARNING
Q
DEEP LEARNING
Q
https://www.edureka.co/blog/interview-questions/data-science-interview-questions/