deeplearning - Improving Deep Neural Networks
EFFICIENT DATA SPLIT
A good practice is to split your entire data into 3 parts, namely:
Train set
Development set (also called hold-out cross validation set)
Test set
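A minimal sketch of such a split (assuming a NumPy feature matrix X and labels y; the 98/1/1 ratios below are only illustrative, since the right split depends on how much data you have):

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the data once, then carve out dev and test sets."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))
```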
BIAS VARIANCE TRADEOFF
Errors on these sets give us an estimate of bias and variance. Optimal (Bayes) error is the benchmark against which the train-set error is compared to judge bias; in many perception tasks, human-level error is used as a proxy for it.
If the train error is close to the optimal error, the model's bias is low. If the train error is close to the dev-set error, the model generalizes well and variance is low.
High bias, low variance: underfitting (weak model)
Low bias, high variance: overfitting (does not generalize)
Low bias, low variance: just right (best model)
High bias, high variance: both underfitting and overfitting (worst model)
INITIALIZATION
A well chosen initialization can:
Speed up the convergence of gradient descent
Increase the odds of gradient descent converging to a lower training (and generalization) error
For randomly initialized weights, the cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when log(a[3] )=log(0) , the loss goes to infinity.
Bench-marking different initialization methods:
3-layer NN with zeros initialization: 50% train accuracy (fails to break symmetry)
3-layer NN with large random initialization: 83% train accuracy (weights too large)
3-layer NN with He initialization: 99% train accuracy (recommended method)
He initialization uses a scaling factor of sqrt(2./layers_dims[l-1]), whereas Xavier initialization uses a scaling factor of sqrt(1./layers_dims[l-1]).
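A minimal sketch of He initialization in that notation (layers_dims is a list of layer sizes; swapping the factor to sqrt(1./fan_in) would give Xavier initialization):

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=3):
    """He initialization: scale random weights by sqrt(2 / fan_in)."""
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layers_dims)):
        parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2.0 / layers_dims[l - 1]))
        # Xavier initialization would use np.sqrt(1.0 / layers_dims[l - 1]) instead.
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```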
GRADIENT CHECKING
Steps to perform Gradient Checking:
Put all the parameters into one giant vector Θ, and concatenate the derivatives of all the weights/parameters into dΘ.
For every i, approximate dΘ_approx[i] with a two-sided difference (the limit definition of the derivative) and compute the normalized Euclidean distance between dΘ_approx and the dΘ computed by backprop.
If the distance is very small (on the order of 1e-7), the backpropagation implementation is almost certainly correct.
Use this only when debugging the code, and this doesn't work with dropout.
http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization
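A sketch of the check itself (assuming you already have a function that computes the cost J(θ) from the flattened parameter vector and the analytic gradient dΘ from backprop; the helper names here are illustrative):

```python
import numpy as np

def gradient_check(cost_fn, theta, grad_analytic, epsilon=1e-7):
    """Compare analytic gradients with two-sided numerical approximations."""
    grad_approx = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad_analytic - grad_approx)
    denominator = np.linalg.norm(grad_analytic) + np.linalg.norm(grad_approx)
    return numerator / denominator   # roughly 1e-7 or smaller suggests backprop is correct
```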
REGULARIZATION
The value of λ is a hyperparameter that you can tune using a dev set.
L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.
What is L2-regularization actually doing?:
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes. Weights end up smaller ("weight decay")
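As a sketch, the L2 term simply adds (λ/2m) times the sum of squared weights to the cost, and each dW picks up the matching (λ/m)·W "weight decay" term (the parameter names below are illustrative):

```python
import numpy as np

def compute_cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 penalty (lambda / 2m) * sum(W**2) over all weight matrices."""
    l2_term = 0.0
    for name, value in parameters.items():
        if name.startswith("W"):            # only weight matrices, not biases
            l2_term += np.sum(np.square(value))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term

# In backprop, each gradient gets the matching "weight decay" term:
# dW_l = dW_l_from_backprop + (lambd / m) * W_l
```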
DROPOUT
The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time. Apply dropout both during forward and backward propagation, and drop the same nodes during both propagation in one iteration.
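A sketch of inverted dropout applied to one layer's activations A (keep_prob is the probability of keeping a neuron; dividing by keep_prob keeps the expected activation unchanged, and the same mask D must be reused in backprop):

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, seed=1):
    """Randomly zero out units and rescale the survivors (inverted dropout)."""
    np.random.seed(seed)
    D = (np.random.rand(*A.shape) < keep_prob)   # dropout mask
    A = A * D / keep_prob                        # shut down units and rescale the rest
    return A, D                                  # reuse the same mask D in backprop
```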
Common mistakes:
Adding dropout to input layer and output layer. You should use dropout only on the middle layers
Using dropout for both training and testing. You should use dropout (randomly eliminate nodes) only in training
Statistics:
3-layer NN without regularization: 95% train accuracy, 91.5% test accuracy
3-layer NN with L2-regularization: 94% train accuracy, 93% test accuracy
3-layer NN with dropout: 93% train accuracy, 95% test accuracy
MINI BATCH GRADIENT DESCENT
Taking gradient steps with respect to all m examples on every step is called batch gradient descent. With mini-batch gradient descent, a single pass through the training set (i.e., one epoch) lets you take as many gradient descent steps as there are mini-batches.
Think of mini-batch gradient descent as a baby step of batch gradient descent.
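A sketch of how the training set can be shuffled and cut into mini-batches (shapes are assumed to be X: (n_features, m) and Y: (1, m)):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the examples, then cut them into mini-batches of the given size."""
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batches.append((X_shuffled[:, k:k + mini_batch_size],
                             Y_shuffled[:, k:k + mini_batch_size]))
    return mini_batches
```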
With mini-batch gradient descent, the cost trends downward but with some noise. That makes sense: mini-batch 1 may happen to fit the current parameters well, while mini-batch 2 contains harder or less consistent examples.
Mini-batch size is a hyper-parameter, and its value leads to three flavors of gradient descent:
Batch size = m (batch gradient descent): each iteration takes too long.
Batch size = 1 (stochastic gradient descent): loses the speedup from vectorization and never converges to the minimum, only oscillates around it.
Batch size somewhere in between (mini-batch gradient descent): takes advantage of vectorization, and each iteration is not too long.
GRADIENT DESCENT WITH MOMENTUM
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent. This method almost always works better than straight-forward gradient descent algorithm.
Compute exponentially weighted moving average of gradients, and use those gradients to update the weights (in the weight update step)
The ideal behavior of gradient descent would be slower learning in the vertical direction and faster learning in the horizontal direction.
(If β = 0, then it becomes standard gradient descent without momentum)
How do you choose β?
The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
Common values for β range from 0.8 to 0.999. If you don't feel inclined to tune this, β=0.9 is often a reasonable default.
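A minimal sketch of the update for a single parameter matrix (v is the exponentially weighted average of past gradients and is assumed to start at zeros):

```python
def momentum_update(W, dW, v, beta=0.9, learning_rate=0.01):
    """Gradient descent with momentum for a single parameter matrix."""
    v = beta * v + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - learning_rate * v        # step using the smoothed gradient
    return W, v
```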
RMSPROP
Suppose the cost contours are elongated ellipses with the minimum at the center. Plain gradient descent traverses a zig-zag path across them: most learning is happening in the vertical direction and less in the horizontal direction, which is not the ideal case highlighted above.
Now, let parameter w be on x-axis (horizontal) and b be on y-axis (vertical).
In the horizontal direction the slope is small, so dW is small, S_dW is small, and the update to W becomes large; you advance horizontally quickly. Similarly, in the vertical direction the slope is large, so db is large, S_db is large, and the update to b becomes small; the vertical oscillation gets damped out.
Combining RMSProp and Momentum results in Adam optimization algorithm.
ADAM OPTIMIZATION ALGORITHM
Adam optimization algorithm = gradient descent with momentum + RMSProp + bias correction (correcting for the zero initialization of the moving averages).
ADAM = Adaptive moment estimation
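A sketch of one Adam update for a single parameter matrix at step t, combining the momentum term v and the RMSProp term s with bias correction (beta1, beta2 and epsilon are the usual defaults; v and s start at zeros):

```python
import numpy as np

def adam_update(W, dW, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step: momentum term v, RMSProp term s, plus bias correction."""
    v = beta1 * v + (1 - beta1) * dW            # first moment (momentum)
    s = beta2 * s + (1 - beta2) * (dW ** 2)     # second moment (RMSProp)
    v_corrected = v / (1 - beta1 ** t)          # bias correction for zero init
    s_corrected = s / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v, s
```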
LEARNING RATE DECAY
Usually the learning rate should be high when training starts, so that gradient descent can take quick steps, but low as gradient descent approaches convergence; otherwise it keeps bouncing around the minimum.
One way to solve this issue is by using a method called Learning Rate decay. This method decays the learning rate as the number of epochs increase. Below are some ways to do it:
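A sketch of a few commonly used decay schedules (decay_rate, k, drop and epochs_per_drop are hyperparameters; epoch_num is assumed to start at 1):

```python
import numpy as np

def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)          # 1 / (1 + decay * epoch)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return (base ** epoch_num) * alpha0                    # exponential decay

def lr_sqrt_decay(alpha0, k, epoch_num):
    return (k / np.sqrt(epoch_num)) * alpha0               # k / sqrt(epoch)

def lr_staircase(alpha0, epoch_num, drop=0.5, epochs_per_drop=10):
    return alpha0 * (drop ** (epoch_num // epochs_per_drop))  # discrete "staircase" drops
```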
On the moon dataset, here are some statistics:
Gradient descent: 79.7% accuracy, oscillating cost
Momentum: 79.7% accuracy, oscillating cost
Adam: 94% accuracy, smoother cost
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. The large oscillations in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.
Adam, on the other hand, clearly outperforms mini-batch gradient descent and momentum. The other two models would also reach good accuracy if trained for more epochs, but Adam converges much faster.
TENSORFLOW
Code for finding the best parameters corresponding to lowest cost
tensorflow.py
import numpy as np
import tensorflow as tf

W = tf.Variable(0, dtype=tf.float32)
cost = tf.add(tf.add(W**2, tf.multiply(-10., W)), 25)   # cost = W^2 - 10W + 25 = (W - 5)^2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train)
    print(sess.run(W))   # the parameter after the cost has been optimized 1000 times (close to 5)
A placeholder is an object whose value you can specify only later. To specify values for a placeholder, you can pass in values by using a "feed dictionary"
tensorflow_1.py
import numpy as np
import tensorflow as tf

coeff = np.array([[1.], [-20.], [25.]])
W = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])
cost = x[0][0]*W**2 + x[1][0]*W + x[2][0]   # cost coefficients are fed in at run time
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train, feed_dict={x: coeff})
    print(sess.run(W))
Writing and running programs in TensorFlow has the following steps:
Create Tensors (variables) that are not yet executed/evaluated.
Write operations between those Tensors.
Initialize your Tensors.
Create a Session.
Run the Session. This will run the operations you'd written above.
When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.
A SOFTMAX layer generalizes SIGMOID to when there are more than two classes
HYPERPARAMETER TUNING
Random sampling works better than a grid search. Why?
Say HP1 is very important but HP2 is not. With a 5 x 5 grid of 25 trials, we would have checked only 5 distinct values of HP1, whereas 25 random trials would have checked 25 distinct values of HP1.
Coarse to fine search process
At the beginning of the run, start with coarse values and find the region that gives good accuracy. Next, zoom into that region and sample the hyper-parameters more finely there. In later iterations this method focuses more of the budget on the useful range of HPs.
Choose HPs on a log scale, not on a linear scale. Say we are searching for the best learning rate α. Around the lower end of the scale, the results are very sensitive to small changes in α, so the algorithm should spend more resources exploring that high-sensitivity region rather than the low-sensitivity region (higher values of α). A logarithmic scale samples more densely when α is at the lower end of the scale, which is an efficient way to distribute the samples and explore the space of possible values.
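For example, to sample a learning rate log-uniformly between 1e-4 and 1 (a small sketch):

```python
import numpy as np

r = -4 * np.random.rand()   # r is uniform in [-4, 0]
alpha = 10 ** r             # alpha is log-uniform in [1e-4, 1]
```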
Two approaches for hyperparameter tuning in practice:
Pandas approach: computational resources are very low. Babysit a single model with different HPs every day and track the reduction in error.
Caviar approach: computational resources are huge. Run many models in parallel and pick the best one.
COVARIATE SHIFT
Most supervised machine learning techniques are built on the assumption that data at the training and production stages follow the same distribution.
Distributions of inputs (queries) change but the conditional distribution of outputs (answers) is unchanged. Distribution of the inputs used as predictors (covariates) changes between training and prediction stages. This is normally due to changes in state of latent variables, which could be temporal (even changes to the stationarity of a temporal process)
BATCH NORMALIZATION
Normalizing the input features X can help learning a neural network. Batch norm applies that normalization process to the values deep in hidden layers of a NN. This normalizes the mean and variance of hidden layer's values (Z).
Steps to implement gradient descent with batch normalization:
Compute forward propagation on the input mini-batch; in each hidden layer, use batch normalization to convert Z to Z-tilde.
Use backprop to compute dW, dβ and dγ.
Update the respective parameters (W, β and γ) with their gradients.
Use any optimization algorithm such as momentum, RMSProp and Adam
Input normalization puts all the features in X on the same scale. Say there are two input features, one ranging from 1 to 10 and the other from 1 to 1,000. Normalizing the input puts both features on the same scale, so the cost function's contours look like concentric circles rather than an elongated ellipse, which makes it easier to find the minimum. Batch normalization does a similar thing for the values in the hidden units.
Batch normalization helps overcome covariate shift by making the weights deeper in the network more robust to changes in the weights of earlier layers. Even if the distribution of a layer's inputs shifts, the mean and variance of its normalized values stay the same (as set by β and γ). This makes the deeper layers more robust, since they always see data with a similar mean and variance.
Batch Normalization limits the amount to which updating the parameters in the earlier layers (input too) can effect the distribution of values that deeper layers now see. It weakens the coupling between the earlier layer parameters and later layer parameters, which forces each layer to learn by itself.
Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the values Z[l] within that minibatch. So similar to dropout, it adds some noise to each hidden layer’s activations. This has a slight regularization effect.
During training, we calculate mean and variance over a mini-batch of input data.
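A sketch of that computation for one layer's pre-activations Z of shape (units, batch_size); gamma and beta are the learnable scale and shift, and epsilon avoids division by zero:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z over the mini-batch, then scale and shift with gamma/beta."""
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the batch
    var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta               # learnable mean and variance
    return Z_tilde, mu, var
```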
During training we use a mini-batch of examples, but at test time we may process a single example at a time, and computing a mean and variance over one example doesn't make sense. To handle this, estimate the mean and variance with exponentially weighted averages computed across mini-batches during training, and use those estimates at test time.
Convolutional Neural Networks
Introduction
ConvNets sound like a weird combination of biology and math with a little CS sprinkled in, but these networks have been some of the most influential innovations in computer vision. The classic and arguably most popular use case is image processing, and they have recently been applied to natural language processing as well.
Background
The first successful applications of ConvNets came from Yann LeCun in the 1990s; he created LeNet, which could be used to read handwritten digits. (source: giphy)
In 2010 the Stanford Vision Lab released ImageNet, a dataset of 14 million images with labels detailing their contents.
The first viable example of a CNN applied to ImageNet was AlexNet in 2012.
The Problem Space
Image classification is the task of taking an input image and outputting a class (a cat, dog, etc) or a probability of classes that best describes the image. So, this turns out to be in Supervised Classification space. The whole network expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other
(source: sourcedexter)
Inputs and Outputs
Unlike a regular neural network, the layers of a ConvNet have neurons arranged in 3 dimensions: width x height x depth. For the input volume this is just the image itself, where each number is a value from 0 to 255 describing the pixel intensity at that point.
Complete model
(source: clarifai)
Structure
We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer. We will stack these layers to form a full ConvNet architecture. We'll take the example of CIFAR-10 for better understanding.
INPUT
INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
CONV (MATH PART)
Imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. Now, let’s imagine this flashlight sliding across all the areas of the input image. In machine learning terms, this flashlight is called a filter (or neuron, or kernel) and the region it is shining over is called the receptive field. The filter is an array of numbers, where the numbers are called weights or parameters. The filter is randomly initialized at the start and is learnt over time by the network. NOTE: the depth of this filter has to be the same as the depth of the input (this makes sure the math works out), so the dimensions of this filter are 5 x 5 x 3. (source: Andrej Karpathy)
Let's take the first position of the filter, at the top left corner. As the filter slides, or convolves, around the input image, it multiplies the values in the filter with the original pixel values of the image (i.e., it computes element-wise multiplications).
Element-wise multiplication: the filter and the receptive field in this example are both 5 x 5 x 3, which gives 75 multiplications in total. These products are summed up into a single number. Remember, this number only represents the filter sitting at the top left of the image. We repeat this process for every location on the input volume (next, move the filter right by 1 unit, then right again by 1, and so on). Every unique location on the input volume produces a number. (source: Andrej Karpathy)
After sliding the filter over all locations, we are left with 28 x 28 x 1 array of numbers, which are called the activation map or feature map. (source: Andrej Karpathy)
Now, we will have an entire set of filters in each CONV layer (e.g. 6 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume ( 28 x 28 x 6) (source: Andrej Karpathy)
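A naive NumPy sketch of that computation (no padding, stride 1; a 32 x 32 x 3 input with 6 filters of size 5 x 5 x 3 yields a 28 x 28 x 6 output volume):

```python
import numpy as np

def conv_forward(image, filters, biases):
    """Naive valid convolution: image (H, W, C), filters (K, K, C, F)."""
    H, W, C = image.shape
    K, _, _, F = filters.shape
    out = np.zeros((H - K + 1, W - K + 1, F))
    for f in range(F):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                receptive_field = image[i:i + K, j:j + K, :]
                out[i, j, f] = np.sum(receptive_field * filters[:, :, :, f]) + biases[f]
    return out

activation_maps = conv_forward(np.random.randn(32, 32, 3),
                               np.random.randn(5, 5, 3, 6),
                               np.zeros(6))
print(activation_maps.shape)  # (28, 28, 6)
```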
Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network
CONV (High Level Perspective)
Let’s say our first filter is 7 x 7 x 3 and is going to be a curve detector. As a curve detector, the filter will have a pixel structure with higher numerical values along the area that forms the shape of a curve. (source: adeshpande)
When we have this filter at the top left corner of the input volume, it is computing multiplications between the filter and pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner.
(source: adeshpande)
Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the multiplications summed together will result in a large value! Now let’s see what happens when we move our filter.
(source: adeshpande)
The value is much lower! This is because there wasn’t anything in the image section that responded to the curve detector filter. This is just a filter that is going to detect lines that curve outward and to the right. We can have other filters for lines that curve to the left or for straight edges. The more filters, the greater the depth of the activation map, and the more information we have about the input volume.
Now when you apply a set of filters on top of previous activation map (pass it through the 2nd conv layer), the output will be activations that represent higher level features. Types of these features could be semicircles (combination of a curve and straight edge) or squares (combination of several straight edges). As you go through the network and go through more CONV layers, you get activation maps that represent more and more complex features. By the end of the network, you may have some filters that activate when there is handwriting in the image, filters that activate when they see pink objects, etc.
(source: Andrej Karpathy)
FULLY CONNECTED LAYER
This layer basically takes an input volume (whatever the output is of the CONV or ReLU or POOL layer preceding it) and outputs an N dimensional vector where N is the number of classes that the program has to choose from. For example, if you wanted a digit classification program, N would be 10 since there are 10 digits. Each number in this N dimensional vector represents the probability of a certain class.
Training
FORWARD PASS
Take a training image, which as we remember is a 32 x 32 x 3 array of numbers, and pass it through the whole network. On our first training example, since all of the weights or filter values were randomly initialized, the output will probably be something like [.1 .1 .1 .1 .1 .1 .1 .1 .1 .1], an output that doesn’t give preference to any number in particular. The network, with its current weights, isn’t able to look for those low-level features and thus isn’t able to make any reasonable conclusion about what the classification might be.
LOSS FUNCTION
Let’s say for example that the first training image inputted was a 3. The label for the image would be [0 0 0 1 0 0 0 0 0 0]. A loss function can be defined in many different ways, but a common one used in classification is cross-entropy, often called log loss.
As you can imagine, the loss will be extremely high for the first couple of training images. Now, let’s just think about this intuitively. We want to get to a point where the predicted label (output of the ConvNet) is the same as the training label (This means that our network got its prediction right). In order to get there, we want to minimize the amount of loss we have. Visualizing this as just an optimization problem in calculus, we want to find out which inputs (weights in our case) most directly contributed to the loss (or error) of the network.
This is the mathematical equivalent of a dL/dW where W are the weights at a particular layer.
BACKWARD PASS
Perform backward pass through the network, which is determining which weights contributed most to the loss and finding ways to adjust them so that the loss decreases.
WEIGHT UPDATE
We take all the weights of the filters and update them so that they change in the opposite direction of the gradient.
A high learning rate means that bigger steps are taken in the weight updates and thus, it may take less time for the model to converge on an optimal set of weights. However, a learning rate that is too high could result in jumps that are too large and not precise enough to reach the optimal point.
The process of forward pass, loss function, backward pass, and parameter update is one training iteration. The program will repeat this process for a fixed number of iterations for each set of training images (commonly called a batch). Once you finish the parameter update on the last training example, hopefully the network should be trained well enough so that the weights of the layers are tuned correctly.
Hyperparameters
STRIDE
The amount by which the filter shifts is the stride. Stride is normally set in a way so that the output volume is an integer and not a fraction.
Let’s look at an example. Imagine a 7 x 7 input volume and a 3 x 3 filter. With a stride of 1, the receptive field shifts one unit at a time and the output is 5 x 5. With a stride of 2, the receptive field shifts by 2 units each time and the output shrinks to 3 x 3. Notice that if we tried to set our stride to 3, we’d have issues with spacing: the receptive fields would no longer fit evenly on the input volume.
PADDING
Motivation: what happens when you apply three 5 x 5 x 3 filters to a 32 x 32 x 3 input volume? The output volume would be 28 x 28 x 3. Notice that the spatial dimensions decrease. As we keep applying CONV layers, the size of the volume will decrease faster than we would like. In the early layers of the network we want to preserve as much information about the original input volume as possible, so that we can extract those low-level features. What if we want to apply the same CONV layer but keep the output volume at 32 x 32 x 3? Zero-padding comes to the rescue.
Zero padding pads the input volume with zeros around the border.
The formula for calculating the output size for any given CONV layer is
O = (W - K + 2P) / S + 1
where O is the output height/length, W is the input height/length, K is the filter size, P is the padding, and S is the stride.
QUIZ TIME
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2
Output volume size: ?
Number of parameters in this layer: ?
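Working it through with the formula above: the output size is (32 - 5 + 2*2)/1 + 1 = 32, so the output volume is 32 x 32 x 10 (one 32 x 32 activation map per filter). Each filter has 5*5*3 = 75 weights plus 1 bias, i.e. 76 parameters, so the layer has 76 * 10 = 760 parameters.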
Activation Functions Cheat Sheet
Rectified Linear Unit (ReLU)
After each CONV layer, it is convention to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce non-linearity into a system that has essentially just been computing linear operations during the CONV layers (element-wise multiplications and summations).
It also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers
ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0.
Rectified Linear Units Improve Restricted Boltzmann Machines
Pooling Layers
It is also referred to as a down-sampling layer. In this category, there are also several layer options, with max-pooling being the most popular. This basically takes a filter (normally of size 2x2) and a stride of the same length. It then applies it to the input volume and outputs the maximum number in every subregion that the filter convolves around.
(source: Andrej Karpathy)
Other options for pooling layers are average pooling and L2-norm pooling.
The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features. As you can imagine, this layer drastically reduces the spatial dimension (the length and the width change, but not the depth) of the input volume. This serves two main purposes. The first is that the spatial size of the representation, and hence the computation in later layers, is reduced by 75%. The second is that it helps control overfitting.
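A tiny NumPy sketch of 2 x 2 max-pooling with stride 2 on a single channel:

```python
import numpy as np

def max_pool_2x2(A):
    """Downsample (H, W) by taking the max of each non-overlapping 2x2 block."""
    H, W = A.shape
    return A[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

A = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(A))  # [[6 8]
                        #  [3 4]]
```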
Dropout Layers
This layer “drops out” a random set of activations in that layer by setting them to zero.
What are the benefits of such a simple and seemingly unnecessary and counterintuitive process? It forces the network to be redundant. The network should be able to provide the right classification or output for a specific example even if some of the activations are dropped out. It makes sure that the network isn’t getting too “fitted” to the training data and thus helps alleviate the overfitting problem
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Network in Network Layers (1x1 convolution)
Now, at first look, you might wonder why this type of layer would even be helpful since receptive fields are normally larger than the space they map to. However, we must remember that these 1x1 convolutions span a certain depth, so we can think of it as a 1 x 1 x N convolution where N is the number of filters applied in the layer.
Dimensionality reduction:
Height and Width : Max-pooling
Depth : 1x1 convolution
Network In Network
Brain/ Neuron view of CONV layer
Suppose we have an input of 32 x 32 x 3 and we convolve a filter of size 5 x 5 x 3 over it.
The resulting activation map is a 28 x 28 sheet of neuron outputs where:
Each is connected to a small region in the input
All of them share parameters
We convolve 5 filters of size 5x5x3 and get a 28x28x5 output. Each neuron shares parameters with its siblings in the same filter, but doesn't share parameters across the depth (with other filters).
Neurons at the same spatial position across the depth of the output volume look at the same receptive field in the input, but have different parameters/filters.
Case Study
There are several architectures in the field of Convolutional Networks that have a name. The most common are:
LeNet. The first successful applications of Convolutional Networks were developed by Yann LeCun in 1990’s. Of these, the best known is the LeNet architecture that was used to read zip codes, digits, etc.
AlexNet. The first work that popularized Convolutional Networks in Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The Network had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer always immediately followed by a POOL layer).
ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.
GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.
VGGNet. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Their pretrained model is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.
ResNet. Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming’s presentation (video, slides), and some recent experiments that reproduce these networks in Torch. ResNets are currently by far state of the art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 10, 2016). In particular, also see more recent developments that tweak the original architecture from Kaiming He et al. Identity Mappings in Deep Residual Networks (published March 2016).
References
https://www.youtube.com/watch?v=GYGYnspV230&index=7&list=PL16j5WbGpaM0_Tj8CRmurZ8Kk1gEBc7fg (archived video)
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
http://cs231n.github.io/convolutional-networks/
http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
word2vec
Word2Vec uses a trick you may have seen elsewhere in machine learning. We’re going to train a simple neural network with a single hidden layer to perform a certain task, but then we’re not actually going to use that neural network for the task we trained it on! Instead, the goal is actually just to learn the weights of the hidden layer–we’ll see that these weights are actually the “word vectors” that we’re trying to learn.
NOTE : Another place you may have seen this trick is in unsupervised feature learning, where you train an auto-encoder to compress an input vector in the hidden layer, and decompress it back to the original in the output layer. After training it, you strip off the output layer (the decompression step) and just use the hidden layer--it's a trick for learning good image features without having labeled training data.
We’re going to train the neural network to do the following. Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.
NOTE: "nearby" = "window size" parameter to the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).
The output probabilities are going to relate to how likely it is to find each vocabulary word nearby our input word. For example, if you gave the trained network the input word “Soviet”, the output probabilities are going to be much higher for words like “Union” and “Russia” than for unrelated words like “watermelon” and “kangaroo”.
We’ll train the neural network to do this by feeding it word pairs found in our training documents.
The network is going to learn the statistics from the number of times each pairing shows up. So, for example, the network is probably going to get many more training samples of (“Soviet”, “Union”) than it is of (“Soviet”, “Sasquatch”). When the training is finished, if you give it the word “Soviet” as input, then it will output a much higher probability for “Union” or “Russia” than it will for “Sasquatch”.
Model Details
The input is a one-hot vector with a 1 in the position corresponding to the input word and 0 everywhere else. The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word.
There is no activation function on the hidden layer neurons, but the output neurons use softmax.
When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution
Architecture:
Input : 1 x 10000 = 1 x (vocabulary_size)
Hidden : 10000 x 300 = (vocabulary_size) x (number_of_features)
Output : 300 x 10000 = (number_of_features) x (vocabulary_size)
So the end goal of all of this is really just to learn this hidden layer weight matrix – the output layer we’ll just toss when we’re done!
This means that the hidden layer of this model is really just operating as a lookup table. The output of the hidden layer is just the “word vector” for the input word.
The output layer
The 1 x 300 word vector for “ants” then gets fed to the output layer. The output layer is a softmax regression classifier. Each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1.
Specifically, each output neuron has a weight vector (300 x 1) which it multiplies against the word vector (1 x 300) from the hidden layer, which results in (1 x 1). Then it applies the function exp(x) to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes.
NOTE: Let's say that in our training corpus, every single occurrence of the word 'York' is preceded by the word 'New'. That is, at least according to the training data, there is a 100% probability that 'New' will be in the vicinity of 'York'. However, if we take the 10 words in the vicinity of 'York' and randomly pick one of them, the probability of it being 'New' is not 100%; you may have picked one of the other words in the vicinity
Skip-gram neural network contains a huge number of weights. For our example with 300 features and a vocab of 10,000 words, that’s 3M weights in the hidden layer and output layer each!
To combat these limitations, we have 3 ways:
Treating common word pairs or phrases as single “words” in their model.
Subsampling frequent words to decrease the number of training examples - Remove some words which occur really often
Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.
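As a concrete illustration (using the gensim library rather than the original C tool; the parameter names are gensim's and the toy corpus below is made up), the same knobs appear directly in the API: sg=1 selects skip-gram, negative is the number of negative samples, sample is the subsampling threshold, and min_count drops rare words.

```python
from gensim.models import Word2Vec  # assumes gensim 4.x is installed

sentences = [["the", "soviet", "union", "was", "large"],
             ["new", "york", "city", "is", "busy"]]   # toy corpus, made up

model = Word2Vec(
    sentences,
    vector_size=300,   # number of features per word ("size" in gensim 3.x)
    window=5,          # "nearby" = 5 words behind and 5 ahead
    sg=1,              # skip-gram rather than CBOW
    negative=5,        # number of negative samples per training pair
    sample=1e-3,       # subsampling threshold for frequent words
    min_count=1,       # keep every word in this tiny toy corpus
)
vector = model.wv["soviet"]   # the learned 300-dimensional word vector
```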
Word Pairs and “Phrases”
Each pass of their tool only looks at combinations of 2 words, but you can run it multiple times to get longer phrases. So, the first pass will pick up the phrase “New_York”, and then running it again will pick up “New_York_City” as a combination of “New_York” and “City”.
The tool counts the number of times each combination of two words appears in the training text, and then these counts are used in an equation to determine which word combinations to turn into phrases. The equation is designed to make phrases out of words which occur together often relative to the number of individual occurrences. It also favors phrases made of infrequent words in order to avoid making phrases out of common words like “and the” or “this is”.
Equation
STEP 2: decide which word combinations represent phrases. In this step, we go back through the training text and evaluate whether each word combination should be turned into a phrase, i.e., whether words A and B should be turned into A_B.
Let pa be the word count for A, pb the count for B, and pab the count for A_B, and consider the ratio pab / (pa * pb). This ratio must be a fraction, because pab <= pa and pab <= pb. The fraction will be larger if:
pab is large relative to pa and pb, meaning that when A and B occur, they are likely to occur together;
pa and/or pb are small, meaning that words A and B are relatively infrequent.
The ratio is modified slightly by subtracting the min_count parameter from pab, which eliminates very infrequent phrases; the new ratio is (pab - min_count) / (pa * pb). A high score means A and B occur together far more often than their individual frequencies would suggest, so it makes sense to have a separate vector for the combined word. Finally, this ratio is multiplied by the total number of words in the training text; presumably this has the effect of making the threshold value more independent of the training-set size.
Subsampling Frequent Words - Removing repetitive words like "the"
There are two “problems” with common words like “the”:
When looking at word pairs, (“fox”, “the”) doesn’t tell us much about the meaning of “fox”. “the” appears in the context of pretty much every word.
We will have many more samples of (“the”, …) than we need to learn a good vector for “the”.
Word2Vec implements a “subsampling” scheme to address this. For each word we encounter in our training text, there is a chance that we will effectively delete it from the text. The probability that we cut the word is related to the word’s frequency.
If we have a window size of 10, and we remove a specific instance of “the” from our text:
As we train on the remaining words, “the” will not appear in any of their context windows.
We’ll have 10 fewer training samples where “the” is the input word.
Note how these two effects help address the two problems stated above.
Here w is a word and z(w) is the fraction of the corpus made up of that word (its count divided by the total number of words). There is also a parameter in the code named 'sample' which controls how much subsampling occurs; the default value is 0.001. Smaller values of 'sample' mean words are less likely to be kept.
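The keep-probability the C code uses (with 'sample' playing the role of the 0.001 threshold) can be written as the sketch below; plugging in the values listed next reproduces the numbers quoted.

```python
import numpy as np

def keep_probability(z_w, sample=0.001):
    """P(keep w) = (sqrt(z(w)/sample) + 1) * sample / z(w)."""
    return (np.sqrt(z_w / sample) + 1) * sample / z_w

print(round(keep_probability(0.0026), 2))   # ~1.0
print(round(keep_probability(0.00746), 2))  # ~0.5
print(round(keep_probability(1.0), 3))      # ~0.033
```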
Here are some interesting points in this function (again this is using the default sample value of 0.001).
P(w) = 1.0 (100% chance of being kept) when z(w) <= 0.0026.
This means that only words which represent more than 0.26% of the total words will be subsampled.
P(w) = 0.5 (50% chance of being kept) when z(w) = 0.00746.
P(w) = 0.033 (3.3% chance of being kept) when z(w) = 1.0, that is, if the corpus consisted entirely of the word w.
Negative Sampling
Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately. In other words, each training sample will tweak all of the weights in the neural network.
As we discussed above, the size of our word vocabulary means that our skip-gram neural network has a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples! Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.
When training the network on the word pair (“fox”, “quick”), recall that the “label” or “correct output” of the network is a one-hot vector: the output neuron corresponding to “quick” should output a 1, and all of the other thousands of output neurons should output a 0.
With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example).
NOTE: The paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.
Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, making (300 x 6) = 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!
In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not).
Selecting Negative Samples
The “negative samples” (that is, the 5 output words that we’ll train to output 0) are chosen using a “unigram distribution”.
Essentially, the probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
Each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of the weights of all words.
P(wi) = f(wi)^(3/4) / Σj f(wj)^(3/4)
NOTE : The way this selection is implemented in the C code is interesting. They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by P(wi ) * table_size. Then, to actually select a negative sample, you just generate a random integer between 0 and 100M, and use the word at that index in the table. Since the higher probability words occur more times in the table, you’re more likely to pick those.
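A small Python sketch of that table construction (word_counts is a hypothetical word-to-count dictionary, and table_size is shrunk from 100M for illustration):

```python
import numpy as np

def build_unigram_table(word_counts, table_size=1_000_000, power=0.75):
    """Fill a table so each word's share of slots is count**0.75 / sum(count**0.75)."""
    words = list(word_counts)
    weights = np.array([word_counts[w] ** power for w in words], dtype=float)
    probs = weights / weights.sum()
    slots = (probs * table_size).astype(int)          # slots per word, ~P(w) * table_size
    table = np.repeat(np.arange(len(words)), slots)   # word indices, repeated
    return words, table

def sample_negative(table, rng=np.random.default_rng()):
    return table[rng.integers(len(table))]            # uniform draw over the table
```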
Random Forest
Random forest is an ensemble tool which takes a subset of observations(rows) and subset of variables(columns) to build decision trees
GETTING THEORETICAL
To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes
Each tree is grown as follows:
If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
Each tree is grown to the largest extent possible. There is no pruning.
The forest error rate depends on two things:
The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.
Reducing m reduces both the correlation and the strength. A smaller m means each tree has fewer variables to choose from, so the chance that two trees pick the same variables (and hence are correlated) is low. A larger m gives each tree a wider spectrum of variables to choose from, so the chance of correlation between two trees is higher. At the same time, reducing m decreases each tree's strength, which increases its error rate and makes it a weaker classifier; lowering correlation (good) and lowering strength (bad) therefore pull m in opposite directions. Somewhere in between is an "optimal" range of m, usually quite wide. Using the oob error rate, a value of m in that range can quickly be found. This is the only adjustable parameter to which random forests is somewhat sensitive.
Features of Random Forests
It runs efficiently on large data bases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building progresses.
It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
It has methods for balancing error in class population unbalanced data sets.
It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance.
After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.
The out-of-bag (oob) error estimate
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out of the construction of the kth tree down the kth tree to get a classification. In this way, a test-set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of case n, averaged over all cases, is the oob error estimate. This has proven to be unbiased in many tests.
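For example, scikit-learn's random forest exposes this estimate directly through oob_score (a sketch on a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)   # no separate test set needed
```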
Variable importance
In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m. Intuitively, the correct votes on the permuted data count cases that are classified correctly regardless of variable m; subtracting them from the correct votes on the untouched data gives the number of cases whose correct classification depends on variable m.
If the values of this score from tree to tree are independent, then the standard error can be computed by a standard computation. The correlations of these scores between trees have been computed for a number of data sets and proved to be quite low, therefore we compute standard errors in the classical way, divide the raw score by its standard error to get a z-score, and assign a significance level to the z-score assuming normality.
If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run.
Gini importance
Every time a split of a node is made on variable m the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance that is often very consistent with the permutation importance measure.
Interactions
The operating definition of interaction used is that variables m and k interact if a split on one variable, say m, in a tree makes a split on k either systematically less possible or more possible. The implementation used is based on the gini values g(m) for each tree in the forest. These are ranked for each tree and for each two variables, the absolute difference of their ranks are averaged over all trees. This number is also computed under the hypothesis that the two variables are independent of each other and the latter subtracted from the former. A large positive number implies that a split on one variable inhibits a split on the other and conversely.
Proximities
These are one of the most useful tools in random forests. The proximities originally formed a NxN matrix. After a tree is grown, put all of the data, both training and oob, down the tree. If cases k and n are in the same terminal node increase their proximity by one. At the end, normalize the proximities by dividing by the number of trees.
Prototypes
Prototypes are a way of getting a picture of how the variables relate to the classification. For the jth class, we find the case that has the largest number of class j cases among its k nearest neighbors (basically a cluster of cases based on class j), determined using the proximities. Among these k cases we find the median, 25th percentile, and 75th percentile for each variable. The medians are the prototype for class j and the quartiles give an estimate of its stability. For the second prototype, we repeat the procedure but only consider cases that are not among the original k, and so on.
Missing value replacement for the training set
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j (filter the cases by class j and take the median of the non-missing values of variable m), then uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
If x(m,n), i.e., the mth variable of the nth case, is a missing continuous value, estimate its fill as an average over the non-missing values of the mth variable (across all cases in the training set), weighted by the proximities between the nth case and the cases with non-missing values. If it is a missing categorical variable, replace it by the most frequent non-missing value in the training set, where the frequency is weighted by proximity.
Now iterate: construct a forest again using these newly filled-in values, compute new fills, and iterate again. Our experience is that 4-6 iterations are enough.
Missing value replacement for the test set
When there is a test set, there are two different methods of replacement depending on whether labels exist for the test set.
If they do, then the fills derived from the training set are used as replacements. If labels do not exist, then each case in the test set is replicated nclass times (nclass = number of classes). The first replicate of a case is assumed to be class 1 and the class 1 fills are used to replace its missing values; the 2nd replicate is assumed to be class 2 and the class 2 fills are used on it.
This augmented test set is run down the tree. In each set of replicates, the one receiving the most votes determines the class of the original case.
Outliers
Outliers are generally defined as cases that are removed from the main body of the data. Translate this as: outliers are cases whose proximities to all other cases in the data are generally small. A useful revision is to define outliers relative to their class. Thus, an outlier in class j is a case whose proximities to all other class j cases are small.
Define the average proximity from case n in class j to the rest of the training data in class j as the sum of the squared proximities between n and every other case in class j. The raw outlier measure for case n is the number of training cases divided by this sum of squared proximities, so it will be large when the average proximity is small. Within each class, find the median of these raw measures and their absolute deviation from the median. Subtract the median from each raw measure and divide by the absolute deviation to arrive at the final outlier measure.
Unsupervised learning
In unsupervised learning the data consist of a set of x -vectors of the same dimension with no class labels or response variables. There is no figure of merit to optimize, leaving the field open to ambiguous conclusions. The usual goal is to cluster the data - to see if it falls into different piles, each of which can be assigned some meaning.
The approach in random forests is to consider the original data as class 1 and to create a synthetic second class of the same size that will be labeled as class 2. The synthetic second class is created by sampling at random from the univariate distributions of the original data.
Thus, class two has the distribution of independent random variables, each one having the same univariate distribution as the corresponding variable in the original data. Class 2 thus destroys the dependency structure in the original data. But now, there are two classes and this artificial two-class problem can be run through random forests. This allows all of the random forests options to be applied to the original unlabeled data set.
If the oob misclassification rate in the two-class problem is, say, 40% or more, it implies that the x -variables look too much like independent variables to random forests. The dependencies do not have a large role and not much discrimination is taking place. If the misclassification rate is lower, then the dependencies are playing an important role.
Formulating it as a two class problem has a number of payoffs. Missing values can be replaced effectively. Outliers can be found. Variable importance can be measured. Scaling can be performed (in this case, if the original data had labels, the unsupervised scaling often retains the structure of the original scaling). But the most important payoff is the possibility of clustering.
Balancing prediction error
Unbalanced prediction error usually occurs when one class is much larger than another. Then random forests, trying to minimize the overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common for the actives to be outnumbered 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (the actives) will be very high.
The user can detect the imbalance by outputs of the error rates for individual classes.
In Breiman's example, there is a low overall test set error (3.73%), but class 2 has over 3/4 of its cases misclassified. The error can be balanced by setting different weights for the classes: the higher the weight a class is given, the more its error rate is decreased. A good starting guide is to make the weights inversely proportional to the class populations. With such weights the overall error rate goes up; this is the usual result - to get better balance, the overall error rate is increased.
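In scikit-learn the analogous knob is class_weight; a rough sketch (not Breiman's original weighting mechanism) looks like this.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",  # weights inversely proportional to class frequencies
    random_state=0,
)
# clf.fit(X_train, y_train)
# Expect the minority-class error to drop, usually at the cost of a
# slightly higher overall error rate.
```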
SOME POINTERS
The individual trees in a Random Forest are called weak learners. The forest builds multiple such decision trees and aggregates them to get a more accurate and stable prediction.
A single decision tree has high variance because it sees the whole data at once. When multiple trees each see a subset of the data and the final prediction is the mean/average (or vote) of all the individual predictions, the result is more stable.
A decision tree considers splits on all available variables and selects the split that results in the most homogeneous sub-nodes.
The training set for each tree is drawn by sampling with replacement (a bootstrap sample).
Predictions are averaged for regression and combined by majority voting for classification.
There are many hyperparameters that can be tuned; the main ones are listed below, followed by a short sklearn sketch.
max_features: the maximum number of features the Random Forest is allowed to try in an individual tree. Increasing max_features often improves performance, since each node has more candidate splits to consider. However, it also decreases the diversity of the individual trees, which is the main strength of a random forest, and it slows the algorithm down. You therefore need to strike a balance and choose an optimal max_features.
n_estimators: the number of trees to build before taking the maximum vote or average of the predictions. More trees give better, more stable performance but make training and prediction slower; choose as high a value as your hardware comfortably allows.
min_leaf_size: the minimum number of samples allowed in a leaf. A smaller leaf makes the model more prone to capturing noise in the training data (overfitting), while a larger leaf smooths the model and can lead to underfitting.
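A short scikit-learn sketch of these knobs (min_leaf_size corresponds to sklearn's min_samples_leaf; the concrete values are arbitrary).

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=300,      # more trees: more stable predictions, slower training
    max_features="sqrt",   # features tried per split; controls tree diversity
    min_samples_leaf=5,    # larger leaves: smoother model, less noise captured
    n_jobs=-1,
    random_state=0,
)
```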
Algorithms for finding the best variable to split the node:
Gini impurity
The Gini index measures the probability that two items selected at random from a population belong to the same class; this probability is 1 if the population is pure. The per-class probability is the number of cases belonging to that class divided by the total number of cases in the node.
Calculate the Gini score for each sub-node as the sum of the squared class probabilities, e.g. p^2 + q^2 for success and failure.
Calculate the Gini for the split as the weighted average of the Gini scores of its sub-nodes.
The higher this Gini score, the more homogeneous the split; a score of 1 means a pure node. (Equivalently, the Gini impurity 1 - (p^2 + q^2) should be as low as possible, with 0 meaning a pure node; see the sketch below.)
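The sketch below uses illustrative helper names and assumes binary integer labels.

```python
import numpy as np

def gini_score(labels):
    """Sum of squared class probabilities: 1.0 for a pure node, 0.5 for a 50/50 binary node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

def split_gini(left_labels, right_labels):
    """Weighted Gini score of a split, weighted by sub-node size."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini_score(left_labels) + (n_right / n) * gini_score(right_labels)

# Example: split producing nodes [1, 1, 1, 0] and [0, 0, 0, 0]
print(split_gini([1, 1, 1, 0], [0, 0, 0, 0]))   # ~0.81; higher score = more homogeneous split
```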
Information Gain
Entropy measures the degree of disorganization in a system.
A better split has lower entropy, i.e. more information gain.
An entropy of 0 means perfect homogeneity.
The algorithm chooses the split with the lowest weighted entropy compared to the parent node and the other candidate splits (see the sketch below).
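The sketch assumes binary integer labels and uses log base 2; helper names are illustrative.

```python
import numpy as np

def entropy(labels):
    """0 for a pure node, 1 bit for a 50/50 binary node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child   # the split with the highest gain wins

parent = [1, 1, 1, 0, 0, 0, 0, 0]
print(information_gain(parent, [1, 1, 1, 0], [0, 0, 0, 0]))  # ~0.55 bits
```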
Overfitting is a challenge with decision trees. Left unconstrained, a tree can end up making one leaf for every observation, achieving 100% training accuracy but generalizing poorly.
To overcome overfitting, 2 ways are suggested:
Setting constraints on tree size
Tree pruning:
Let the decision tree grow to a large depth
Then start at the bottom and remove leaves that give negative gain when compared with the nodes above them
This works well in cases like the following: suppose a split gives a gain of -10 (a loss of 10) and the next split on that node gives a gain of 20. A simple decision tree would stop at the first split, but with pruning we see that the overall gain is +10 and keep both splits.
Like every other model, a tree-based model also suffers from the plague of bias and variance. Bias is how far, on average, the predicted values are from the actual values. Variance is how much the model's predictions at the same point change when different samples are drawn from the same population.
Bias-variance intuition
Underfitting: High Bias - Low Variance
Overfitting: Low Bias - High Variance
A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors.
Two ensemble techniques help manage the bias-variance tradeoff: 1. Bagging (which mainly reduces variance) 2. Boosting (which mainly reduces bias)
Bagging:
Bagging is a technique used to reduce the variance of our predictions by combining the result of multiple classifiers modeled on different sub-samples of the same data set.
Random Forest uses Bagging technique.
Create Multiple DataSets:
Sampling is done with replacement on the original data and new datasets are formed.
The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model
Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting
Build Multiple Classifiers:
Classifiers are built on each data set.
Generally the same classifier is modeled on each data set and predictions are made.
Combine Classifiers:
The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
The combined values are generally more robust than a single model (a short sklearn sketch follows).
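The sketch uses scikit-learn's BaggingClassifier; the base-model argument is named estimator in recent versions (base_estimator in older ones), and the fractions are arbitrary examples.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,    # row fraction per bootstrap sample
    max_features=0.8,   # column fraction per sample
    bootstrap=True,
    random_state=0,
)
# bagger.fit(X_train, y_train); bagger.predict(X_test)
```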
Example: Random Forest
Assume the number of cases in the training set is N. A sample of N cases is drawn at random, but with replacement; this sample becomes the training set for growing the tree.
If there are M input variables, a number m<M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.
Each tree is grown to the largest extent possible and there is no pruning.
Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for classification, average for regression); a from-scratch sketch follows.
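The sketch below uses scikit-learn decision trees as the building block, assumes integer class labels 0..K-1, and is meant to illustrate the steps above rather than serve as a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, ntree=100, m=None, random_state=0):
    rng = np.random.default_rng(random_state)
    N, M = X.shape
    m = m if m is not None else int(np.sqrt(M))      # m < M variables tried at each node
    forest = []
    for _ in range(ntree):
        idx = rng.integers(0, N, size=N)             # bootstrap: N cases with replacement
        tree = DecisionTreeClassifier(max_features=m)  # grown to full depth, no pruning
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X_new):
    votes = np.stack([tree.predict(X_new) for tree in forest])   # (ntree, n_samples)
    # majority vote per case (assumes integer labels)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```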
XGBoost vs GBM
XGBoost supports Parallel processing
XGBoost allows users to define custom optimization objectives and evaluation criteria.
XGBoost has an in-built routine to handle missing values.
A GBM stops splitting a node when it encounters a negative loss at the split, making it more of a greedy algorithm. XGBoost, on the other hand, makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.
XGBoost allows the user to run a cross-validation at each iteration of the boosting process.
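A rough sketch of these points with the xgboost Python package; parameter names follow recent releases and should be checked against the installed version.

```python
import xgboost as xgb

clf = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,           # splits are made to max_depth, then pruned backwards
    n_jobs=-1,             # parallel tree construction
    missing=float("nan"),  # missing values routed via learned default directions
)
# Custom objectives/metrics and per-iteration cross-validation are available via
# the lower-level API, e.g. xgb.train(..., obj=my_objective) and xgb.cv(params, dtrain, nfold=5).
```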
Variable Importance:
One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.
When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable.
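With scikit-learn, the resulting (impurity-based) variable importances can be read off a fitted forest like this.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# rank features from most to least important
for j in np.argsort(forest.feature_importances_)[::-1]:
    print(j, round(forest.feature_importances_[j], 3))
```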
References
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
https://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
https://github.com/llSourcell/random_forests