Lesson 5 - Backprop, Accelerated SGD

Momentum

When you do SGD with momentum, instead of computing each weight update from the current gradient times the learning rate alone, you compute it from roughly 10% of the current gradient and 90% of the previous update (the direction you stepped last time), scaled by the learning rate.

This makes the training go faster.

Earlier we saw that with a small learning rate it takes a very long time to converge. With momentum, you keep adding in the step you took last time, so your steps get bigger and bigger, until you overshoot the optimum. At that point the gradient points in the opposite direction to your momentum, so you head back the other way.

S(t) = alpha * grad + (1 - alpha) * S(t-1)

This is very common: it's an exponentially weighted moving average.

Each time you unroll the recursion, another factor of (1 - alpha) multiplies in: S(t-2) appears in the equation with a (1-alpha)^2, S(t-3) with (1-alpha)^3, and so on.

So it's basically the thing I want (alpha * grad) plus a weighted average of the last few steps, where the most recent ones are weighted exponentially more heavily.

That's what momentum is: the current gradient plus an exponentially weighted average of my last few steps.
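
Here's a minimal numpy sketch of that update (not the actual fastai/PyTorch implementation; the toy loss, learning rate, and alpha = 0.1 are just illustrative, matching the 10% / 90% split above):

import numpy as np

def sgd_momentum(w, loss_grad, lr=0.1, alpha=0.1, n_steps=100):
    # S(t) = alpha * grad + (1 - alpha) * S(t-1): an EWMA of recent gradients
    step = np.zeros_like(w)
    for _ in range(n_steps):
        grad = loss_grad(w)                        # current gradient
        step = alpha * grad + (1 - alpha) * step   # mix it with the previous step
        w = w - lr * step                          # move by the averaged step
    return w

# toy usage: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
w_opt = sgd_momentum(np.array([0.0]), lambda w: 2 * (w - 3))   # ends up close to 3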

RMSProp

Dropout

A kind of regularization.

We throw out some activations (and so all the weights associated with those activations are effectively removed too). Each activation is thrown away with probability p; a common value of p is 0.5.

In fastai, ps sets the dropout for the layers. You can pass in a list, and then each p in the list will be the dropout for the corresponding layer.

For CNNs it's a little different: a single value applies to the last layer, and half that value is used in the other layers.

There's an important detail with dropout:

Training time is when the weight updates happen; that's when activations are dropped.

At test time we remove dropout. But if we just remove it, there are now twice as many activations contributing (if p was 0.5). So in the dropout paper they suggest multiplying all of your weights by p at test time. PyTorch instead does the rescaling at training time, so you don't need to change anything at test time.
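
A small PyTorch sketch of that behavior (the tensor of ones is just for illustration): in train mode the surviving activations are scaled up by 1 / (1 - p), so at test time the layer can simply pass everything through unchanged.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability p
x = torch.ones(8)

drop.train()               # training mode: dropout is active
print(drop(x))             # survivors are scaled by 1 / (1 - p) = 2.0, the rest are 0

drop.eval()                # test mode: dropout is a no-op
print(drop(x))             # all ones, unchanged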

Batch Normalization

(bn_cont): BatchNorm1d(122, eps=1e-05, momentum=0.1, 
                       affine=True, track_running_stats=True)

122 = the number of continuous variables.

nn.BatchNorm1d is kind of a bit of regularization, kind of a training helper.

It comes from this paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Red line: what happens when you train without batch norm (very bumpy). Blue line: with batch norm.

This means we can increase our learning rate with batch norm. The spikes in the red line show when we're at risk of jumping off into a part of weight space that we can't get out of.

The algo:

It's going to take a mini-batch. Batch norm is a layer, so the thing coming into it is activations (the activations are called x1, x2, etc.).

  1. Find the mean of the activations

  2. Find the variance

  3. Normalize

  4. Scale and shift (most important part):

    1. We take those values and add a vector of biases (beta), so we have a bias layer.

    2. Then we use something that looks like a bias and multiply x_i by it (gamma). It's like having a multiplicative bias layer.

    They are learnable numbers. They are PARAMETERS.

This is what the layer does.
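
Here's a minimal numpy sketch of those four steps (a real layer like nn.BatchNorm1d also tracks running statistics for test time; gamma and beta are shown at their usual initial values of 1 and 0, but in practice they are learned):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: activations for one mini-batch, shape (batch_size, n_activations)
    mean = x.mean(axis=0)                      # 1. mean of the activations
    var = x.var(axis=0)                        # 2. variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # 3. normalize
    return gamma * x_hat + beta                # 4. scale (gamma) and shift (beta)

# gamma and beta are learnable parameters, one per activation
x = np.random.randn(32, 122)
out = batch_norm(x, gamma=np.ones(122), beta=np.zeros(122))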

Why does it work well?

Batch norm helps to do this really important thing, which is shifting the outputs up and down and scaling them in and out.

Explanation: say we're approximating y with ŷ, which we compute with a neural net represented by f(w1, w2, ..., w_10000, X_hat). We also have a loss function, say MSE.

Let's say we're trying to predict movie review outcomes and they are between 1 and 5.

We've tried to train our model, and the activations at the very end are in the range -1 to 1. So they are way off from where they need to be: the mean and range aren't what we want, since we wanted 1 to 5.

So with batch norm, we multiply the neural net output by g and add b. We've added two more parameter vectors.

Now we can adjust the scale with g and shift the mean with b. Batch norm helps to do this really important thing, which is shifting the outputs up and down and scaling them in and out.
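
As a toy illustration of the movie review example (the values of g and b are made up; a trained model could learn something like them):

import numpy as np

f_x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # final activations stuck around [-1, 1]
g, b = 2.0, 3.0                                # learnable scale and shift
y_hat = g * f_x + b                            # now spans [1, 5], the range we wanted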

You definitely want to use it.

Implementation:

Apply a BatchNorm layer.

(bn_cont): BatchNorm1d(122, eps=1e-05, momentum=0.1, 
                       affine=True, track_running_stats=True)

momentum = 0.1. This isn't momentum as in optimization; this is momentum as in an exponentially weighted moving average. We're taking an EWMA of the mean (and variance) from the algorithm above, rather than the plain average of every mini-batch.
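
A sketch of that running-mean update with momentum = 0.1 (this follows PyTorch's documented convention for running statistics; the per-batch means are made up):

def update_running_mean(running_mean, batch_mean, momentum=0.1):
    # EWMA across mini-batches: new = (1 - momentum) * old + momentum * batch statistic
    return (1 - momentum) * running_mean + momentum * batch_mean

running_mean = 0.0
for batch_mean in [2.0, 1.5, 2.5]:             # made-up per-batch means
    running_mean = update_running_mean(running_mean, batch_mean)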

The higher the momentum in batch norm, the stronger the regularization effect.