Training Neural Networks


Early Stopping and Model Complexity Graph

Early stopping determines how many epochs we should train for.

We do gradient descent until the testing error stops decreasing and starts to increase. At that moment, we stop.
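A minimal sketch of that stopping rule (the `train_one_epoch` and `testing_error` callables are hypothetical placeholders supplied by the caller; only the stopping logic is shown):

```python
def train_with_early_stopping(train_one_epoch, testing_error,
                              max_epochs=100, patience=3):
    """Train until the testing error stops decreasing.

    `train_one_epoch` and `testing_error` are hypothetical callables;
    this sketch only shows when we decide to stop.
    """
    best_error = float("inf")
    bad_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch()
        error = testing_error()
        if error < best_error:
            best_error, bad_epochs = error, 0   # still improving, keep training
        else:
            bad_epochs += 1                     # error went up (or stalled)
            if bad_epochs >= patience:
                break                           # this is where we stop
    return best_error
```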

Exercise

The model on the right will generate large errors, which makes it difficult for the model to tune and correct them. Its function is very steep, so gradient descent is hard: the derivatives are all close to 0, except in the middle part where they are very large.

So how do we prevent this type of overfitting from happening? We have to tweak the error function a little bit: we want to punish the high coefficients. We take the old error function and add a penalty term. There are two options:

The lambda parameter tells us how much we want to penalize the large coefficients.
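As a sketch (with E the original error function, w_1, ..., w_n the weights, and lambda the regularization parameter), the two regularized error functions are:

$$E_{L1} = E + \lambda \left( |w_1| + |w_2| + \cdots + |w_n| \right)$$

$$E_{L2} = E + \lambda \left( w_1^2 + w_2^2 + \cdots + w_n^2 \right)$$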

With L1, small weights tend to go to 0. If we want to reduce our weights and end up with a small, sparse set, use L1. It's also good for feature selection, since L1 helps us determine which features are important.

L2 tries to keep all the weights reasonably small. It is normally better for training models.

Why? Taking the sum of squares of (0.5, 0.5) gives 0.5^2 + 0.5^2 = 0.5, while (1, 0) gives 1^2 + 0^2 = 1. Thus, L2 prefers the vector (0.5, 0.5) over the vector (1, 0) because it produces the smaller penalty.
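A quick NumPy check of that comparison:

```python
import numpy as np

w_sparse = np.array([1.0, 0.0])
w_spread = np.array([0.5, 0.5])

# L1 penalty (sum of absolute values): both vectors score 1.0, so L1 is indifferent
print(np.abs(w_sparse).sum(), np.abs(w_spread).sum())   # 1.0 1.0
# L2 penalty (sum of squares): (0.5, 0.5) scores 0.5, so L2 prefers it over (1, 0)
print((w_sparse ** 2).sum(), (w_spread ** 2).sum())     # 1.0 0.5
```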

Dropout

Sometimes part of the network doesn't train because its weights are less important than those of another, dominant part. We can turn off part of the network to allow the neglected part to train. We do this by randomly turning off nodes as we pass through the epochs.

We give the algorithm a parameter: the probability that each node will get turned off during an epoch.
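As a sketch of how this typically looks in code (assuming PyTorch, with made-up layer sizes), that probability is passed to the dropout layer:

```python
import torch.nn as nn

# Made-up layer sizes. nn.Dropout zeroes each activation with probability p
# during training and rescales the surviving ones by 1 / (1 - p).
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # each hidden node has a 20% chance of being turned off
    nn.Linear(128, 10),
)

model.train()   # dropout is active during training
model.eval()    # dropout is disabled at evaluation time: all nodes are used
```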

Local Minima, Random Restart and Momentum

If we hit a local minimum, what can we do? Gradient descent alone will not help us. One option is random restarts: start from several random points, do gradient descent from each of them, and keep the best result.

Momentum:

Momentum uses a constant beta, between 0 and 1, that attaches weight to the previous steps of our gradient descent, so recent steps count more than older ones.
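A sketch of the idea: the step we take at time n combines the current gradient step with the previous steps, each weighted by a higher power of beta, so older steps matter less and less:

$$\text{step}(n) \;\leftarrow\; \text{step}(n) + \beta\,\text{step}(n-1) + \beta^2\,\text{step}(n-2) + \beta^3\,\text{step}(n-3) + \cdots$$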

Stochastic Gradient Descent

We take small subsets of the data, run them through the neural network, calculate the gradient based only on those points, and take a step in that direction. We still want to use all of our data, so we do this batch by batch.
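A minimal sketch of that batch-by-batch loop (the `gradient` argument, which computes the gradient over a single batch, is a hypothetical placeholder):

```python
import numpy as np

def stochastic_gradient_descent(X, y, weights, gradient,
                                learning_rate=0.01, batch_size=32, epochs=10):
    """`gradient(X_batch, y_batch, weights)` is a hypothetical callable that
    returns the gradient of the error computed on that batch only."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad = gradient(X[batch], y[batch], weights)
            weights = weights - learning_rate * grad  # step using this batch only
    return weights
```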