Lesson 2

Learning rate finder

What you are looking for is the strongest downward slope that sticks around for quite a while - not a bump. Always test which learning rates work best.
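A minimal sketch of running the learning rate finder, assuming the fastai v1 API used in the course (cnn_learner, lr_find, recorder.plot) and a DataBunch called data that you have already built; the slice values are just an illustration:

import torch
from fastai.vision import cnn_learner, models, error_rate

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.lr_find()          # train briefly with an increasing learning rate
learn.recorder.plot()    # plot loss vs. learning rate

# pick a value on the long, strong downward slope, e.g.:
learn.fit_one_cycle(4, max_lr=slice(3e-5, 3e-4))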

Fixing the noise in the data

What if Google Image search doesn't give you the right images every time?

Combining a human with the machine is the best way to go.

We're going to look at the examples the model got most wrong and check for noise in the data. These are the data points that might be mislabeled.

Cleaning up

top_losses() - returns the losses of the images the model got most wrong, along with their indexes. It returns the whole dataset, sorted from worst to best.

Also, every dataset in fastai has an x and a y. So if we pass the idxs to our x, it gives us the images (usually from our validation dataset) that the model wasn't sure about. In our particular case we're using valid_ds. You could also re-run all the steps with the training and test sets.

We can then use FileDeleter(file_paths=top_loss_paths) and delete the images that don't belong.
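A hedged sketch, assuming the fastai v1 API from the time of the course (FileDeleter was later replaced by ImageCleaner in fastai.widgets):

from fastai.vision import ClassificationInterpretation
from fastai.widgets import FileDeleter

interp = ClassificationInterpretation.from_learner(learn)
losses, idxs = interp.top_losses()       # sorted: biggest losses first
top_loss_paths = data.valid_ds.x[idxs]   # the x side: the image files themselves

FileDeleter(file_paths=top_loss_paths)   # notebook GUI to delete the bad images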

FileDeleter uses a GUI - see the link for more examples. It's not good for productionizing because it only runs in the notebook, but it's useful for other practitioners. For productionizing you need to build a production web app.

Inference - you have your trained model and you are predicting things with it. You'll want to use a CPU for inference in production, unless you have huge numbers of visitors, in which case you have a lot of other problems on top of that.

open_image() to open an image...
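A hedged inference sketch, assuming the fastai v1 export/load_learner pattern (defaults.device, load_learner, open_image, predict); the folder and file names are made up:

import torch
from fastai.vision import defaults, load_learner, open_image

defaults.device = torch.device('cpu')    # run inference on the CPU

learn = load_learner(path)               # loads the export.pkl saved earlier with learn.export()
img = open_image(path/'images'/'example.jpg')    # hypothetical image path
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class)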

Production

The example uses Starlette, which lets you use await, which allows for asynchronous code - so it's not tying up a process while it's waiting for things.
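A minimal sketch of what an async prediction endpoint could look like with Starlette; the route and field names are made up, the learn/open_image objects are carried over from the inference sketch above, and you'd need python-multipart installed for form parsing:

from io import BytesIO
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

async def classify(request):
    form = await request.form()            # await: the server is free to do other work
    img_bytes = await form["file"].read()  # while it waits for the upload to arrive
    img = open_image(BytesIO(img_bytes))
    pred_class, pred_idx, probs = learn.predict(img)
    return JSONResponse({"prediction": str(pred_class)})

app = Starlette(routes=[Route("/classify", classify, methods=["POST"])])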

Free hosting - PythonAnywhere.

Errors with training

Learning rates, validation and training loss: it's not good if your training loss is higher than your validation loss. This means you haven't trained enough - either your learning rate is too low or you haven't run enough epochs. A correctly trained model has a training loss lower than its validation loss. It's especially bad if the training loss is WAY higher than the validation loss.

Too many epochs - overfitting - model doesn't generalize well.

You are overfitting if the error rate improves for a while and then starts getting worse again.

Understanding what training loss, epochs, learning rates, etc. are

np.argmax - finds the highest number and tells you what its index is.
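For example (the probabilities are made up):

import numpy as np

probs = np.array([0.1, 0.7, 0.2])
np.argmax(probs)    # -> 1, the index of the largest value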

Also, metrics are always going to be applied to the validation set.

Check out matrixmultiplication.xyz

Two things multiplied together plus two things multiplied together is a dot product. When you have lots of those (for example over the yi's and xi's), that's called a matrix product. So yi = a1*xi1 + a2*xi2 can be rewritten as y = Xa.

The a1 and a2 are the coefficients, and there's just one set of them (no i subscript) - they're shared across every data point.

We can now run y = Xa in one line of PyTorch code.
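A minimal sketch in PyTorch (the coefficient values and the amount of noise are made up for illustration):

import torch

n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)      # one column of inputs, one column of ones for the intercept
a = torch.tensor([3., 2.])     # the "true" coefficients a1, a2

y = x@a + torch.randn(n) * 0.1     # y = Xa (plus a little noise) in one line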

Unbalanced data

What to do? Nothing. Try it. It always works. If there really aren't many examples of a class, the best thing to do is to take the class that doesn't have many examples and make lots of copies of it - oversampling - but it's rare that you would need to do that.
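A hedged sketch of naive oversampling on a labels table (the column and file names are made up):

import pandas as pd

df = pd.DataFrame({"fname": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
                   "label": ["cat", "cat", "cat", "dog"]})

rare = df[df["label"] == "dog"]
df_balanced = pd.concat([df] + [rare] * 2, ignore_index=True)   # duplicate the rare class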

Stochastic Gradient Descent

x@a is a matrix-matrix multiplication, or matrix-vector, or vector-vector, or more generally a tensor multiplication.

tensor: it's an array. A 1d array, 2d, 3d, 4d, etc.

Number of dimensions = rank.

If I write a = tensor(-1.,1), then writing just that one trailing dot is enough to tell Python they're all floats (instead of having to write (-1.0, 1.0)).
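For example, using torch.tensor directly (an assumption - the course uses fastai's tensor() helper, which wraps it):

import torch

a = torch.tensor([-1., 1])   # one trailing dot is enough: both elements become floats
a.dtype                      # -> torch.float32
a.dim()                      # -> 1, i.e. a rank-1 tensor (a vector)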

How do we fit the data? Stochastic gradient descent. It's almost the same as trying to fit a line to a graph with a bunch of data points.

You want to find parameters (weights) a such that you minimize the error between the points and the line x@a (a is unknown). For a regression problem, the most common error function or loss function is the mean squared error.
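For example, mean squared error can be written as a one-liner (this is the mse() used in the update() function below):

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()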

When we get a line, we get a loss, and we want to improve the line slightly. The gradient (the derivative) tells us in which direction to move the line. In PyTorch this is done with loss.backward(). What happens to the derivatives? They get put into an attribute called .grad.

def update():
    y_hat = x@a                     # predictions from the current coefficients
    loss = mse(y, y_hat)            # how wrong the current line is
    if t % 10 == 0: print(loss)     # t is the (global) loop counter
    loss.backward()                 # compute the gradients; they land in a.grad
    with torch.no_grad():
        a.sub_(lr * a.grad) # take the coef a, and subtract (.sub) our grad.
                            # the _ means it's done in place. lr means we only take a tiny step.
                            # we subtract because we want to move opposite to the grad.
        a.grad.zero_()              # reset the gradient so it doesn't accumulate
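A sketch of driving this loop, assuming x, y, and mse from the sketches above; the learning rate and number of iterations are just illustrations:

import torch

a = torch.tensor([-1., 1.], requires_grad=True)   # initial guess for the coefficients
lr = 1e-1

for t in range(100):    # update() reads the global t for its `if t % 10 == 0` print
    update()

print(a)                # should end up close to the coefficients that generated y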

grad: all the derivatives are stored in grad.

SGD vs gradient descent - grabbing a batch of 64 points instead of calculating the loss on every single point (or image, or whatever). We grab 64 images AT RANDOM, calculate the loss on those 64 images, and update the weights.

The batch of 64 is called a mini-batch. That approach is called SGD.
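A rough sketch of what one SGD step looks like, reusing x, y, a, and mse from the sketches above (not how fastai implements it):

import torch

bs = 64                                  # mini-batch size
idx = torch.randperm(x.shape[0])[:bs]    # 64 indices chosen at random
y_hat = x[idx] @ a                       # predictions for just this mini-batch
loss = mse(y[idx], y_hat)                # loss on the mini-batch only
loss.backward()                          # gradients from the mini-batch drive the weight update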

For classification problems we use cross-entropy loss, aka negative log likelihood loss. This penalizes incorrect confident predictions, and also correct but unconfident predictions.
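A small sketch using PyTorch's built-in cross_entropy (the logits and labels are made up):

import torch
import torch.nn.functional as F

target = torch.tensor([1])                      # the true class is class 1

confident_wrong = torch.tensor([[4.0, -4.0]])   # very sure the answer is class 0
unsure = torch.tensor([[0.1, -0.1]])            # barely leaning towards class 0

F.cross_entropy(confident_wrong, target)   # large loss: confident and incorrect
F.cross_entropy(unsure, target)            # much smaller loss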

Vocab:

Epoch - one run through all of your data. But each time you see a data point, you run the risk of overfitting, which is why you don't want to have too many epochs.

SGD: gradient descent using mini-batches (which are random subsets of points).

Parameters = weights = coefficients

Regularization: all the techniques that make it so that when you train your model, it generalizes well to data it hasn't seen.
