Lesson 4: NLP, Collaborative Filtering, Embeddings
In our training data we have 25k reviews labelled as positive or negative. That's clearly not enough data to learn the English language from scratch.
Until recently, neural nets weren't performing well on NLP. With transfer learning they become extremely good.
In NLP, the pretrained model we'll use is a language model. A language model is a model that predicts the next word of a sentence. To do that it can use every single word of your text: if a review has 2k words, that's 2k opportunities to predict the next word.
1. First, we build a language model on Wikipedia - roughly 1 billion tokens. We get a loss whenever we predict the next word wrong, so we can improve, etc. At that point we have a model that knows a lot about English and how the world works.
2. Then we create a new language model that's good at predicting the next word of movie reviews (not our labelled dataset yet - we'd use any reviews we can find). For all of this training we don't use any labels at all, aka self-supervised learning: the labels are built into the dataset itself. This is called the target corpus.
3. The last step is to use transfer learning on the dataset we're actually trying to work with, the IMDb movie review classifier. At that point 25k labels and reviews is enough to create a good model.
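Roughly, the three steps look like this with the fastai v1 text API used in the course (the path, CSV name, and hyperparameters here are placeholders, not the actual notebook values):

```python
from fastai.text import *

# Step 2: fine-tune the Wikipedia-pretrained language model on movie reviews (no labels used)
data_lm = TextLMDataBunch.from_csv(path, 'reviews.csv')   # path/filename are placeholders
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('fine_tuned_enc')

# Step 3: transfer that encoder into a classifier trained on the 25k labelled reviews
data_clas = TextClasDataBunch.from_csv(path, 'reviews.csv', vocab=data_lm.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_enc')
learn_clas.fit_one_cycle(1, 1e-2)
```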
Using NNs for tabular data: feature engineering becomes simpler compared to boosting. You still need some, but less. This means it requires less maintenance.
Jeremy used to use random forests 99% of the time for tabular data; now he uses neural nets about 90% of the time.
For the 10% of cases where you wouldn't use a NN, you'd actually still try one - try both RF and NN.
Categorical variables: we need a different approach for categorical variables than for numerical variables - embeddings.
Continuous variables can be sent into the neural net no problem.
Then we have something similar to transforms in vision, but for tabular data they're called processes. The key difference is that they happen ahead of time: with transforms we want to randomize and do them differently each time, but processes we want to run once, before training, instead of as we go. Here we'll use FillMissing, Categorify (turn columns into pandas categoricals), and Normalize.
The way fastai deals with missing values is to fill them with the median and add a new column that tells you where the values were missing.
Normalization: whatever you do to the training set, you need to do the same thing to the validation set and the test set. fastai handles this.
For things like time series or video frames, the validation rows have to be next to each other; otherwise it's like cheating. That's why we use split_by_idx.
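Put together, a sketch of the tabular setup with fastai v1 (the column names, validation index range, and layer sizes here are illustrative placeholders):

```python
from fastai.tabular import *

# Processes run once, ahead of time, and the same stats are applied to train/valid/test
procs = [FillMissing, Categorify, Normalize]

dep_var    = 'salary'                      # placeholder target column
cat_names  = ['workclass', 'education']    # placeholder categorical columns
cont_names = ['age', 'hours-per-week']     # placeholder continuous columns

data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800, 1000)))   # a contiguous block of validation rows
        .label_from_df(cols=dep_var)
        .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)
```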
At its simplest, the data is a user ID and a product ID - who bought what. You can add review data, time, etc.
You could lay this out as a full users × movies table, but it's going to be enormous and mostly empty - most users didn't see most movies.
So we store it in a kind of sparse matrix format instead.
With all this info we can predict, for example, whether or not a user will like a movie.
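For instance, the ratings are really just a long list of (user, movie, rating) rows; materializing the full cross-tab shows why a dense table would be mostly empty (the numbers below are made up for illustration):

```python
import pandas as pd

# Ratings kept in "sparse" long form: one row per (user, movie) pair that actually exists
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [31, 1029, 31, 1061],
    'rating':  [2.5, 3.0, 4.0, 3.5],
})

# The dense users x movies view is mostly NaN - most users haven't seen most movies
dense = ratings.pivot(index='userId', columns='movieId', values='rating')
print(dense)
```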
Using this in practice is much trickier. The time you most want to be good at recommending a movie is when you have a new user, and the time you most care about recommending a movie is when it's a new movie - but when it's new you don't have any data about it. This is called the cold start problem. The only way to solve this is to have a second model that has metadata about new users or new movies.
What Netflix used to do: it showed you a bunch of movies and asked "have you seen this one?", and you'd say yes or no and rate it. So Netflix solved the cold start problem through UX - they used questions to gather the data.
The other thing you can do, if you can't ask people whether they liked those things, is use a metadata-based tabular model - maybe you know their age, sex, geography.
Collaborative filtering proper is for once you do have a bit of info about your users, the movies they liked, etc.
This creates an embedding; trunc_normal_ means the weights are initialized randomly (from a truncated normal distribution).
An embedding is just a matrix of weights that you can look up into: you index into it like an array and grab a vector out of it. In the Excel example we have an embedding matrix for users and one for movies.
Then we take the dot product of the two embedding vectors. But we want to add something extra: how popular is this movie in general, and how much does this particular person like movies in general. These are called bias terms. In SGD we added a column of ones to deal with this, but in practice we explicitly say we want a bias term.
So to recap: we don't want pred = (dot product of one movie and one user); we want the dot product plus a bias term for each.
Here we set up an embedding for the users, one for the items, and also a bias vector for each. Then when we calculate the model:
we just multiply the two together: dot = ...
then we add the biases: res = ...
and (putting aside min_score for a sec) that's what we return.
There's a final tweak at the end. In our case we said there was a min score of 0 and a max score of 5:
we can pass the result through this function so that it never goes below 0 or above 5. Even though the model could have learned this range on its own, we're making its life simpler so that it can spend more of its weights on learning the right predictions.
This is the last line of forward: we take the result of the dot product plus biases, put it through a sigmoid, multiply by (max_score - min_score) and add min_score, which gives you something between min score and max score.
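Putting those pieces together, here is a minimal PyTorch sketch of the model described above (fastai's own version also initialises the embeddings with trunc_normal_; the class and argument names here are mine, not fastai's):

```python
import torch
from torch import nn

class DotProdBias(nn.Module):
    "Sketch of the collaborative-filtering model described above (not fastai's exact class)."
    def __init__(self, n_users, n_movies, n_factors, min_score=0., max_score=5.):
        super().__init__()
        self.u_weight = nn.Embedding(n_users, n_factors)    # user embedding matrix
        self.m_weight = nn.Embedding(n_movies, n_factors)   # movie embedding matrix
        self.u_bias   = nn.Embedding(n_users, 1)             # per-user bias
        self.m_bias   = nn.Embedding(n_movies, 1)            # per-movie bias
        self.min_score, self.max_score = min_score, max_score

    def forward(self, users, movies):
        dot = (self.u_weight(users) * self.m_weight(movies)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.m_bias(movies).squeeze(1)
        # scaled sigmoid: squash into [min_score, max_score] so the model needn't learn the range
        return torch.sigmoid(res) * (self.max_score - self.min_score) + self.min_score

model = DotProdBias(n_users=100, n_movies=50, n_factors=40)
preds = model(torch.tensor([3, 7]), torch.tensor([10, 2]))   # predictions between 0 and 5
```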
We are effectively multiplying by a one-hot encoded matrix without ever creating it. Creating a one-hot encoded matrix out of an array of indices (e.g. users 1, 2, ..., 15) and multiplying it by a weight matrix of size, say, 15x5 is the same as indexing into that weight matrix with the array of user indices.
So a one-hot encoded matrix, multiplied by the weight matrix, gives us back the same weight matrix.
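A quick check of that equivalence in PyTorch (sizes chosen to match the 15x5 example above):

```python
import torch
import torch.nn.functional as F

weights = torch.randn(15, 5)            # 15 users, 5 embedding factors
users   = torch.tensor([0, 3, 3, 14])   # a mini-batch of user indices

# Build the one-hot matrix explicitly and matrix-multiply it by the weights...
one_hot = F.one_hot(users, num_classes=15).float()
via_matmul = one_hot @ weights

# ...which gives the same rows as simply indexing into the weight matrix
via_lookup = weights[users]
assert torch.allclose(via_matmul, via_lookup)
```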
Here we are finding a particular user's "embedding vector": we index by that user and get all its values. We do the same for a particular movie. Multiplying the two together (a dot product) gives us our prediction! That's our 4.40 for user 29 / id 27.
Here is a matrix multiplication but it's mathematically the same.
Parameters are just the numbers inside the matrices that you multiply by. These are the numbers your model learns using gradient descent.
Activations are the results of a matrix multiplication or of a non-linear function like ReLU. They are calculated (except for the inputs).
There are activation functions too (like ReLU) - they are element-wise, so they output the same number of numbers as they receive.
You get to choose the number of columns of each weight matrix - the rows need to match the input, but then you can choose the column size. For the last matrix, you want the output to be the size of the thing you're predicting - for digits, that's size 10.
Red arrows = layers, and there are only 2 types of layers.
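A tiny example to make the vocabulary concrete (layer sizes picked arbitrarily):

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Linear(20, 50),   # a weight matrix: its weights and biases are parameters
    nn.ReLU(),           # element-wise activation function: outputs as many numbers as it gets
    nn.Linear(50, 10),   # last matrix sized to what we predict, e.g. 10 digit classes
)

x = torch.randn(1, 20)                                # input
activations = net(x)                                  # activations are calculated
n_params = sum(p.numel() for p in net.parameters())   # parameters are learned by gradient descent
print(activations.shape, n_params)                    # torch.Size([1, 10]) 1560
```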
For collaborative filtering - we added a special activation at the end: sigmoid (a scaled sigmoid)
Then we have our loss function: for regression it's MSE (or RMSE); for classification we use cross-entropy, with softmax as the final activation function for single-label classification.
Backpropagation: calculating the gradients, then subtracting the learning rate times the gradients to get the new weights. backprop: parameters -= learning_rate * parameters.grad
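In code, one update step looks like this (a toy single-parameter example, not the lesson's notebook):

```python
import torch

w  = torch.tensor([3.0], requires_grad=True)   # a single parameter
lr = 0.1

loss = (w * 2 - 1).pow(2).mean()   # some toy loss
loss.backward()                    # backpropagation fills in w.grad
with torch.no_grad():
    w -= lr * w.grad               # parameters -= learning_rate * parameters.grad
    w.grad.zero_()                 # reset the gradient before the next step
```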
The last weight matrix of the pretrained ImageNet model is size 1000 because ImageNet has 1000 categories. We can't use it because our problem probably doesn't have 1000 categories, and they wouldn't be the same categories anyway. So we get rid of it.
When we do create_cnn, fastai deletes it. Instead it puts in two new weight matrices for us, with a ReLU in between. There's a default for the size of the first one; the second one is as big as we need it to be, depending on the number of classes (or numbers) we're trying to predict. That's data.c in fastai.
We need to train these because the new weight matrices are full of random numbers. The other layers are not new, and they are already good at different things: there are filters that are good at finding edges, lines, colour gradients, etc., and they get more complex and specific the deeper you go.
This means that, depending on our dataset, later layers that found (say) eyeballs during the ResNet pretraining aren't useful for us if we have no eyeballs in our dataset. However we would have repeating patterns or lines, which are useful. The earlier you go in the model, the more likely it is you want those weights to stay as they are.
So we freeze the earlier part:
It means: don't backprop the gradients back into those earlier layers; don't update those params.
After a while we want to train the rest of the network, so we unfreeze. But we still think the layers at the end need more training, and the first ones probably don't need much. So we give different parts of the model different learning rates.
We keep training the entire network, but the lower learning rate for the first layers moves those params less, because we think they're already pretty good. If we used a higher lr we might even kick the weights away from their good values.
This is called using discriminative learning rates.
fastai gives us 3 layer groups to which these different lr's are applied: the layers we created with random numbers are one group, and the rest of the layers are split in half to form the other two groups.
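The course workflow looks roughly like this with fastai v1 (`data` is a placeholder ImageDataBunch and the learning-rate values are illustrative):

```python
from fastai.vision import *

learn = cnn_learner(data, models.resnet34, metrics=accuracy)   # called create_cnn in older fastai

# 1. Only the new, randomly initialised head gets trained; the pretrained layers stay frozen
learn.fit_one_cycle(4)

# 2. Unfreeze, then use discriminative learning rates: the slice gives the earliest layer group
#    the small lr and the last group the large one
learn.unfreeze()
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))
```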
"Affine function" means a linear function - something very close to a matrix multiplication.
When we do matrix multiplications in a NN they aren't always exactly matrix multiplications - in a convolutional net some of the filter weights are tied - so it's more accurate to call them affine functions.
Weight decay is a type of regularization.
In stats, we are trained to think that the best function has fewer parameters, otherwise we overfit.
But you want to use more parameters: that means more interactions and more non-linearities, and real life has a lot of those. We just don't want more of them than we actually need.
So let's use lots of params and then penalize complexity: sum the square of the params (or their absolute values) and add that to the loss, e.g. to the MSE.
Problem: maybe that sum is so big that the best loss would be achieved by setting all the params to 0. That's why we multiply that sum by a number we choose, a parameter called alpha.
In fastai it's called wd; in ML it's called weight decay. Every learner has a wd argument.
What should that number be? Generally 0.1 works.
When it's in this form (added to the actual loss function) it's called L2 regularization.
In this form it's called weight decay (when we subtract wd-constant * weights from the gradients)
They are mathematically identical for plain SGD (for some optimizers they are not).
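A minimal sketch of the two forms (the function names and the folding of the factor of 2 into wd are my own; this assumes plain SGD):

```python
import torch

wd = 0.1   # the weight-decay constant (fastai's wd argument)

# L2 regularization: add wd * sum of squared parameters to the loss itself
def l2_regularized_loss(pred, target, params):
    mse = (pred - target).pow(2).mean()
    return mse + wd * sum((p ** 2).sum() for p in params)

# Weight decay: leave the loss alone and subtract wd * weight inside the update step
def sgd_step_with_weight_decay(params, lr):
    with torch.no_grad():
        for p in params:
            p -= lr * (p.grad + wd * p)   # gradient of wd*p^2 is 2*wd*p; the 2 gets folded into wd
```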