Convolutional Neural Networks
Look to see, when you add contrast, brightness, etc., the limit where the picture still represents what it actually is - even with high brightness, or low brightness. That limit is your interval for data augmentation.
Look at the dataset I'm modeling - the validation or test set - and look at what types of photos they are. Ex: professional photography - then you'd want middle brightness, contrast, etc., no extremes. If it's amateur photos, then maybe some will be over- or under-exposed.
If you're looking at satellite images then flipping makes sense; for cats, it makes no sense.
Reflection is the best border augmentation, rather than black (zeros) or other fills.
Symmetric warp - great for simulating an image seen from different angles.
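A rough sketch of how those choices turn into settings, assuming fastai v1's `get_transforms` (the library version used in the course); the exact ranges are the "interval" you find by eyeballing the augmented images:

```python
from fastai.vision import get_transforms

# Ranges chosen for ordinary pet photos: horizontal flips only, mild lighting
# changes, and a small symmetric warp to simulate different viewing angles.
tfms = get_transforms(
    do_flip=True,       # left/right flip is plausible for cats and dogs
    flip_vert=False,    # vertical flips only make sense for e.g. satellite images
    max_lighting=0.2,   # keep brightness/contrast inside the "still looks right" interval
    max_warp=0.2,       # symmetric warp
    max_rotate=10.0,
)
```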
Data augmentation for genomics? Text? Other? Great potential for this area of research.
What part of the image did the model look at the most when deciding what the image is?
Convolution - a kind of matrix multiplication, with a few differences.
We do an element-wise multiplication of the kernel (9 elements) with each of the 9 pixel values wherever the kernel happens to be. We then add them all up, and that is shown on the right - there's 1 red square, 1 output - it's the result of the element-wise multiplication and addition. THIS IS CALLED A CONVOLUTION.
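A minimal PyTorch sketch of that single step - element-wise multiply, then sum - slid over a toy 5x5 image (all the names and values here are illustrative):

```python
import torch

# One convolution "step": element-wise multiply a 3x3 kernel with a 3x3 patch
# of the image, then sum the 9 products into a single output number.
img = torch.arange(25, dtype=torch.float32).reshape(5, 5)   # toy 5x5 "image"
kernel = torch.tensor([[ 1.,  2.,  1.],
                       [ 0.,  0.,  0.],
                       [-1., -2., -1.]])                     # a top-edge kernel

out = torch.zeros(3, 3)                      # 5 - 3 + 1 = 3: the output is smaller
for i in range(3):
    for j in range(3):
        patch = img[i:i+3, j:j+3]            # the 9 pixels currently under the kernel
        out[i, j] = (patch * kernel).sum()   # multiply element-wise, add them all up
print(out)
```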
Other def:
The feature is the 3x3 kernel above - the convolution is the application of that feature or kernel to every part of the image. We end up with a map of numbers indicating how well that feature matched each part of the image. All of our high numbers are along the diagonal, which makes sense: our feature or kernel matches along the diagonal much better than anywhere else.
Doing it for multiple features (kernels), you get different maps. This is why we end up with 16 channels or results, or however many we decide (more and more the deeper you go). So we create a stack of filtered images - that's a convolution layer.
Therefore, the image on the right is 1 pixel smaller than the original around the edge - that's the black border. The kernel can't be centred on border pixels, so those outputs can't be calculated.
The face has turned into white parts outlining horizontal edges. How? By the kernel multiplication. Why is that creating white spots? Ex: near a horizontal edge, the pixels above are whiter, so they have high values, and they get multiplied by 1, 2, 1. The pixels below are darker, so they have small values, and they get multiplied by -1, -2, -1. You end up with a high number - so a white pixel in the right image.
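A small check of that explanation, assuming PyTorch's `F.conv2d`: a synthetic image with a bright top half and a dark bottom half lights up exactly along the horizontal boundary.

```python
import torch
import torch.nn.functional as F

# Synthetic grayscale image: bright top half, dark bottom half.
img = torch.zeros(1, 1, 8, 8)
img[:, :, :4, :] = 1.0

# The top-edge kernel: +1, +2, +1 over the upper row, -1, -2, -1 over the lower row.
k = torch.tensor([[[[ 1.,  2.,  1.],
                    [ 0.,  0.,  0.],
                    [-1., -2., -1.]]]])

out = F.conv2d(img, k)     # shape (1, 1, 6, 6)
print(out[0, 0])           # large positive values along the horizontal boundary, 0 elsewhere
```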
Def Channel - the output of a convolution is called a channel. In the example above we have a channel that found top edges, because that's how the kernel was made.
At first, a convolution is only able to pick up edges, because that's all the kernel-times-pixel multiplication plus addition can do. That's why the 1st layer does this. But the next layer can take the result of this and combine channels (the convolution "results").
It can take one channel that found top edges and another that found left edges, and then the layer after can combine the two into something that finds top-left corners.
The same four weights are being moved around the image. Each output is the result of the same linear function applied to a different patch.
This is the network view of the same image - pixels a, b, c, d, e, f, g, h, i. We multiply the inputs by weights to get P, Q, R, S. But some weights - the grey ones - are 0. P is only connected to a, b, d, e.
In other words, remembering that the graph represents a matrix multiplication, we can represent the convolution as a matrix multiplication (top left). As we can see from that matrix, a lot of its values are 0.
The same-coloured entries in the matrix have the same weight. When things share the same weight it's called weight tying.
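A sketch of that matrix view for the 2x2-kernel example (outputs P, Q, R, S over pixels a..i); the kernel values here are arbitrary - the point is the zeros and the repeated (tied) weights:

```python
import torch

# A 2x2 kernel [[w00, w01], [w10, w11]] slid over a 3x3 image (pixels a..i, flattened)
# is the same linear map as multiplying by this 4x9 matrix: each output only "sees"
# 4 inputs (the rest of its row is 0), and the same 4 weights appear in every row.
w00, w01, w10, w11 = 1., 2., 3., 4.          # arbitrary kernel values
M = torch.tensor([
    # a    b    c    d    e    f    g    h    i
    [w00, w01, 0.,  w10, w11, 0.,  0.,  0.,  0.],   # P
    [0.,  w00, w01, 0.,  w10, w11, 0.,  0.,  0.],   # Q
    [0.,  0.,  0.,  w00, w01, 0.,  w10, w11, 0.],   # R
    [0.,  0.,  0.,  0.,  w00, w01, 0.,  w10, w11],  # S
])
img = torch.arange(9, dtype=torch.float32)   # a..i flattened
print((M @ img).reshape(2, 2))               # same result as sliding the 2x2 kernel
```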
In practice, our libraries have specific convolutional functions that we use. We don't use matrix multiplication because it's slow.
If we have a 3x3 kernel and a 3x3 image, we need to add padding; otherwise the kernel can only produce one pixel of output - it can't move around. So what libraries do is add a padding of 0's.
With padding, now we can move our kernel around and get the same output size as we started with.
In fastai, they use reflection padding, not 0 padding.
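A quick comparison of the options in PyTorch (zero padding via `padding=1`, reflection via `F.pad`); the values are random, only the shapes matter:

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 5, 5)
k = torch.randn(1, 1, 3, 3)

no_pad = F.conv2d(img, k)                # (1, 1, 3, 3): we lose a border pixel on each side
zero   = F.conv2d(img, k, padding=1)     # (1, 1, 5, 5): zero padding keeps the size
# Reflection padding (what fastai prefers over zeros): pad first, then convolve.
refl   = F.conv2d(F.pad(img, (1, 1, 1, 1), mode='reflect'), k)   # also (1, 1, 5, 5)
print(no_pad.shape, zero.shape, refl.shape)
```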
If we have 3 channels - RGB - then we don't want a single 3x3 kernel shared across them; we want different weights for each channel. We'd want more activation on the green channel if we're building a green-frog detector.
This is why we'd want a 3d kernel
Instead of doing an element-wise multiplication of 3x3 = 9 things, we do an element-wise multiplication of 3x3x3 = 27 things. We still add them all up into a single number.
But here we only have an output of 1 channel (5x5). How do we get more output channels? We create other kernels.
We end up with a number of output channels corresponding to the number of kernels we used. A common number is 16 channels, representing things like how much left edge was on this pixel, how much top edge, how much blue-to-red gradient - for each set of 9 pixels across the RGB channels.
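Shape check with PyTorch's `nn.Conv2d` (the numbers are illustrative): 16 kernels, each 3x3x3, give 16 output channels.

```python
import torch
import torch.nn as nn

# 3 input channels (RGB) in, 16 output channels out: 16 separate kernels,
# each of shape 3x3x3, i.e. 27 weights multiplied and summed into one number.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv.weight.shape)        # torch.Size([16, 3, 3, 3]) -- 16 kernels of 3x3x3

x = torch.randn(1, 3, 64, 64)   # a batch of one RGB image
print(conv(x).shape)            # torch.Size([1, 16, 64, 64]) -- one map per kernel
```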
Then you can do that again -
As we get deeper in the network, we want to add more and more channels - we want to be able to find a richer and richer set of features.
However, in order to not have too many memory problems, we from time to time create a convolution where we don't step over every single 3x3 patch, but instead jump 2 pixels at a time. That is called a stride-2 convolution. It looks exactly the same - it's still a bunch of kernels - we're just jumping 2 pixels at a time across the matrix.
We skip every other input pixel. The output from that will be h/2 x w/2 x channels - but when we do that we usually create twice as many kernels, so now we can have 32 activations or channels.
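A stride-2 sketch in PyTorch: the spatial size halves and we double the channels (16 to 32 here; the numbers are just an example).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
# Stride-2 conv: same kind of kernel, but it jumps 2 pixels at a time,
# so H and W are halved; we typically double the channels to compensate.
conv = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
print(conv(x).shape)   # torch.Size([1, 32, 32, 32])
```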
Usually our first convolutional layer is stride 2.
176 x 176 (the number of pixels - since it was stride 2 it's not 352 x 352; you halve that), 64 activations (channels).
Conv2d: 3 is the input channels, 64 is the output channels, the kernel_size is (7,7) - usually it's (3,3). And we see stride 2. Because we use a larger kernel, we use more padding: (3,3) padding.
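That layer written out in PyTorch, assuming the input size was 352x352 (which is what halving to 176 implies):

```python
import torch
import torch.nn as nn

# The first layer from the model summary: 3 channels in, 64 out,
# a larger 7x7 kernel, stride 2, and padding 3 (bigger kernel -> more padding).
first = nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

x = torch.randn(1, 3, 352, 352)
print(first(x).shape)   # torch.Size([1, 64, 176, 176]) -- stride 2 halves 352 to 176
```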
As we go along we can increase the stride, and also increase the number of channels.
Example of bottom right edge kernel:
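An illustrative bottom-right edge kernel in PyTorch - similar in spirit to the one in the lesson 6 notebook, though the exact values here are my own: negative weights toward the top-left, positive toward the bottom-right.

```python
import torch
import torch.nn.functional as F

# Illustrative bottom-right edge kernel, repeated over the 3 RGB channels.
k = torch.tensor([[ 0.,  -5/3, 1.],
                  [-5/3, -5/3, 1.],
                  [ 1.,   1.,  1.]]).expand(1, 3, 3, 3) / 6

img = torch.randn(1, 3, 64, 64)          # stand-in for a real image tensor
edges = F.conv2d(img, k, padding=1)      # (1, 1, 64, 64) map of "bottom-right edgeness"
print(edges.shape)
```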
See lesson 6 pets more.
How do we go from our conv output of height 11, width 11, channels 512, to our 37x1 vector with the probabilities of belonging to each class?
We take each of our faces - each of the 512 channels (an 11x11 grid each) - and take the mean. This gives us a 512-long vector.
Now we just have to pop that into a single matrix multiply by a 512x37 matrix, which gives us an output of 37x1.
The step where we take the average of each face is called average pooling. (Pooling is also used to shrink the image stack.)
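The whole head in PyTorch terms (a sketch; fastai's actual head has a few more layers): average pool each 11x11 face to get a 512-vector, then one 512x37 matrix multiply.

```python
import torch
import torch.nn as nn

acts = torch.randn(1, 512, 11, 11)   # the 11x11x512 block of activations

pool = nn.AdaptiveAvgPool2d(1)       # average pooling: the mean of each 11x11 "face"
vec = pool(acts).flatten(1)          # shape (1, 512)

head = nn.Linear(512, 37)            # one matrix multiply: 512 -> 37 classes
print(head(vec).shape)               # torch.Size([1, 37]) -- one score per class
```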
The output of each channel is actually a feature - we want to know not so much the average of the 11x11 (to get the vector), but what's in each of the 11x11 spots.
What if instead of averaging across the 11x11, we average across the 512 channels? If we do that, it gives us a single 11x11 matrix. Each grid point in that matrix will be how activated that area was. When it came to figuring out that this was a Maine Coon, how many signs of Maine Coon were there in that part of the 11x11 grid? That's what we do to create our heatmap.
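The heatmap step as a one-liner, assuming we've already grabbed the 512x11x11 activations for one image (e.g. with a hook):

```python
import torch

acts = torch.randn(512, 11, 11)   # activations for one image: 512 channels of 11x11

# Average pooling (above) averages over the 11x11 positions; for the heatmap we
# average over the 512 channels instead, keeping one value per grid location.
heatmap = acts.mean(dim=0)        # shape (11, 11): how activated each area was
print(heatmap.shape)
```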
Context: we wanted deeper models to make them better. Theory: add more stride-1 convs, which don't shrink the input, so we can have more layers. However, when you do this, you get what happens on the left: the deeper network is worse on the training error. They were the same models, but one had more stride-1 convs.
The fix: the skip connection - every two conv layers, we add the identity (the original input) to the output.
o = X + conv2(conv1(X)) - and this, at worst, should perform as well as the 20-layer network, because the model could set the conv weights to 0 and have X just pass through those layers. You could skip over all the extra convolutions and be as good as the 20-layer network.
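A minimal res block sketch in PyTorch (real ResNet blocks also use batchnorm and arrange the ReLUs a bit differently):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two stride-1 convs plus the identity (skip connection).
    If the convs learned to output 0, the block would just pass x through,
    so stacking these can't make the network worse than the shallower one."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # o = x + conv2(conv1(x)), with a ReLU in between the convs
        return x + self.conv2(self.relu(self.conv1(x)))

x = torch.randn(1, 16, 32, 32)
print(ResBlock(16)(x).shape)   # torch.Size([1, 16, 32, 32]) -- same shape, so x can be added
```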
What happened when (Kaiming) He did this? He won ImageNet! This is what RESNET is - the name comes from residual learning.
The figure on the right is what we call a resblock.
If you go back to older conv models that don't have res blocks and you add them, you'll almost always get better performance.
Why?
In this paper they were able to visualize the loss surface of a neural net. Left: without a skip connection. This is why the model above without the skip connection and with more layers was not performing well - it was stuck in one of these valleys. With the skip connection, the loss surface is much smoother and the global minimum is easier to find. (It's the exact same network.)