Perceptron Algorithm
Last updated
Last updated
Supervised learning, is a subcategory of Machine Learning, where learning data is labeled, meaning that for each of the examples used to train the perceptron, the output in known in advanced.
When considering what kinds of problems a perceptron is useful for, we can determine that it’s good for tasks where we want to predict if an input belongs in one of two categories, based on it’s features and the features of inputs that are known to belong to one of those two categories.
These tasks are called binary classification tasks. Real-world examples include email spam filtering, search result indexing, medical evaluations, financial predictions, and, well, almost anything that is “binarily classifiable.”
We have students that either go accepted or rejected for a school. We can create a line that separates in order to predict the outcome.
The general way to write this is:
Where W is the vector of weights, and x is the vector of variables. We take the product of the two.
b is the bias.
For coordinates x1, x2, we'll denote a label as Y and the label is what we are trying to predict. Accepter: Y = 1.
Our prediction : y_hat. If y_hat is 1, if Wx + b >= 0.
Therefore, points above the line have Wx+b >= 0 and below <= 0.
The goal of the algorithm is to have Y_hat resembling Y as clsely as possible. This is exactly equivalent of finding the boundary ligne which keeps the blue points above the ligne and the red points below the lines.
difference is our vectors will have n entries instead of 2.
We can think of the perceptron like this:
We put our boundary line (which is our equation) in the node. We also put the small nodes with our data about the tests and the grades.
The perceptron plots the point 7,6 and checks if the label is 1 or 0. So we recall our equation and our prediction if equation >= or <= 0.
Two ways to look at it:
1: Bias in the node:
2: bias is part of the input:
So we get:
General:
The node checks if Wx + b >= 0, if yes, the node becomes 1, if not, 0.
So there are 2 nodes:
The 1st node applies the linear function to the weights. The second node is the step function, which outputs either 1 or 0. We'll be using different functions so it's important to specify it in the node.
AND Perceptron
Green: True, Red: False. We get the same thing with 1 and 0.
Or Perceptron
The line has different weights and bias.
NOT Perceptron
Unlike the other perceptrons we looked at, the NOT operation only cares about one input. The operation returns a 0
if the input is 1
and a 1
if it's a 0
. The other inputs to the perceptron are ignored.
XOR Perceptron
In the last section you used your logic and your mathematical knowledge to create perceptrons for some of the most common logical operators. In real life, though, we can't be building these perceptrons ourselves. The idea is that we give them the result, and they build themselves. For this, here's a pretty neat trick that will help us.
Say we have a linear equation like above. We also have a point (4,5). How does that point influence the line? We add the coeficients by the point coordinates (and subtract 1 from the bias, because the point has a bias of 1 in our example).
But we don't actually want to add the value of the point, that change would be too important. Instead we multiply by a certain learning rate that we specify.
This was for a point in the positive area. The same applies for the neg area, but instead of adding et subtract. New line:
Pseudo code for the perceptron algorithm
Where alpha is the learning rate and b is the bias unit. Repeat until we get no errors, or where errors are small, or after x number of iterations.
We'll need to update our perceptron algo to make it work for curves.
We're going to use error functions to minimize error.
(for 2 features.. I think.. because W[0] and W[1] only in the following code)
Another way:
See https://medium.com/@thomascountz/19-line-line-by-line-python-perceptron-b6f113b161f3 for details.
Problem: we're doing classification of 2 classes so we have either a 0 or a 1 as a prediction. What we want is to turn that into a probability to make it continous.
1 for the blue points, and 0 for the red points, which is why the bottom right point is close to 0 at 0.2.
The probs are a function of the distance to the line. How do we go from one to the other?
When we had 2 labels we applied the sigmoid function to the scores.
Now that we have more than 2 labels, we apply the softmax function to the scores. Ex: duck gets a score of 2, beaver 1, walrus 0
So softmax function is:
Softmax Python code
Maximizing this probability, when the we increase the prob of each point is call maximum likelihood.
However, using products is not ideal, we'd rather use sums. Therefore, we can take the log. We use the natural logarithm.
Because ln of anything less than 1 is negative, by taking the negative log we get positive numbers.
In fact, the bottom line is called Cross Entropy. Good model will give us a low cross entropy, a bad one will give us a high x-entropy.
Badly classified points will give us a high number:
Therefore our cross entropy will tell us if our model is good or bad. So our goal is now to minimize the cross entropy. This is make us go from the model to the left to the model to the right.
Formula:
Multiclass cross entropy formula
But the error is based on our weights W and b:
This is for binary classification. For multiclassification we need to take the multiclass cross entropy formula:
We minimize these functions by using gradient descent.
The goal is to find the optimal way to fit a distribution to the data. There are lots of types of distributions for different types of data. Ex: normal distribution, exponential dist, Gamma, dist... and many more. You wanna fit a distribution to your data is it can be easier to work with and it is also more general, it applies to every experiment of the same type.
For normal dist for example they can be skinny, "fat", etc.. We also need to find where we center the thing.
We can try to center accross all data points.
We want the location that maximises the likelyhood of observing the weights we measured. Our maximum likelyhood estimate for the mean is:
Now we need to find the max likelihood for the std deviation.
The equation for the normal distribution is:
Each time we change mu, we calculate the likelihood and plot it:
Now we can fix mu = 32 in our equation. Then do the same with sigma to get the std deviation. Then we finally find the max likelyhood.
If you have 2 measures X1 and X2, then the likihood is one times the other:
What we need to do is take two different derivatives of this equation : 1 with respect to mu and finding when it is equal to 0, the other wrt sigma and finding when it is 0.
Before we take any derivatives, we take t he log of the likelihood function. We do this because it will make our derivatives way easier.
This is the log of our likelihood function:
This simplifies to this:
same for sigma
Finally we have the normal distribution