Classification
Rather than having a model simply assign a binary classification, most algorithms can return a probability score (propensity) of belonging to the class of interest. In fact, with logistic regression, the default output from R is on the log-odds scale, and this must be transformed to a propensity. A sliding cutoff can then be used to convert the propensity score to a decision. The general approach is as follows (a short code sketch follows the steps):
Establish a cutoff probability for the class of interest above which we consider a record as belonging to that class.
Estimate (with any model) the probability that a record belongs to the class of interest.
If that probability is above the cutoff probability, assign the new record to the class of interest.
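A minimal sketch of this cutoff approach, assuming a scikit-learn logistic regression on synthetic data; the 0.5 cutoff and all names are illustrative:

```python
# Minimal sketch: convert propensity scores to decisions with a sliding cutoff.
# Synthetic data; the 0.5 cutoff is illustrative and should be tuned per problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# predict_proba already returns propensities (probabilities), not log-odds
propensity = model.predict_proba(X)[:, 1]

cutoff = 0.5
decision = (propensity > cutoff).astype(int)  # 1 = assign to class of interest
```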
More than two categories?
Even when there are more than two possible outcomes, the problem can often be recast into a series of binary problems using conditional probabilities. For example, to predict the outcome of a contract (say Y = 0, 1, or 2), you can solve two binary prediction problems:
Predict whether Y = 0 or Y > 0.
Given that Y > 0, predict whether Y = 1 or Y = 2.
In this case, it makes sense to break up the problem into two cases: whether the customer churns, and if they don’t churn, what type of contract they will choose. From a model-fitting viewpoint, it is often advantageous to convert the multiclass problem to a series of binary problems. This is particularly true when one category is much more common than the other categories.
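A hedged sketch of this two-stage recasting with scikit-learn; the three-class synthetic data and the choice of logistic models are illustrative assumptions:

```python
# Sketch: recast a three-class problem (Y = 0, 1, 2) as two binary problems.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Binary problem 1: Y = 0 vs. Y > 0
stage1 = LogisticRegression().fit(X, (y > 0).astype(int))

# Binary problem 2: given Y > 0, Y = 1 vs. Y = 2
subset = y > 0
stage2 = LogisticRegression().fit(X[subset], (y[subset] == 2).astype(int))

# Combine with conditional probabilities: P(Y=2) = P(Y>0) * P(Y=2 | Y>0)
p_gt0 = stage1.predict_proba(X)[:, 1]
p2_given_gt0 = stage2.predict_proba(X)[:, 1]
p0 = 1 - p_gt0
p1 = p_gt0 * (1 - p2_given_gt0)
p2 = p_gt0 * p2_given_gt0
```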
The naive Bayes algorithm uses the probability of observing predictor values, given an outcome, to estimate the probability of observing outcome Y = i, given a set of predictor values.
Conditional probability The probability of observing some event (say X = i) given some other event (say Y = i), written as P(X = i | Y = i).
Posterior probability The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account).
To understand Bayesian classification, we can start out by imagining “non-naive” Bayesian classification. For each record to be classified:
Find all the other records with the same predictor profile (i.e., where the predictor values are the same).
Determine what classes those records belong to and which class is most prevalent (i.e., probable).
Assign that class to the new record.
The preceding approach amounts to finding all the records in the sample that are exactly like the new record to be classified in the sense that all the predictor values are identical.
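A toy pandas sketch of this exact-match idea; the data frame and column names are invented for illustration:

```python
# "Exact" Bayesian classification: match identical predictor profiles,
# then assign the most prevalent class among the matches.
import pandas as pd

train = pd.DataFrame({
    "purpose": ["car", "car", "car", "credit_card", "credit_card"],
    "home":    ["RENT", "RENT", "RENT", "OWN", "RENT"],
    "outcome": ["paid", "paid", "default", "paid", "default"],
})
new_record = {"purpose": "car", "home": "RENT"}

matches = train[(train["purpose"] == new_record["purpose"]) &
                (train["home"] == new_record["home"])]
prediction = matches["outcome"].mode().iloc[0] if len(matches) else None
print(prediction)  # "paid" -- the most prevalent class among exact matches
```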
Predictor variables must be categorical (factor) variables in the standard naive Bayes algorithm.
Bayes Theorem
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.
P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
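Putting these terms together, Bayes' theorem reads:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$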
Why Exact Bayesian Classification Is Impractical
When the number of predictor variables exceeds a handful, many of the records to be classified will be without exact matches. This can be understood in the context of a model to predict voting on the basis of demographic variables.
The naive Bayesian classifier is based on Bayes' theorem with the assumption of independence between predictors. A naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
In the naive Bayes solution, we no longer restrict the probability calculation to those records that match the record to be classified. Instead, we use the entire data set. The naive Bayes modification is as follows (a toy sketch follows the steps):
For a binary response Y = i (i = 0 or 1), estimate the individual conditional probabilities for each predictor, P(Xj | Y = i); these are the probabilities that the predictor value is in the record when we observe Y = i. This probability is estimated by the proportion of Xj values among the Y = i records in the training set.
Multiply these probabilities by each other, and then by the proportion of records belonging to Y = i.
Repeat steps 1 and 2 for all the classes.
Estimate a probability for outcome i by taking the value calculated in step 2 for class i and dividing it by the sum of such values for all classes.
Assign the record to the class with the highest probability for this set of predictor values.
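A toy sketch of these steps for categorical predictors; the data and column names are made up for illustration (scikit-learn's CategoricalNB does the same job in practice):

```python
# Toy implementation of the naive Bayes steps above for categorical predictors.
import pandas as pd

train = pd.DataFrame({
    "purpose": ["car", "car", "credit_card", "car", "credit_card", "car"],
    "home":    ["RENT", "OWN", "RENT", "RENT", "OWN", "OWN"],
    "outcome": ["paid", "paid", "default", "default", "paid", "default"],
})
new = {"purpose": "car", "home": "RENT"}

scores = {}
for cls, grp in train.groupby("outcome"):        # step 3: repeat for each class
    score = len(grp) / len(train)                # proportion of records with Y = cls
    for col, val in new.items():
        # steps 1-2: estimate P(Xj = val | Y = cls) and multiply it in
        score *= (grp[col] == val).mean()
    scores[cls] = score

total = sum(scores.values())                     # step 4: normalize over classes
probs = {cls: s / total for cls, s in scores.items()}
prediction = max(probs, key=probs.get)           # step 5: highest probability wins
```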
This naive Bayes algorithm can also be stated as an equation for the probability of observing outcome Y = i, given a set of predictor values X1, ..., Xp:
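In standard form (reconstructed here):

$$P(Y = i \mid X_1, \dots, X_p) = \frac{P(Y = i)\, P(X_1 \mid Y = i) \cdots P(X_p \mid Y = i)}{P(X_1, X_2, \dots, X_p)}$$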
The value of P(X1, X2, ... , Xp) is a scaling factor to ensure the probability is between 0 and 1 and does not depend on Y:
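In standard notation (reconstructed here), that is:

$$P(X_1, X_2, \dots, X_p) = \sum_{k} P(Y = k)\, P(X_1 \mid Y = k) \cdots P(X_p \mid Y = k)$$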
Why is this formula called "naive"? We have made a simplifying assumption that the exact conditional probability of a vector of predictor values, given observing an outcome, is sufficiently well estimated by the product of the individual conditional probabilities P(Xj | Y = i). In other words, in estimating P(Xj | Y = i) instead of P(X1, X2, ..., Xp | Y = i), we are assuming Xj is independent of all the other predictor variables Xk for k ≠ j.
The naive Bayesian classifier is known to produce biased estimates. However, where the goal is to rank records according to the probability that Y = 1, unbiased estimates of probability are not needed and naive Bayes produces good results.
Bayes Rule
Naive Bayes
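Restating the formula above in the notation used in this section, with class c and predictors X1, ..., Xn, the naive Bayes equation is:

$$P(Y = c \mid X_1, \dots, X_n) = \frac{P(X_1 \mid Y = c)\, P(X_2 \mid Y = c) \cdots P(X_n \mid Y = c)\; P(Y = c)}{P(X_1, X_2, \dots, X_n)}$$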
The left-hand side of the equation is the posterior probability of the class, or simply the posterior.
The product of the P(Xj | Y = c) terms is the 'likelihood of the evidence': the conditional probability of each predictor value given that Y is of a particular class 'c'.
P(Y = c) is the prior, the overall probability that Y = c, where c is a class of Y. In simpler terms, Prior = count(Y = c) / n_records.
In the worked fruit example from the article linked below, the value of P(Orange | Long, Sweet and Yellow) was zero because P(Long | Orange) was zero; that is, there were no 'Long' oranges in the training data.
This makes sense, but when a model has many features, the entire probability becomes zero if even one feature's conditional probability is zero. To avoid this, we increase the count of the zero-count value by a small number (usually 1) in the numerator, so that the overall probability doesn't become zero.
This correction is called ‘Laplace Correction’. Most Naive Bayes model implementations accept this or an equivalent form of correction as a parameter.
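In one common form, with smoothing parameter α (usually 1) and k_j distinct levels of predictor X_j, the smoothed estimate is:

$$\hat{P}(X_j = x \mid Y = c) = \frac{\operatorname{count}(X_j = x,\ Y = c) + \alpha}{\operatorname{count}(Y = c) + \alpha\, k_j}$$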
So far we've seen the computations when the X's are categorical. But how do we compute the probabilities when X is a continuous variable?
If we assume that X follows a particular distribution, we can plug in the probability density function of that distribution to compute the likelihoods.
If we assume the X's follow a Normal (aka Gaussian) distribution, which is fairly common, we substitute the corresponding probability density of a Normal distribution and call it Gaussian naive Bayes. We need just the mean and variance of X to compute this density.
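Per class c, the substituted density is the normal probability density function (reconstructed here):

$$P(x \mid Y = c) = \frac{1}{\sqrt{2\pi\sigma_c^2}}\, \exp\!\left(-\frac{(x - \mu_c)^2}{2\sigma_c^2}\right)$$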
Here μ_c and σ_c² are the mean and variance of the continuous X computed over the records in a given class c (of Y).
To make the features more Gaussian-like, you might consider transforming the variables using something like a Box-Cox or Yeo-Johnson transformation.
Some further tips to improve a naive Bayes model:
Try applying the Laplace correction to handle records with zero values in the X variables.
Check for correlated features and try removing the highly correlated ones. Naive Bayes is based on the assumption that the features are independent.
Feature engineering: combining features (e.g., taking a product of two features) to form new ones that make intuitive sense might help.
Try providing more realistic prior probabilities to the algorithm based on business knowledge, instead of letting the algorithm calculate the priors from the training sample (see the sketch below).
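A hedged scikit-learn sketch of the Laplace-correction and prior-probability tips; the alpha value and the 70/30 priors are illustrative, not recommendations:

```python
# Laplace smoothing (alpha) and user-supplied class priors in scikit-learn.
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # two classes, synthetic labels

# Categorical predictors coded as integers 0-2; alpha is the Laplace correction
X_cat = rng.integers(0, 3, size=(200, 4))
cat_model = CategoricalNB(alpha=1.0).fit(X_cat, y)

# Numeric predictors; priors supplies business-informed class probabilities
X_num = rng.normal(size=(200, 4))
gauss_model = GaussianNB(priors=[0.7, 0.3]).fit(X_num, y)
```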
Further reading, examples and source: https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
KEY TERMS FOR DISCRIMINANT ANALYSIS
Covariance A measure of the extent to which one variable varies in concert with another (i.e., similar magnitude and direction).
Discriminant function The function that, when applied to the predictor variables, maximizes the separation of the classes.
Discriminant weights The scores that result from the application of the discriminant function, and are used to estimate probabilities of belonging to one class or another.
While discriminant analysis encompasses several techniques, the most commonly used is linear discriminant analysis, or LDA. The original method proposed by Fisher was actually slightly different from LDA, but the mechanics are essentially the same.
LDA is now less widely used with the advent of more sophisticated techniques, such as tree models and logistic regression. However, you may still encounter LDA in some applications, and it has links to other more widely used methods (such as principal components analysis; see "Principal Components Analysis").
Linear discriminant analysis should not be confused with Latent Dirichlet Allocation, also referred to as LDA. Latent Dirichlet Allocation is used in text and natural language processing and is unrelated to linear discriminant analysis.
To understand discriminant analysis, it is first necessary to introduce the concept of covariance between two or more variables. The covariance measures the relationship between two variables X and Z. Denote the mean for each variable by x̄ and z̄ (see "Mean"). The covariance s_{x,z} between X and Z is given by:
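Reconstructed here, this is the usual sample covariance with an n − 1 denominator:

$$s_{x,z} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1}$$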
where n is the number of records.
Correlation is constrained to be between –1 and 1, whereas covariance is on the same scale as the variables X and Z. The covariance matrix Σ for X and Z consists of the individual variable variances, s_x² and s_z², on the diagonal (where row and column are the same variable) and the covariances between variable pairs on the off-diagonals.
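For the two variables X and Z, this matrix is:

$$\hat{\Sigma} = \begin{bmatrix} s_x^2 & s_{x,z} \\ s_{x,z} & s_z^2 \end{bmatrix}$$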
NOTE
Recall that the standard deviation is used to normalize a variable to a z-score; the covariance matrix is used in a multivariate extension of this standardization process. This is known as Mahalanobis distance (see "Other Distance Metrics") and is related to the LDA function.