Tabular data
When we convert a column to a categorical variable, NaNs are encoded as -1. An embedding cannot be indexed with -1, so we add 1 to all the categorical codes, which moves the missing value to index 0.
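To make the -1 convention concrete, here is a minimal sketch using pandas. The column values are made up for illustration. pandas gives NaN the code -1, and shifting every code by 1 yields valid, non-negative embedding indices.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series(["a", "b", np.nan, "a"])

# pandas encodes NaN as -1 in the categorical codes.
codes = s.astype("category").cat.codes.to_numpy()

# Shift by 1 so every code is a valid embedding index: 0 now means "missing".
shifted = codes + 1
```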
fill_missing works like proc_df: it fills in the missing values and keeps track of which ones were missing, because the fact that something is missing can itself help predict our outcome.
According to fastai, you can replace the NaN values with almost any number: if the missingness turns out to be important, the model can use the interaction between the Na_column and the original column to make predictions.
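A rough sketch of that idea in plain pandas, not fastai's own code. The column name, the `_na` suffix and the choice of the median as fill value are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with one missing value.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# Record which rows were missing before filling them in.
df["age_na"] = df["age"].isna()

# Fill with the median here; almost any constant would do,
# since the model can combine "age" with "age_na".
df["age"] = df["age"].fillna(df["age"].median())
```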
You need to tell fastai which variables are categorical and which are continuous, and also which processing steps you want to use: FillMissing, Categorify, Normalize.
We add the day of the month, for example, as a categorical variable: if the 15th, the 30th and the 1st of the month behave differently, the model creates an embedding matrix for that variable, and those different days of the month can each get their own behavior.
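The day-of-month idea can be sketched with pandas and NumPy. The dates, the embedding width of 4 and the random matrix are assumptions for illustration; in a real model the matrix would be learned during training.

```python
import numpy as np
import pandas as pd

# Hypothetical dates; dt.day extracts the day of the month (1..31).
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-01-15", "2015-01-30"]))
day = dates.dt.day.to_numpy()

# One row per possible day, plus row 0 for "missing"; width 4 is arbitrary.
rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 4))

# Each day indexes its own row, so the 1st, 15th and 30th get separate vectors.
vectors = emb[day]
```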
Think carefully about which variables should be categorical and which should be continuous.
If the cardinality is not too high, it is usually better to treat the variable as categorical.
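One possible way to get a first split is to look at the number of distinct values per column. The sketch below uses made-up data and an arbitrary cutoff (`MAX_CARD` is a hypothetical name, not a fastai setting). The result is only a starting point that still needs the judgment described above.

```python
import pandas as pd

# Hypothetical data: "store" has few distinct values, "sales" has many.
df = pd.DataFrame({
    "store": [1, 2, 1, 2, 1, 2],
    "sales": [10.5, 11.2, 9.8, 12.1, 10.9, 13.4],
})

# Arbitrary cutoff chosen for this sketch, not a recommended value.
MAX_CARD = 3

# Count distinct values per column and split on the cutoff.
card = df.nunique()
cat_names = [c for c in df.columns if card[c] <= MAX_CARD]
cont_names = [c for c in df.columns if card[c] > MAX_CARD]
```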
We have to tell fastai that the label class we want is a list of floats, not a list of categories. That makes this a regression problem.
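Take the max of the price, multiply it by 1.2, then take the log: that is the upper bound of our y_range. The factor of 1.2 makes the range a little higher than the max, because the sigmoid that squashes the output into y_range only approaches its upper bound, so without the margin the model could never quite predict the max.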
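A small sketch of the y_range computation in plain Python. The prices are invented, the lower bound of 0 is an assumption, and the sketch multiplies by 1.2 before taking the log. The `scale` helper is a hypothetical stand-in for the sigmoid squashing that a model applies to stay inside y_range.

```python
import math

# Hypothetical prices; the model's target is log(price).
prices = [100.0, 250.0, 400.0]

# Upper bound: log of 1.2 times the largest price.
max_log_y = math.log(max(prices) * 1.2)
y_range = (0.0, max_log_y)  # lower bound of 0 is an assumption of this sketch

def scale(raw):
    """Squash a raw activation into y_range with a sigmoid."""
    lo, hi = y_range
    return lo + (hi - lo) / (1.0 + math.exp(-raw))

# The sigmoid only approaches the upper bound, so the headroom from 1.2
# keeps log(max price) inside the range the model can reach.
pred = scale(5.0)
```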
For tabular data, the architecture is a simple neural network: a fully connected model.
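To show the shape of such a model, here is a forward pass written in NumPy. It is a sketch under stated assumptions, not fastai's implementation: the layer sizes and the data are made up, the weights are random rather than trained, and extras a real model would typically include (such as dropout or batch normalization) are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 3 rows, one categorical column (codes 0..4)
# and two continuous columns.
cat = np.array([1, 4, 0])
cont = rng.normal(size=(3, 2))

# Randomly initialised parameters; training would learn these.
emb = rng.normal(size=(5, 3))      # embedding matrix: 5 categories -> 3 dims
w1 = rng.normal(size=(3 + 2, 8))   # hidden layer weights
b1 = np.zeros(8)
w2 = rng.normal(size=(8, 1))       # output layer weights
b2 = np.zeros(1)

# Look up embeddings, concatenate with the continuous inputs,
# then apply two fully connected layers.
x = np.concatenate([emb[cat], cont], axis=1)
h = np.maximum(x @ w1 + b1, 0.0)   # ReLU
out = h @ w2 + b2                  # one regression output per row
```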