BoostedTreesClassifier missing value handling

User 3030 | 2/1/2016, 4:24:21 PM

I know rpart trees and gbm in R use surrogate splitting, and the xgboost library has its own complex imputation strategy. What strategy does GraphLab's implementation of boosted trees use to handle missing values? Also, if BoostedTreesClassifier unpacks a column of type dict (with numeric values), what value will be assigned to a row/observation where a key-value pair is missing from the unpacked column? Will it be assigned zero or None?

Comments

User 91 | 2/2/2016, 9:00:45 PM

GraphLab Create uses the following strategy (see figure below):

  • If a node contains missing values, we "learn" whether the missing values for data points in that split should go left or right (using information gain as the metric).

To learn whether the missing values should go left or right, we do the following:

  1. Compute the information gain if the missing values go left.
  2. Compute the information gain if the missing values go right.
  3. Pick the better of the two (left or right).

If the training data in the node does not contain any missing values, then we assume that missing values at that split go left (default-left). We need this default so that every split has a branch that missing values can follow, which guarantees predictions always work, even when the test data contains missing values the training data did not.
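The logic above can be sketched in plain Python. This is not GraphLab's actual implementation, just an illustration of the idea: for a candidate split, compare the information gain when the missing rows are sent left versus right, and fall back to default-left when the node has no missing rows. The function names and the `value < threshold` split form are assumptions for the sketch.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def info_gain(parent, left, right):
    """Information gain of splitting `parent` into `left` and `right`."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def choose_missing_direction(values, labels, threshold):
    """Decide whether rows with a missing feature value should follow the
    left or right branch of the split `value < threshold`.
    Returns ('left' or 'right', gain_of_that_choice)."""
    left  = [y for x, y in zip(values, labels) if x is not None and x < threshold]
    right = [y for x, y in zip(values, labels) if x is not None and x >= threshold]
    miss  = [y for x, y in zip(values, labels) if x is None]

    if not miss:
        # No missing values at this node: default-left, so the split
        # still has a branch for missing values seen at prediction time.
        return 'left', info_gain(labels, left, right)

    gain_left  = info_gain(labels, left + miss, right)   # Step 1: missing go left
    gain_right = info_gain(labels, left, right + miss)   # Step 2: missing go right
    if gain_left >= gain_right:                          # Step 3: pick the better
        return 'left', gain_left
    return 'right', gain_right
```

For example, with `values = [1, 2, None, None, 8, 9]`, `labels = [0, 0, 1, 1, 1, 1]`, and `threshold = 5`, sending the missing rows right yields the purer children, so the learned default direction is `'right'`.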