Using boosted_trees_classifier with High-Cardinality Categorical Features

User 2568 | 2/2/2016, 7:30:22 AM

I'm working on a multi-class classification problem that uses some high-cardinality categorical features. I wanted to get some insight into how boosted_trees_classifier treats these and what pre-processing might be needed.

The categorical features are:

- location: 929 classes in the training data, 1039 in the test data; 197 (19%) of the classes in the test data are not seen in the training data
- log_feature: 331 classes in training, 335 in test; 55 (16%) in test not found in training
- event_type: 49 classes in training, 53 in test; 4 (8%) in test not found in training
- resource_type: 10 classes in training, 10 in test; 0 (0%) not found in training
- severity_type: 5 classes

My question is: will boosted_trees_classifier handle these high-cardinality categorical features well, or do I need some pre-processing to get the best results, both for the large number of classes and for the classes that appear in the test set but not in the training set? If so:

- What is considered too many classes, i.e., more than ten?
- What pre-processing makes sense?

I read an interesting article called "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems", which proposes calculating an empirical Bayesian conditional probability for each class and using that as the feature. I'll probably try implementing this, but I'm interested to know whether this is necessary or whether it is already done inside boosted_trees_classifier.
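
For reference, here is a rough sketch of the kind of encoding I have in mind, written with pandas. The column names, the smoothing constant, and the fallback to the global mean are just illustrative assumptions, and it assumes a binary 0/1 target; for a multi-class problem you would build one such column per class.

```python
import pandas as pd

def target_encode(train, column, target, smoothing=20.0):
    # Smoothed estimate of P(target = 1 | category): blend each category's
    # observed target mean with the global mean, trusting the category's own
    # mean more as its count grows (empirical-Bayes style shrinkage).
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    return weight * stats['mean'] + (1.0 - weight) * global_mean

# Usage (categories that appear only in the test set fall back to the global mean):
# enc = target_encode(train, 'location', 'label')
# train['location_enc'] = train['location'].map(enc)
# test['location_enc'] = test['location'].map(enc).fillna(train['label'].mean())
```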

Comments

User 91 | 2/2/2016, 9:11:28 PM

This is a good question. Handling categorical variables with many distinct values is tricky. Bin counting is an approach that works well.

Here is the basic procedure:

  • Split the training set into two parts: a pre-training set and a model-training set.
  • For each categorical variable, count how many times each category occurs among the positive examples and among the negative examples in the pre-training set.
  • In the model-training set, replace each categorical variable with those "count" features, i.e., the per-category positive and negative counts computed from the pre-training set.

Using this process, each categorical variable is reduced to two count features. There are many variants of this; see this article for more details: http://blogs.technet.com/b/machinelearning/archive/2015/02/17/big-learning-made-easy-with-counts.aspx
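
To make the three steps concrete, here is a rough sketch of the procedure in pandas. The 50/50 split, the column names, and the assumption of a binary 0/1 target column are illustrative only, not a prescribed implementation.

```python
import pandas as pd

def bin_count_features(train, cat_col, target_col, split_frac=0.5, seed=0):
    # Step 1: split the training data into a pre-training set and a model-training set.
    pre = train.sample(frac=split_frac, random_state=seed)
    model = train.drop(pre.index).copy()

    # Step 2: in the pre-training set, count positive and negative examples
    # for each category of the variable.
    counts = pre.groupby(cat_col)[target_col].agg(['sum', 'count'])
    counts.columns = ['pos', 'n']
    counts['neg'] = counts['n'] - counts['pos']

    # Step 3: in the model-training set, replace the category with its two
    # count features; categories unseen in the pre-training set get zeros.
    model[cat_col + '_pos'] = model[cat_col].map(counts['pos']).fillna(0)
    model[cat_col + '_neg'] = model[cat_col].map(counts['neg']).fillna(0)
    return model.drop(columns=[cat_col]), counts

# At prediction time, apply the same counts table to the test set, so a category
# that only appears in the test data simply maps to (0, 0) instead of breaking the model.
```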


User 2568 | 2/3/2016, 1:54:04 AM

Thanks ... this is very similar to what I was proposing.


User 3242 | 2/24/2016, 10:42:07 AM

Ty


User 3252 | 2/27/2016, 11:36:48 PM

Hi Sri,

Could you please provide an example for Step 3 in your basic procedure? I appreciate your help.

- In the model-training set, replace each categorical variable with those "count" features, i.e., the per-category positive and negative counts computed from the pre-training set.


User 19 | 2/29/2016, 6:38:03 PM

Hi Ram,

We should be releasing a feature engineering object that helps you do Step 3. It should be available as early as April.

Cheers, Chris