User 2568 | 2/2/2016, 7:30:22 AM
I'm working on a multi-class classification problem that uses some high-cardinality categorical features. I'd like some insight into how boostedtreesclassifier treats these and what pre-processing might be needed.
The categorical features are:
- location: 929 classes in the training data, 1039 in the test data; 197 (19%) of the classes in the test data are not seen in the training data
- logfeature: 331 classes in training, 335 in test; 55 (16%) in test not found in training
- eventtype: 49 classes in training, 53 in test; 4 (8%) in test not found in training
- resourcetype: 10 classes in training, 10 in test; 0 (0%) not found in training
- severitytype: 5
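For reference, counts like the ones above can be produced with a few set operations. This is only a minimal sketch: the function name `unseen_summary` and the toy values are made up for illustration and are not the real data.

```python
def unseen_summary(train_values, test_values):
    """Count distinct classes and the test classes never seen in training."""
    train_classes = set(train_values)
    test_classes = set(test_values)
    unseen = test_classes - train_classes
    # Percentage is relative to the number of distinct test classes.
    pct = 100.0 * len(unseen) / len(test_classes) if test_classes else 0.0
    return {
        "n_train": len(train_classes),
        "n_test": len(test_classes),
        "n_unseen": len(unseen),
        "pct_unseen": pct,
    }


# Toy values, invented for illustration only.
train_location = ["loc_1", "loc_2", "loc_2", "loc_3"]
test_location = ["loc_2", "loc_3", "loc_4", "loc_5"]
summary = unseen_summary(train_location, test_location)
print(summary)
```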
My question is: will boostedtreesclassifier handle these high-cardinality categorical features well, or, to get the best results, do I need to do some preprocessing to deal with the large number of classes and/or the classes that appear in the test set but were not seen in the training set? If so:
- What is considered too many classes, e.g., more than ten?
- What pre-processing makes sense?
I read an interesting article called "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems" that proposes calculating an empirical Bayesian conditional probability for each class and using that as the feature. I'll probably try implementing this; however, I'm interested to know whether it is necessary, or whether something like it is already done inside boostedtreesclassifier?
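To make the idea concrete, here is a rough sketch of that kind of encoding for a multi-class target. It is an assumption-laden simplification, not the article's exact scheme: the shrinkage weight n / (n + m), the parameter `m`, and the function names `fit_target_encoding` and `transform` are all illustrative choices. Each category is replaced by one smoothed probability per target class, and categories unseen in training fall back to the overall class prior.

```python
from collections import Counter, defaultdict


def fit_target_encoding(categories, labels, m=10.0):
    """Learn smoothed P(label | category) from training data only.

    m is a hypothetical smoothing strength: a category seen n times puts
    weight n / (n + m) on its own frequencies and the rest on the prior.
    """
    classes = sorted(set(labels))
    label_counts = Counter(labels)
    prior = {c: label_counts[c] / len(labels) for c in classes}

    # Per-category counts of each target class.
    per_cat = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        per_cat[cat][lab] += 1

    encoding = {}
    for cat, counts in per_cat.items():
        n = sum(counts.values())
        weight = n / (n + m)
        encoding[cat] = {
            c: weight * counts[c] / n + (1 - weight) * prior[c]
            for c in classes
        }
    return encoding, prior


def transform(categories, encoding, prior):
    """Map each category to its probability vector; unseen ones get the prior."""
    return [encoding.get(cat, prior) for cat in categories]


# Toy values, invented for illustration only.
train_cats = ["a", "a", "a", "b"]
train_labels = [0, 0, 1, 2]
encoding, prior = fit_target_encoding(train_cats, train_labels, m=1.0)
encoded_test = transform(["a", "never_seen"], encoding, prior)
```

Because the encoding is computed from the target, fitting it on the same rows the model trains on can leak label information, so computing it out-of-fold is probably safer.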