How are missing data treated in the classifier toolkit?

User 862 | 10/21/2014, 10:31:58 PM

I really like that you can pass a dictionary column as a feature to the classifiers. I'm wondering, however, how missing data are treated. For example: feature = [{'a':10,'b':1},{'a':8,'b':2},{'b':2},{'a':10}], target = [0,1,1,0]

Cases 3 and 4 are missing predictors 'a' and 'b', respectively. Everything seems to work in this situation, but I'd love to know how it is handled and how it would compare to doing data imputation of some kind ahead of time.


User 91 | 10/21/2014, 11:41:27 PM

The classifier toolkit imputes each missing value with that feature's mean, computed on the training data. If an entire column is missing, each of its values is imputed with that same training mean. This guarantees a result regardless of what your data looks like, which is highly desirable when your model is running as a GraphLab predictive service.
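
To make the idea concrete, here is a minimal sketch in plain Python of mean imputation over a dictionary feature, using the example from the question. It is only an illustration, not the toolkit's actual code: the helper names `fit_feature_means` and `impute_with_means` are hypothetical, and the sketch assumes the mean of each key is taken over the training rows that contain that key.

```python
from collections import defaultdict


def fit_feature_means(rows):
    """Compute the mean of each key over the training rows that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        for key, value in row.items():
            totals[key] += value
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def impute_with_means(rows, means):
    """Return new rows in which every key seen in training is present.

    Missing keys are filled with the training mean. Keys that never
    appeared in training are dropped (an assumption of this sketch).
    """
    return [{key: row.get(key, mean) for key, mean in means.items()}
            for row in rows]


# Dictionary feature from the question; rows 3 and 4 each lack one key.
train = [{'a': 10, 'b': 1}, {'a': 8, 'b': 2}, {'b': 2}, {'a': 10}]

means = fit_feature_means(train)          # 'a' averaged over 3 rows, 'b' over 3 rows
filled = impute_with_means(train, means)  # rows 3 and 4 now carry both keys

# A row with nothing in it still gets a full set of values.
empty_row_filled = impute_with_means([{}], means)[0]
```

Because the means are stored at training time, a row that arrives with any subset of the keys (even none) can still be scored, which is the property described above.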

There are several other, more sophisticated ways to impute missing data. Using the training mean is simple and efficient. Other methods might be more accurate, depending on your data, but they could take a lot longer to impute the values.
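
If you want to compare against imputing ahead of time, one option is to fill the dictionaries yourself before training. The sketch below uses the per-key median as one example of a different statistic; `fit_feature_medians` is a hypothetical helper name, and the choice of the median is only an assumption for illustration, not a recommendation.

```python
from statistics import median


def fit_feature_medians(rows):
    """Collect the observed values of each key and take their median."""
    observed = {}
    for row in rows:
        for key, value in row.items():
            observed.setdefault(key, []).append(value)
    return {key: median(values) for key, values in observed.items()}


# Dictionary feature from the question.
train = [{'a': 10, 'b': 1}, {'a': 8, 'b': 2}, {'b': 2}, {'a': 10}]

medians = fit_feature_medians(train)

# Start from the medians and let each row's own values override them,
# so only the keys a row is missing end up imputed.
pre_imputed = [{**medians, **row} for row in train]
```

You could then train one model on the original column and one on the pre-imputed column, and compare their accuracy on held-out data to see whether the extra step helps for your data.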

Let us know your thoughts; we are happy to incorporate them into our product.