Smarter defaults for Graphlab algorithms

User 2032 | 1/28/2016, 9:49:44 PM

Hi guys,

While working with various algorithms I noticed that some of them have very suboptimal defaults. Especially for tree-based algorithms I do not understand the rationale behind the defaults, and I even consider some of them harmful (compared to having no defaults and asking the user to make the call themselves). Examples include the depth of trees, the number of trees in random forests, min_child_weight, and not setting class_weights to 'auto' by default. Some inspiration for what defaults to set can be found here. Additionally (for random forests) I would also add things like min_child_weight = k * (1 / minority_class_frequency) and make trees much deeper, so they can grow imbalanced trees that better separate the minority class in imbalanced datasets. We can talk some more about this if you like, you know how to reach me ;)
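A minimal sketch of the heuristic proposed above, assuming a hypothetical helper name and a scaling factor k chosen by the user (neither comes from any GraphLab API):

```python
from collections import Counter

def suggested_min_child_weight(labels, k=1.0):
    """Heuristic from the post: min_child_weight = k * (1 / minority_class_frequency)."""
    counts = Counter(labels)
    minority_frequency = min(counts.values()) / len(labels)
    return k / minority_frequency

# Toy imbalanced dataset: 90 negatives, 10 positives -> minority frequency 0.1
labels = [0] * 90 + [1] * 10
print(suggested_min_child_weight(labels))  # → 10.0
```

The idea is that the leaf-weight threshold scales with class imbalance: the rarer the minority class, the larger min_child_weight (and the deeper the trees) would need to be for splits to isolate it.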

Comments

User 1207 | 2/1/2016, 4:42:47 PM

Hello JohnnyM,

Thanks for the feedback -- your suggestions make sense. Our defaults are the way they are for essentially two reasons. First, internally we have a number of datasets, both private and public, that we use to choose defaults that are decent on almost everything and don't take an extremely long time to run. It's a balance of accuracy, run time, and ensuring there aren't common types of datasets out there where the defaults perform horribly. Certainly, on any individual dataset, you could get better results.

The second reason is backwards compatibility -- when we change the defaults or add better options, it can change the behavior of the algorithm, since many users just use the defaults. (This is what happened with the class_weights = 'auto' option.) When we release a version that breaks API compatibility, we'll revisit all of these defaults and take your suggestions into account :-).

Thanks! -- Hoyt