Discretization of data

User 2915 | 12/28/2015, 6:49:13 PM

I have a question, guys, about categorizing data! I believe that in sklearn everything is treated as numbers (ints) for any model (regression, random forest) you try. Okay, I see why you would want to convert string categories (like male/female) to 0 and 1. My questions are:

What if I have a column with 6 categories, but they are already numbers (from 1 to 6)? Do I need to create dummy variables that contain ONLY 0 or 1, or should I leave the column as it is?

Is categorical data coded as '1' and '2' the same as data coded as '0' and '1'? Or is the latter better, so that the model does not treat the codes as actual numbers? I'm confused.


User 1592 | 12/28/2015, 7:17:10 PM

Hi, please take a look at our documentation here: https://dato.com/learn/userguide/supervised-learning/linear-regression.html, which explains how categorical features are treated. In a nutshell, both strings and integers are automatically treated as categorical features and expanded to binary columns. This is equivalent to pandas get_dummies, just without having to call it explicitly, and of course much faster. If you would like to keep the integer features as numeric values, you can convert them to floats.
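
To make the two options concrete, here is a minimal sketch using plain pandas, since get_dummies is the comparison given above. It illustrates the idea only and does not show the toolkit's own API. The DataFrame and its column names (rating, gender) are made up for the example.

```python
import pandas as pd

# Hypothetical data: "rating" is a category coded 1..6, "gender" is a string category.
df = pd.DataFrame({
    "rating": [1, 2, 6, 3],
    "gender": ["male", "female", "female", "male"],
})

# Option 1: treat both columns as categorical and expand them to 0/1 columns.
# Each distinct value gets its own indicator column, so the model never sees
# the codes 1..6 as magnitudes.
dummies = pd.get_dummies(df, columns=["rating", "gender"], dtype=int)

# Option 2: keep "rating" as an ordinary numeric feature by casting it to float.
# This only makes sense if the order and spacing of the codes are meaningful.
numeric = df.assign(rating=df["rating"].astype(float))

print(dummies)
print(numeric.dtypes)
```

With dummy columns, it no longer matters whether the original codes were 1/2 or 0/1, because each category ends up as its own 0/1 indicator.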