A question about categorical features

User 6 | 11/25/2014, 6:13:52 AM

From BDR: How do I know whether my categorical variables are being treated as factors in the model? Are they automatically treated as categorical if they are of type str?

Comments

User 6 | 11/25/2014, 6:14:49 AM

Hi BDR,

This is true - make sure your categorical variables are strings. This can be done easily using <code class="CodeInline">data['variable'] = data['variable'].astype(str)</code>

For example, in linear regression,

<pre class="CodeBlock"><code>model = graphlab.linear_regression.create(data, target="ActualElapsedTime")
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 123959
PROGRESS: Number of features          : 28
PROGRESS: Number of unpacked features : 28
PROGRESS: Number of coefficients      : 20589</code></pre>

The data had 28 features, but they were expanded into 20589 coefficients because most of them were categorical variables.
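To illustrate why the coefficient count can be so much larger than the feature count, here is a rough sketch in plain Python (no GraphLab needed). It assumes that each numeric column contributes one coefficient, each string column contributes one coefficient per distinct value, and there is one intercept; the library's actual encoding may differ, for example by dropping a reference level. The function name and the `Carrier` and `Distance` columns are made up for the example.

```python
def count_coefficients(rows, target):
    """Estimate the coefficient count under a one-hot style expansion.

    Assumption (not confirmed by the library): one coefficient per numeric
    column, one per distinct value of each string column, plus an intercept.
    """
    columns = [c for c in rows[0] if c != target]
    total = 1  # intercept
    for col in columns:
        values = [row[col] for row in rows]
        if all(isinstance(v, str) for v in values):
            # String column: treated as categorical, one term per level.
            total += len(set(values))
        else:
            # Numeric column: a single coefficient.
            total += 1
    return total


# Hypothetical toy data; only the target name comes from the example above.
rows = [
    {"Carrier": "AA", "Distance": 300, "ActualElapsedTime": 60},
    {"Carrier": "UA", "Distance": 800, "ActualElapsedTime": 130},
    {"Carrier": "DL", "Distance": 500, "ActualElapsedTime": 95},
]

print(count_coefficients(rows, "ActualElapsedTime"))
```

With many string columns that each have hundreds of distinct values, the total quickly reaches the thousands, which is consistent with the progress output above.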


User 3252 | 3/17/2016, 3:50:45 PM

Hello Danny,

It does not seem to work with random_forest classifier.

I tried it with the UCI dataset at archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

I converted some numeric predictors to strings as follows:

<pre class="CodeBlock"><code>data['PAY_0'] = data['PAY_0'].astype(str)
data['PAY_2'] = data['PAY_2'].astype(str)
data['PAY_3'] = data['PAY_3'].astype(str)
data['PAY_4'] = data['PAY_4'].astype(str)
data['PAY_5'] = data['PAY_5'].astype(str)
data['PAY_6'] = data['PAY_6'].astype(str)</code></pre>

When I created a randomforest classifier to predict 'default', it showed the following:

<pre class="CodeBlock"><code>Number of examples          : 24031
Number of classes           : 2
Number of feature columns   : 23
Number of unpacked features : 23</code></pre>

Note: It does not show the number of coefficients. What is the reason for not showing it? Could you explain? Thank you.


User 3252 | 3/17/2016, 3:54:59 PM

Correction: I expected it to show the number of predictors as done by the caret package in R.


User 1190 | 3/17/2016, 5:22:47 PM

Thank you for your feedback. We will take your feature request.


User 3252 | 3/17/2016, 6:07:03 PM

Hi Jay,

In the current version, is there a way to confirm whether the randomforest model has used the variables as categorical predictors?


User 1190 | 3/17/2016, 7:50:56 PM

Integer-, float-, and array-typed columns are treated as numeric predictors; all other columns are treated as categorical. To use an integer-typed column as a categorical predictor, you need to cast it to a str-typed column with "sf['x'] = sf['x'].astype(str)".
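As a quick way to reason about that rule before training, here is a small plain-Python sketch. It only mimics the rule as described (int, float, and array are numeric, everything else is categorical); it does not call the library, and `predictor_kind` and `cast_to_str` are hypothetical helper names, not part of any API.

```python
import array

# Types described above as numeric predictors (assumption: array-typed
# columns correspond to Python's array.array).
NUMERIC_TYPES = (int, float, array.array)


def predictor_kind(column_type):
    """Return how a column of this type would be treated under the stated rule."""
    return "numeric" if issubclass(column_type, NUMERIC_TYPES) else "categorical"


def cast_to_str(values):
    """Plain-list stand-in for sf['x'].astype(str)."""
    return [str(v) for v in values]


pay_0 = [2, -1, 0, 0]                  # integer column: numeric predictor
print(predictor_kind(type(pay_0[0])))

pay_0 = cast_to_str(pay_0)             # after the cast: categorical predictor
print(predictor_kind(type(pay_0[0])))
```

On a real dataset, checking the column type after the cast is the analogous step to confirm the column is now str typed.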

To confirm which predictors the randomforest model used, you can inspect the model with model.get_feature_importance() or model['trees_json'][0].


User 3252 | 3/17/2016, 9:37:28 PM

Thank you. That answers my question.