Label Encoder

User 984 | 12/3/2014, 9:49:22 PM

I may not be aware of an existing feature that allows this, but since many machine learning algorithms only accept numeric values, it would be nice to have some functionality similar to scikit-learn's LabelEncoder. This would allow interoperability between algorithms that take categorical variables (as strings) and those that take numeric variables (as levels/factors such as 0, 1, 2, ..., n).


User 398 | 12/3/2014, 10:55:00 PM

Hi Bill. Presently, our toolkits perform this label encoding for you automatically. In the case of string labels, this is done by sorting the unique label values alphabetically, and then mapping to integers. (See the description of the target parameter in the <a href="">boosted trees</a> API documentation, for example.) Is there additional functionality that you'd like to see us expose?
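
For illustration, the mapping described above (sort the unique labels alphabetically, then assign integers) can be sketched in plain Python. This is only a sketch of the idea, not the toolkit's internal code; the helper name `encode_labels` is hypothetical, and starting the integers at 0 is an assumption.

```python
def encode_labels(labels):
    """Map string labels to integers by alphabetical order of the unique values.

    Hypothetical helper for illustration; not part of GraphLab's API.
    """
    classes = sorted(set(labels))  # unique labels, alphabetical
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


encoded, mapping = encode_labels(["dog", "cat", "bird", "cat"])
print(encoded)   # integer code for each input label
print(mapping)   # label -> integer lookup table
```

Keeping the `mapping` dictionary around makes it possible to translate predicted integers back to the original string labels.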

User 984 | 12/4/2014, 1:04:11 AM

Ahh, turns out I was interpreting the error message incorrectly. It was complaining about non-numeric labels, not non-numeric features. Though I am now faced with this error:

<pre class="CodeBlock"><code>graphlab.toolkits._main.ToolkitError: Neuralnet supports only one image typed column, or multiple int, float or array typed columns. </code></pre>

I double-checked, and all of my columns are of type <code class="CodeInline">str</code>. Does GraphLab also do that automatic encoding on features as well as labels? In this case I would need a 1-of-n ("one-hot") encoding, so that I'm not implying "b" is larger than "a" just because "a" is encoded as 1 and "b" as 2.
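
To make the distinction concrete, here is a rough sketch of a 1-of-n encoding in plain Python. The function name `one_hot` is made up for this example, and ordering the categories alphabetically is an assumption; the point is only that each category gets its own 0/1 column, so no ordering between "a" and "b" is implied.

```python
def one_hot(values):
    """Return one 0/1 indicator row per input value, plus the column order.

    Hypothetical helper for illustration only.
    """
    categories = sorted(set(values))  # one output column per category
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return rows, categories


rows, categories = one_hot(["a", "b", "a", "c"])
print(categories)
for row in rows:
    print(row)
```

Every row sums to 1, and the encoded vectors for any two different categories are equally far apart, which is what an integer encoding cannot guarantee.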

User 398 | 12/4/2014, 1:12:39 AM

Ah, yes. At the moment, the neural network classifier toolkit is an exception; it doesn't yet handle the label encoding for you. This will be fixed in our next release (v1.2).

User 984 | 12/4/2014, 3:21:54 AM

Simple workaround until then:

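<pre class="CodeBlock"><code>import pandas as pd
import graphlab as gl

# Load the SFrame, convert it to a pandas DataFrame,
# one-hot encode the string columns, and convert back to an SFrame.
train = gl.load_sframe("...")
encoded_features = gl.SFrame(pd.get_dummies(train.to_dataframe()))
</code></pre>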