Column types and unusable classifiers

User 1273 | 3/24/2015, 3:59:44 PM

I have the following problem: I load a training CSV into an SFrame, train a classifier on it, and save the classifier. Later I load the saved classifier and load a test CSV into another SFrame. However, the test SFrame ends up with different column types than the training one, so the model cannot be used on it. I know I can set the types manually, but I no longer have access to the training CSV, only to the saved model. How can I solve this? (The training CSV has mixed types: float, str and int.)


User 398 | 3/24/2015, 4:41:43 PM

Hi Vince,

Let me make sure I understand. You have trained a classifier with one data set, but no longer have access to that SFrame. You would like to evaluate the classifier's performance on a different data set. What version of GraphLab Create are you using? What kind of classifier is it? You can get a list of column names that were used as features to train the model with the model's get method:

<code class="CodeInline">model.get("features")</code>

Similarly, you can get the name of the column used as the target label with the get method:

<code class="CodeInline">model.get("target")</code>

Once you have those, you'll want to make sure that you have the corresponding columns in your test set. Then you evaluate your model on your test set using the model's evaluate method:

<code class="CodeInline">model.evaluate(test_dataset)</code>

Be warned: if you can't perform identical feature engineering steps on your test data (because you don't know what those steps were at training time), you may find that the classifier doesn't perform as well as it should on your test data.
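As a rough illustration of the column check described above, here is a minimal sketch in plain Python. The helper `missing_columns` and the example column names are hypothetical; in practice the two lists would come from `model.get("features")`, `model.get("target")` and the test SFrame's column names (assuming `test_dataset.column_names()` is available in your version).

```python
def missing_columns(required, available):
    """Return the required column names that are absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Hypothetical values standing in for model.get("features") and model.get("target").
features = ["abc", "age", "income"]
target = "label"

# Hypothetical stand-in for the column names of the test SFrame.
test_columns = ["abc", "income", "label"]

# Check features and target together before calling model.evaluate(test_dataset).
missing = missing_columns(features + [target], test_columns)
if missing:
    print("Test set is missing columns:", missing)
```

This only checks that the names are present; it says nothing about whether the column types match those used in training.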

User 1273 | 3/24/2015, 4:57:28 PM

No, that's not the problem. I know the column names as well as the target column.

When I try to classify I get the error 'the column abc cannot be treated as categorical', i.e. that column has a different column type than the one used in training. I don't know how to find the column types used in training (or how to solve the problem some other way), because I don't have access to the training CSV.

I have GL 1.2.1, and I have tested a boosted trees model, a regressor and even the general classifier.create; they all give the same problem (because SFrame.read_csv is inferring the column types incorrectly, or at least differently between train and test).

User 398 | 3/24/2015, 6:21:55 PM

Hi Vince,

Unfortunately, there is no direct way to get the types of the columns from the training set without access to that original SFrame. However, you can inspect the coefficients of your model like so:

<code class="CodeInline">model.get("coefficients")</code>

Any column from your training data set with non-None values in the index column in the model's coefficients SFrame is a categorical variable. For example, I created a dummy logistic classifier using a dictionary column of word counts as a feature, and here is the corresponding coefficients SFrame:

<pre class="CodeBlock"><code>In [72]: model.get("coefficients").print_rows(num_rows=40)
+-------------+------------+-------+-------------------+
| name        | index      | class | value             |
+-------------+------------+-------+-------------------+
| (intercept) | None       | 1     | 0.834368231673    |
| wc          | account    | 1     | 0.827439235845    |
| wc          | download   | 1     | 0.827439235845    |
| wc          | million    | 1     | 0.827439235845    |
| wc          | dollars    | 1     | 0.827439235845    |
| wc          | wire       | 1     | 0.827439235845    |</code></pre>

This might help you get your test data into an analogous form. Let me know.
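To make the idea concrete, here is a minimal sketch in plain Python of how the coefficients could be scanned for features with non-None index values. The helper `categorical_features` and the example rows are made up for illustration; the rows are assumed to be dictionaries with `name` and `index` keys, which is what iterating over the coefficients SFrame should yield. Mapping every such feature to `str` is also an assumption: it would be wrong for a dictionary column such as `wc` above, and the coefficients cannot tell you whether a numeric column was int or float.

```python
def categorical_features(coefficient_rows):
    """Return the names of features that have at least one non-None index value."""
    names = set()
    for row in coefficient_rows:
        if row["index"] is not None:
            names.add(row["name"])
    return names


# Made-up rows in the shape of the coefficients table (not real model output).
rows = [
    {"name": "(intercept)", "index": None, "class": 1, "value": 0.5},
    {"name": "abc", "index": "red", "class": 1, "value": 0.1},
    {"name": "abc", "index": "blue", "class": 1, "value": -0.2},
    {"name": "age", "index": None, "class": 1, "value": 0.03},
]

# Assumption: plain categorical features were string columns at training time.
type_hints = {name: str for name in sorted(categorical_features(rows))}
print(type_hints)
```

A dictionary like `type_hints` could then be passed as the `column_type_hints` argument of `SFrame.read_csv` when loading the test CSV, assuming your version supports that argument, so that those columns are read as strings instead of being inferred.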