Number of random evaluations per sample with NeuralNetClassifier.evaluate

User 1478 | 3/15/2015, 9:34:45 PM

Does NeuralNetClassifier.evaluate only predict once per test sample, or does it average multiple predictions e.g. when random_crop=True?

What is the meaning of recall@1 and recall@n, which are optional values of the metric parameter for NeuralNetClassifier.evaluate?

Comments

User 940 | 3/16/2015, 5:35:18 AM

Hi Unas!

NeuralNetClassifier only predicts once per sample, though averaging a few predictions would certainly improve accuracy. I'll put that one on the list.

recall@1 is simply accuracy, and recall@n is the fraction of the time the correct answer is in the top n categories.
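To make that concrete, here is a rough sketch of recall@n computed by hand (not the library's actual implementation; it assumes predict_topk returns an SFrame with 'row_id', 'class', and 'score' columns, and that the ground-truth labels are in a 'label' column):

<pre>
def recall_at_n(model, sf, n, label_column='label'):
    """Fraction of rows whose true label appears in the top-n predictions."""
    topk = model.predict_topk(sf, k=n)
    hits = 0
    for row_id, truth in enumerate(sf[label_column]):
        # Classes predicted for this row among the top n.
        top_classes = topk[topk['row_id'] == row_id]['class']
        if truth in top_classes:
            hits += 1
    return hits / float(sf.num_rows())
</pre>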

Cheers! -Piotr


User 1319 | 3/16/2015, 8:59:54 PM

Hi @piotr ,

I've been experimenting with NeuralNetClassifier.classify and NeuralNetClassifier.predict (for image classification). The same model produces significantly different results when classify or predict is called several times on the same dataset.

As you know, this is a serious issue in terms of reproducible research, especially if the results are to be reproduced by another party.

Q1 - I understand you have this on your list, if I may ask, when do you expect this issue will be fixed?

In the meantime, I am thinking about writing a method to fuse several classify calls of the same model (or of several models trained for different numbers of iterations). I'm considering two approaches:

1. Calling classify (or predict) an odd number of times, then using majority voting to select the class.
2. Calling classify several times and selecting the class with the maximum average probability (score).
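For what it's worth, here is a rough sketch of the first idea (the names are illustrative; it assumes model.predict(sf) returns one class label per row):

<pre>
from collections import Counter

def voted_prediction(model, sf, num_runs=5):
    """Majority vote over several (currently non-deterministic) predict calls."""
    runs = [list(model.predict(sf)) for _ in range(num_runs)]
    voted = []
    for labels in zip(*runs):  # labels for one row across all runs
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
</pre>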

Q2 - Any advice on which approach might produce robust results?

Q3 - Calling extract_features several times will produce different features for the same dataset, right?

Cheers,

Tarek


User 940 | 3/18/2015, 6:18:28 AM

Hi @tabdunab ,

Q1 is really a two-part question. You're right, non-reproducibility is a serious issue. This will be fixed in the next release by taking one crop at the center. This is something I was unaware of; thanks for bringing it up.

The follow-up is whether we should take several crops. That would certainly increase accuracy, but it would also decrease prediction-time throughput. In any case, this too would have to be deterministic, probably a center crop plus corner crops. This is something we will consider adding to the roadmap.

Q2. Averaging should be more robust. It's possible that each prediction yields a different class, in which case voting would fail.

Q3. If the bug persists in extract_features as well, then yes. I will investigate this.

For now, it may be possible to downgrade the GraphLab Create version. I will investigate this further.

-Piotr


User 940 | 3/18/2015, 7:55:34 AM

Hi @tabdunab,

After some investigation, this is what I've found.

In regard to Q3: yes, extracted features are non-deterministic as well.

For prediction purposes, the most robust thing to do is average. In fact, this should even provide better results than just the center crop. Here's a code snippet to do that:

<pre>
def averaged_prediction(model, sf, num_classes, num_samples=10):
    """ Average out predictions based on random crops and random mirror. """
    prob = model.predict_topk(sf, k=num_classes).sort(['row_id', 'class'])
    for i in range(num_samples - 1):
        print "Making prediction : %s" % i
        prob['score'] = prob['score'] + \
            model.predict_topk(sf, k=num_classes).sort(['row_id', 'class'])['score']
    prob['score'] = prob['score'] / (num_samples * 1.0)
    return prob
</pre>

The issue was not present in GraphLab Create 1.1, but there was also no support for string target types, so you would have to enumerate your targets.
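If you do go down that path, enumerating string targets is just a matter of mapping each distinct label to an integer, along these lines (the 'label' column name is illustrative):

<pre>
# Map string labels to integers so the older release's integer-only
# target requirement is satisfied.
classes = sorted(sf['label'].unique())
class_to_int = {c: i for i, c in enumerate(classes)}
sf['label'] = sf['label'].apply(lambda c: class_to_int[c])
</pre>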

Sorry for the trouble.

Cheers! -Piotr


User 1319 | 3/18/2015, 3:13:29 PM

Thanks a lot @piotr for your time and the code snippet.

Cheers! Tarek


User 940 | 4/15/2015, 1:34:37 AM

Hi @tabdunab

I just wanted to touch base about this issue. Are you still blocked on it, or has the workaround worked for you?

Cheers! -Piotr