Logistic Classifier - Toolkit error

User 2574 | 11/22/2015, 7:59:30 PM

If I was able to create a logistic model using the training data, then I should be able to evaluate the model using the test data. However, it throws an error. Both the training and test data are sourced from the main data set using random_split().

Using GraphLab Create ver 1.7.1, I built a model using logistic_classifier.create. Here are the steps:

train_data, test_data = newTrain.random_split(0.8, seed=9)
model = gl.logistic_classifier.create(train_data, target='Type', features=['__feature columns__'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set ``validation_set=None`` to disable validation tracking.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 72500
PROGRESS: Number of classes           : 38
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 10277
PROGRESS: Number of coefficients    : 380434
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000014  | 5.587978     | 0.392993          | 0.341675            |
PROGRESS: | 2         | 5        | 1.000000  | 110.070162   | 0.525434          | 0.444747            |
PROGRESS: | 3         | 6        | 1.000000  | 214.139210   | 0.645834          | 0.538404            |
PROGRESS: | 4         | 7        | 1.000000  | 314.223501   | 0.673903          | 0.550545            |
PROGRESS: | 5         | 8        | 1.000000  | 419.494436   | 0.720621          | 0.578295            |
PROGRESS: | 6         | 9        | 1.000000  | 521.120737   | 0.738607          | 0.589941            |
PROGRESS: | 7         | 10       | 1.000000  | 626.249446   | 0.756221          | 0.601586            |
PROGRESS: | 8         | 11       | 1.000000  | 728.071896   | 0.788828          | 0.607284            |
PROGRESS: | 9         | 12       | 1.000000  | 831.920440   | 0.836317          | 0.617443            |
PROGRESS: | 10        | 13       | 1.000000  | 938.856761   | 0.851379          | 0.610505            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: TERMINATED: Iteration limit reached.
PROGRESS: This model may not be optimal. To improve it, consider increasing `max_iterations`.

model

Class                         : LogisticClassifier
Schema
------
Number of coefficients        : 380434
Number of examples            : 72500
Number of classes             : 38
Number of feature columns     : 4
Number of unpacked features   : 10277

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 10
Solver status                 : TERMINATED: Iteration limit reached.
Training time (sec)           : 1044.7124

Settings
--------
Log-likelihood                : 34362.9595

Highest Positive Coefficients
-----------------------------
aggMidLevel[578.COOK AND DINE]: 27.2847
aggMidLevel[510.COOK AND DINE]: 26.6002
aggMidLevel[5500.TOYS]        : 26.4404
aggMidLevel[8104.SPORTING GOODS]: 25.4658
aggMidLevel[7838.ELECTRONICS] : 24.9053

Lowest Negative Coefficients
----------------------------
aggMidLevel[5500.TOYS]        : -17.4981
aggMidLevel[6844.INFANT APPAREL]: -16.4609
aggMidLevel[510.COOK AND DINE]: -15.6574
aggMidLevel[113.LAWN AND GARDEN]: -15.2366
aggMidLevel[6844.INFANT APPAREL]: -14.8313

results = model.evaluate(test_data)

[ERROR] Toolkit error: Prediction scores/probabilit…

Comments

User 2574 | 11/23/2015, 9:19:22 PM

Hi Brian, I will try with model.predict. I most likely do not have missing values, unless it interprets the string 'NULL' as a missing value. I also wanted to let you know that when I build a model with the boosted trees classifier, model.evaluate does not throw an error. For the moment it seems to happen only with the logistic classifier. I am using a seed, so both classifier models get the same training and test sets.
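For reference, the comparison described above as a sketch (same split as earlier; the boosted trees call mirrors the logistic one and the 'Type' target follows the first post):

# Sketch: same train/test split, boosted trees instead of logistic
# regression; evaluate() reportedly succeeds here.
bt_model = gl.boosted_trees_classifier.create(train_data, target='Type')
bt_results = bt_model.evaluate(test_data)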


User 91 | 11/24/2015, 8:38:56 AM

The string 'NULL' does not get interpreted as a missing value. Another possibility is that the output of the predictions for logistic regression contains NaN (i.e., not-a-number) values.
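A quick way to test that hypothesis, as a sketch against the GraphLab Create 1.7.x API (the 'probability_vector' output type for multiclass models is assumed here):

import math

# Flag test rows whose predicted probability vector contains NaN,
# which evaluate() cannot handle.
probs = model.predict(test_data, output_type='probability_vector')
has_nan = probs.apply(lambda vec: any(math.isnan(p) for p in vec))
print(has_nan.sum())  # number of test rows whose scores contain NaN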

Could you share the data you used for this example so we can try to reproduce and diagnose the issue? We would appreciate it.


User 2574 | 11/25/2015, 8:19:30 PM

It won't let me upload the data file (too large). Any recommendations?


User 1592 | 11/26/2015, 5:13:08 PM

Please send us the data at contact@dato.com, or open a Dropbox folder and send us the link.


User 2574 | 11/30/2015, 5:52:12 AM

Please see attached.


User 2450 | 1/4/2016, 6:53:57 AM

I'm having the same problem. I reproduced it with the data attached by Javier, using GraphLab Create ver 1.7.1.

data.csv was in Javier's data.rar:

newTrain = graphlab.SFrame('data.csv')

I only used the columns that looked relevant:

newTrain = newTrain[['TripType', 'aggDeptPos', 'aggDeptNeg', 'aggMidLevel', 'aggDeptPosTfidf']]

Split train/test, then create a classifier:

train_data, test_data = newTrain.random_split(0.8, seed=9)
model = gl.logistic_classifier.create(train_data, target='TripType')

This resulted in a slightly different model from Javier's, but mostly similar.

And when I run model.evaluate, I get the same error:

results = model.evaluate(test_data)

When I looked at the actual prediction probabilities with the following code, I actually got NEGATIVE probabilities:

prediction_top38 = model.predict_topk(test_data, k=38)
prediction_top38.sort('probability', ascending=True).print_rows(num_rows=200)

+-------+-------+--------------------+
|   id  | class |     probability    |
+-------+-------+--------------------+
|  7787 |   3   |  -4.4408920985e-16 |
| 14034 |   3   |  -4.4408920985e-16 |
| 13966 |   3   | -2.22044604925e-16 |
|  4574 |   3   | -2.22044604925e-16 |
|  8994 |   3   | -2.22044604925e-16 |
|  4241 |   3   | -2.22044604925e-16 |
| 12834 |   3   | -2.22044604925e-16 |
|  3767 |   3   | -2.22044604925e-16 |
| 12858 |   3   | -2.22044604925e-16 |
| 18115 |   3   | -2.22044604925e-16 |
|  3469 |   3   | -2.22044604925e-16 |
| 14402 |   3   | -2.22044604925e-16 |
+-------+-------+--------------------+
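To see how widespread this is, a quick sketch using standard SFrame boolean filtering on the predict_topk output above:

# Count the top-k prediction rows whose probability is negative.
negative = prediction_top38[prediction_top38['probability'] < 0]
print(negative.num_rows())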

These negative probabilities are all on the order of 1e-16. Is it possible that the probability of the Nth category is computed as 1 - (sum of the probabilities of the 1st through (N-1)th categories), instead of directly from the usual softmax output? If so, could floating-point rounding have produced these negative probabilities?
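The magnitudes are consistent with that hypothesis. A minimal plain-Python illustration of the arithmetic (not GraphLab internals, just IEEE-754 doubles):

# In exact arithmetic 0.3 - (0.1 + 0.2) is 0, but with IEEE-754 doubles
# the residual is a tiny negative number, the same order of magnitude as
# the probabilities printed above.
a, b, total = 0.1, 0.2, 0.3
print(total - (a + b))  # -5.551115123125783e-17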


User 91 | 1/4/2016, 5:39:19 PM

It looks like this might be a rounding error. Thanks for pointing it out, we will fix it in the next release.