Graphlab BoostedTreesRegression returns inconsistent JSON trees

User 2124 | 2/29/2016, 11:36:39 AM

Hi,

We are using the Graphlab commercial version to train our BoostedTreesRegression model on a batch server. We then export the JSON trees of the trained model (using the BoostedTreesRegression.get('trees_json') method) to a text file and move it to another server (say, the Web server) on which we can only run Java/Scala (not Python or C++). To use this model on the Web server, we must re-implement the BoostedTreesRegression.predict method in Scala. The implementation simply reads the JSON trees, parses them, and then searches these trees for the leaves corresponding to the input data. Finally, we sum up all the values at the leaves and add 0.5 to get the final predicted value.

The above procedure works well for about 80% of the test data (it yields nearly identical results between our own Scala implementation of the predict method and the original Graphlab predict method in Python). However, for about 20% of the test data, there are very large differences between Scala and Python. These differences do not seem to come from floating-point differences between Scala and Python; they seem to come from the JSON trees themselves (e.g., the Python BoostedTreesRegression.predict method gives 0.48 for a predicted value, but in Scala we get 0.18). This behavior appears more frequently if we increase the number of trees (the max_iterations parameter, e.g., 100) and the depth (the max_depth parameter, e.g., 10).

Here we attach a binary exported model with max_depth = 4 and max_iterations = 100. We also attach an SFrame that contains one entry that yields the difference between Python and Scala. In Python, you can load the model and the SFrame into Graphlab like this:

```python
import graphlab as gl

model = gl.load_model('e1618.model')
sf = gl.load_sframe('data1618.sframe')
pred = model.predict(sf)  # get the predicted result
json_array = model.get('trees_json')
```
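For concreteness, the traversal logic our Scala implementation follows can be sketched in Python roughly like this (a minimal sketch, not our actual code; the vertex/edge field names and the categorical-split semantics are assumed from the JSON dump format and may differ between Graphlab versions):

```python
import json

def predict_from_json_trees(trees_json, row, base_score=0.5):
    """Sum each tree's leaf value for `row`, then add the 0.5 base score.

    Assumes every tree is a dict with "vertices" (id/name/type/value) and
    "edges" (src/dst/value "yes"/"no"); these field names come from the
    JSON dump format and may differ in other Graphlab versions.
    """
    total = base_score
    for tree in trees_json:
        if isinstance(tree, str):
            tree = json.loads(tree)
        nodes = {v["id"]: v for v in tree["vertices"]}
        children = {}  # children[src] = {"yes": child_if_true, "no": child_if_false}
        for e in tree["edges"]:
            children.setdefault(e["src"], {})[e["value"]] = e["dst"]
        node_id = 0  # the root is node 0
        while node_id in children:  # internal nodes have outgoing edges
            node = nodes[node_id]
            if node["type"] in ("integer", "float"):
                take_yes = row[node["name"]] < node["value"]
            else:  # assumed: categorical split tests feature equality
                take_yes = row[node["name"]] == node["value"]
            node_id = children[node_id]["yes" if take_yes else "no"]
        total += float(nodes[node_id]["value"])  # leaf score
    return total
```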

(Our Graphlab version is 1.8.1)

We traced these trees and computed the predicted value ourselves, which is 0.03910099999999994, but model.predict gave 0.059862732887268066 (in this case we only have depth = 4, so the difference is still small, but it is significant). We also have the Python code to parse the JSON trees, but we cannot paste it here because it is quite complicated.

Could anyone please look into this and show us what happens with our method? Any help is greatly appreciated!

-- DefRet

Comments

User 19 | 2/29/2016, 7:01:29 PM

Hi DefRet,

Thanks for reporting this. Is there any chance you could include the Python code that you're using to parse the trees, as well as the Scala code you're using for prediction? This will help us debug the discrepancy.

Thanks, Chris


User 2124 | 3/1/2016, 2:05:03 AM

Hi Chris,

Thank you for your reply. I have refactored my code to extract the relevant parts and added some tests to it. Please find the Python code in the file gbdtdecoder.py in the attached zip file. You can run this file directly to see the problem:

```
python gbdtdecoder.py
```

(it requires only Graphlab and the json module). I think it is enough to check only the Python code; we don't need to check the Scala code here, because the discrepancies are clearly visible even with the Python code alone.

First, please see the two simple tests in the first two rows of the gbdt_decoder_main function:

```python
gbdt_decoder_test('e1618.model', 'data1618.sframe')
gbdt_decoder_test('e1692.model', 'data1692.sframe')
```

These yield two discrepancies:

```
GBDT predict : [0.059862732887268066]
Our predict  : [0.03910099999999994]

GBDT predict : [0.28399257361888885]
Our predict  : [0.1268450000000001]
```

(you can see a large discrepancy in the second model).

If you want to see the traces when we calculate the scores, please modify line 302 to:

```python
Config.is_debug = 1
```

If is_debug is set to 1, you can see traces like this:

```
=====================
0:detected_vertical=v805=1,2,1
1:detected_vertical=v1231=1,4,3
3:detected_vertical=v1381=1,6,5
5:recency<2.000000,9,10
```

This means that at node 0 (the root node), we have two branches (node 2 on the left and node 1 on the right). The condition is detected_vertical == 'v805' (this must be true to take the left branch of the tree). Similarly, at node 5, if recency is smaller than 2.0 then we take the left branch (go to node 9); otherwise we take the right branch.
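To make the trace format concrete, a line such as `5:recency<2.000000,9,10` can be split into its parts with a small helper like this (a hypothetical sketch; `parse_trace_line` is not part of our code):

```python
import re

def parse_trace_line(line):
    """Parse one debug-trace line of the form
    '<id>:<feature><op><value>,<yes_child>,<no_child>'.
    The first child listed is the one taken when the condition holds.
    Categorical lines like '0:detected_vertical=v805=1,2,1' keep the
    trailing '=1' inside the value field; this sketch does not split it.
    """
    node_id, rest = line.split(":", 1)
    cond, yes_child, no_child = rest.rsplit(",", 2)
    feature, op, value = re.match(r"(.+?)(<|=)(.+)", cond).groups()
    return {"id": int(node_id), "feature": feature, "op": op,
            "value": value, "yes": int(yes_child), "no": int(no_child)}
```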

Finally, the error_rate_test function actually predicts 6412 records and calculates the error rate (the number of records with a discrepancy). Here is the result on my machine:

Total = 6412, error = 3, error rate = 0.000468

Thanks, DefRet.


User 1190 | 3/2/2016, 12:02:48 AM

Hi DefRet,

Can you check whether the 20% of your data contains a lot of missing values? Also, is it possible to compare against a single-tree model (e.g. max_iterations=1) and find an example for which the prediction differs considerably?

Thanks, -jay


User 2124 | 3/2/2016, 1:31:06 AM

Hi Jay,

Before feeding the data into the training/testing phase, we use SFrame.fillna on some fields and then SFrame.dropna for all other fields, so we believe the train/test data does not contain any missing values (before fillna it did, but after fillna the values should be filled with empty strings).

We tried to create a simple tree model with discrepancies but could not find such an example. As we reported in the previous post, the discrepancies appear when we increase the number of trees and the max_depth. With max_depth = 4, the minimum value of max_iterations that produces some discrepancies is 100. With max_depth = 9, the minimum value of max_iterations is 10.

Anyway, I wonder why Graphlab does not provide a method to load an exported JSON model (currently, Graphlab only has the graphlab.load_model method for loading a binary model). Could you provide a reference implementation of loading (and parsing) JSON tree models, as well as using the loaded model for prediction? Is my implementation algorithmically valid?

Thanks, DefRet


User 1190 | 3/2/2016, 3:01:03 AM

Hi Defret,

Your implementation looks valid; I didn't spot anything obviously wrong on a first pass. As you know, debugging code by staring at it is probably not going to work. The key is to find out under what conditions the bug triggers, and gradually narrow the space. Since you said it works on 80% of the data, naturally we want to find out what's in the 20% that makes it different. Can we take it further and get a noticeable discrepancy on a single example? It shouldn't be hard to write a script to find the example that causes the biggest difference. As for the model, it's better to debug with a single tree than a forest, so depth=9 with 10 iterations is easier than 100 iterations.
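The search for the worst example only takes a few lines. Given the two sets of predictions as plain Python lists, a hypothetical helper like this finds the row with the largest disagreement:

```python
def find_worst_row(official_preds, decoded_preds):
    """Return (row_index, abs_difference) for the largest disagreement
    between the built-in predictions and the JSON-decoder predictions.
    A hypothetical helper: feed it model.predict(...) and your decoder's
    output, each converted to a plain list of floats.
    """
    return max(
        ((i, abs(a - b))
         for i, (a, b) in enumerate(zip(official_preds, decoded_preds))),
        key=lambda pair: pair[1],
    )
```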

The main purpose of the JSON dump is visualization, so we don't have a reference implementation to load it back. However, we are working on some new features for model exporting. Please stay tuned.

Thanks, -jay


User 2124 | 3/2/2016, 7:19:33 AM

Hi Jay,

Actually, I have already provided an example that produces a large difference in the post above (please download the attached Python code in that post and run it once to see the difference):

```
GBDT predict : [0.28399257361888885]
Our predict  : [0.1268450000000001]
```

That model has a max_depth of 7 and max_iterations of 100. I will create an example model with max_iterations of 10 and max_depth of 9 and send it to you soon.


User 2124 | 3/2/2016, 10:10:06 AM

Hi Jay,

I have created the example you requested (max_depth = 9, max_iterations = 10). Please find in the attached file (gbdtdecoder3916.zip) the model, the SFrame, and the code you want. We also provide an IPython notebook so you can easily check the results.

Below are the details:
- Train data: traindata3916.sframe
- Test data: testdata3916.sframe
- Model generated on my machine: e3916.model
- The SFrame that contains only one entry with a big discrepancy: errdata3916.sframe
- You can start the IPython notebook with this command:

```
cd /path/to/gbdtcoder3916
ipython notebook gbdtdecodercheck.ipynb
```

In case you cannot use IPython, simply copy and paste the following code (line by line) into your Python console to see the error (before doing that, cd to /path/to/gbdtcoder3916):

```python
import graphlab as gl
from gbdtdecoder import GBDTDecoder

m = gl.load_model('e3916.model')
df = gl.load_sframe('errdata3916.sframe')
m.predict(df)   # result : 0.3616206794977188
gd = GBDTDecoder.create_from_gbdt(m)
gd.predict(df)  # result : 0.6402680000000001

# if you want to verify the training process,
# execute the following 4-5 times until you get the large error!
traindata = gl.load_sframe('traindata3916.sframe')
testdata = gl.load_sframe('testdata3916.sframe')

feature_set1 = ['iscv', 'detected_vertical', 'advertiser_category',
                'weekend', 'hour', 'recency']

m2 = gl.boosted_trees_regression.create(traindata, target='iscv',
                                        max_iterations=10, max_depth=9,
                                        row_subsample=1.0,
                                        column_subsample=1.0)

m2.predict(df)
gd2 = GBDTDecoder.create_from_gbdt(m2)
gd2.predict(df)
```

If you need any further information, please reply to this thread or drop me an email (I'm in a different timezone from you, so it may take time if we only use the forum to communicate).


User 1190 | 3/2/2016, 9:30:13 PM

Thanks, Defret. This is very helpful. I will take a look and get back to you.


User 2124 | 3/3/2016, 1:09:19 AM

Hi Jay,

Thank you so much. We are looking forward to seeing your result.


User 1190 | 3/3/2016, 11:11:44 PM

Hi Defret,

I figured out the issue: an integer feature comparison at a boundary condition takes the wrong branch. I compared per-tree scores between the binary model and the JSON-decoded model, and found that the problem is in tree 6.

The decoded model takes the following path:

```
0:detected_vertical=v317=1,2,1
1:detected_vertical=v299=1,4,3
3:detected_vertical=v1311=1,8,7
8:recency<3.000000,15,16
16:recency<2883.000000,29,30
30:recency<2924.000000,57,58
57:recency<2915.000000,107,108
107:0.015467,_,_
```

The binary model takes a slightly different path:

```
0:detected_vertical=v317=1,2,1
1:detected_vertical=v299=1,4,3
3:detected_vertical=v1311=1,8,7
8:recency<3.000000,15,16
16:recency<2883.000000,29,30
30:recency<2924.000000,57,58
57:recency<2915.000000,107,108
108:0.172615,_,_
```

Here is the dumped JSON for tree 6:

```
// vertices
{ "id" : 57, "name" : "recency", "type" : "integer", "value" : 2915 }
// edges
{ "src" : 57, "dst" : 107, "value" : "yes" },
{ "src" : 57, "dst" : 108, "value" : "no" },
```

Notice that they diverge at node 57 when comparing "recency" with the split condition 2915.000000. The data point has "recency" 2914, and the binary model representation has split value 2914.0000.

In summary:
1. When we export the model into JSON format, the float split value 2914.000 gets incorrectly rounded to the int 2915.
2. The error happens when the feature value lies right at the boundary condition.

The bug fix will be included in the coming version. In the meantime, the workaround is to subtract 1 from the split value of every integer node.
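In terms of the JSON dump above, the workaround amounts to something like the following (a sketch against the vertex layout shown above; field names may differ between versions):

```python
def apply_integer_split_workaround(tree):
    """Interim workaround sketch: the JSON dump rounds an integer split
    threshold up (e.g. a float near 2914 becomes the int 2915), so
    subtract 1 from the split value of every integer node after loading.
    Leaf and categorical vertices are left untouched.
    """
    for vertex in tree.get("vertices", []):
        if vertex.get("type") == "integer":
            vertex["value"] -= 1
    return tree
```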

Thanks for reporting the issue and providing us with an easy way to reproduce it. Best, -jay


User 2124 | 3/4/2016, 1:42:19 AM

Hi Jay,

Thank you so much for finding the cause. We will implement your proposed workaround immediately and report back with the result.

Thanks, Defret.


User 2124 | 3/4/2016, 2:45:51 AM

Hi Jay,

We have implemented the workaround and tested it. It works for the model and the SFrame that we sent you. However, for a newly trained model (simply re-training the model), the workaround does not seem to work well. I attached the code with the workaround included (the workaround is implemented in the fix_rounded_bug_helper function and is called inside the load_graphlab_json_tree method). Please download it and open the IPython notebook to see the discrepancies (or type the code below into your Python console).

```python
import graphlab as gl
from gbdtdecoder import GBDTDecoder

# execute the following code 4-5 times until you get the large error!
traindata = gl.load_sframe('traindata3916.sframe')
testdata = gl.load_sframe('testdata3916.sframe')
m2 = gl.boosted_trees_regression.create(traindata, target='iscv',
                                        max_iterations=10, max_depth=9,
                                        row_subsample=1.0,
                                        column_subsample=1.0)

m2.predict(df)
gd2 = GBDTDecoder.create_from_gbdt(m2)
gd2.predict(df)
```

I think there are still other bugs in the boosted_trees_regression.BoostedTreesRegression JSON export procedure. By the way, we have also tested some JSONs exported from boosted_trees_classifier.BoostedTreesClassifier (with predict(output_type='probability')) and confirmed that BoostedTreesClassifier correctly exports its JSON. I hope this provides a hint for fixing the JSON export code in BoostedTreesRegression. Best, Defret.


User 1190 | 3/6/2016, 7:48:32 PM

Hi Defret,

Thanks for your questions and feedback. After going through the same debugging process with your new script and data, I noticed that the workaround does not always apply: the split condition is stored as a float and cannot necessarily be cast to an integer without loss. In the coming version, I'm going to include a fix that exports the float as is. I will test the fix on your dataset and make sure all predictions are identical between the binary and JSON versions on the given test data.

Thank you very much! -jay


User 2124 | 3/7/2016, 8:54:37 AM

Hi Jay,

Thank you for the investigation. We look forward to the new version. Could you please provide a timeline for the next release?

Bests, Defret.


User 1190 | 3/7/2016, 6:52:38 PM

You can expect the coming release in a week or two. Let me know if the timeline works for you.


User 2124 | 3/8/2016, 1:45:54 AM

Hi Jay,

Yes, if in the next version the JSON export functionality is stable and produces nearly identical results to the binary model, then the above timeline is fine for us.

Thank you, Defret.


User 2124 | 3/10/2016, 7:56:54 AM

Hi Jay,

Did you mean that the fix shipped in Graphlab version 1.8.4? I have just read the release notes for 1.8.4:

```
Notable Bugfixes:

- Fix inconsistent JSON export representation for decision tree method.
  Previous serialization truncated float split value into integer, which
  led to prediction inconsistencies between the binary model and model
  based on exported JSON.
```

However, after I upgraded to 1.8.4, the error still seems to be there. Could you please let me know when the fix will actually be applied?

Thanks, Defret.


User 1190 | 3/10/2016, 10:04:09 PM

HI Defret,

Unfortunately, the fix missed the 1.8.4 release due to a bug in the build script for dependency tracking. If it is urgent, please contact support@dato.com and we can work on a custom release for you.

Thanks, -jay


User 2124 | 3/14/2016, 1:46:06 AM

Hi Jay,

I have sent an email to your customer support team for a custom release. I also sent a private message to you to provide details about our license.

Best, Defret.