Difference between pickling and saving a model?

User 2488 | 11/3/2015, 11:52:06 PM

At a high level, I'm wondering if anyone can explain the difference between what happens under the hood when I pickle versus save a model I've trained and built using graphlab's algorithms. My specific use case is to get a sense of how big the models I'm building will be, so I can plan for future deployment: number of servers needed, size of those servers, etc.

While we're on the topic, I barely know anything about how pickling works. As best I can tell, pickling a model involves taking the entire object and converting it into a byte stream. That is, it would store more than just the values of the coefficients (for a logistic regression classifier, say). If I am wrong here, I would definitely love some feedback.

Also, is it true that graphlab now supports pickling, as per the following post? http://forum.dato.com/discussion/812/sframe-sgraph-and-sarray-cannot-be-pickled

Thanks, everyone.

Comments

User 1207 | 11/17/2015, 1:47:32 AM

Hello Chusteven,

Sorry for the delay in getting back to you -- somehow your post slipped by us.

The native Python pickling routines do not actually work with many of our data structures. This is because most of them are backed by files on disk, whereas the native Python pickling routines serialize an object into a byte stream that can be represented in memory. To get around this, we have internal pickling routines, which you reference in your post above, that take care of this and work with the file paths correctly. However, these do not necessarily serialize everything into a single file like the Python ones do.
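To illustrate the mismatch described above, here is a minimal sketch using only the Python standard library. The dict named table is a hypothetical stand-in for a disk-backed structure, not one of the actual graphlab types, and the file name columns.bin is made up for the example. It shows that the native pickle produces an in-memory byte string that captures the path, but not the contents of the file the path points to.

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical on-disk data that the object merely points to.
    data_path = os.path.join(tmp_dir, "columns.bin")
    with open(data_path, "wb") as f:
        f.write(b"\0" * 1_000_000)  # 1 MB written to disk

    # Hypothetical stand-in for a disk-backed structure: it holds a path,
    # not the data itself.
    table = {"name": "example", "data_path": data_path}

    payload = pickle.dumps(table)     # a bytes object held in memory
    restored = pickle.loads(payload)  # round-trips the dict, path included

    file_size = os.path.getsize(data_path)
    print(type(payload).__name__, len(payload), file_size)
```

The pickled payload is far smaller than the file, because only the path string was serialized. Once the file is gone, the restored object points at nothing, which is why disk-backed structures need pickling routines that handle the files explicitly.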

model.save(...), and the other save routines, put all the files into a single directory, including all data and other references. The total size of the files in this directory is the serialized model size. On Linux, you can get it with du -hs <model_dir>. (Note, however, that only some parts of a model may be loaded into memory at a given time for prediction, so disk space is what you're worried about here.)
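If you would rather measure the saved directory from Python than from the shell, a rough sketch along these lines should work. The function name directory_size_bytes is just an illustrative helper, not part of graphlab. It sums the apparent file sizes, so the result can differ slightly from what du reports, since du counts allocated disk blocks by default.

```python
import os

def directory_size_bytes(path):
    """Sum the sizes, in bytes, of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a linked file is not counted at its target's size.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

# Usage, assuming "my_model_dir" is whatever path you passed to model.save(...):
# print(directory_size_bytes("my_model_dir") / 1e6, "MB")
```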

Hopefully that answers your questions!

-- Hoyt