SFrame operations

User 690 | 12/25/2014, 3:08:51 PM

Hi everybody,

In my previous question, I was told that machine learning toolkits like logistic_classifier are designed and implemented to run efficiently on datasets much larger than the machine's memory. Can I assume the same about operations on SFrame? For example, would join, groupby, and unique function gracefully when the size of the data exceeds the size of memory? Do I need to watch out for anything?

I am asking because I want to remove the preprocessing pig-script that feeds data to my graphlab-create Python code. If SFrame can handle this gracefully, I can reduce my code complexity. The cluster on which I run the pig-script isn't too large (10 machines), so the data is only on the order of 100 GB. I would love to hear advice in this regard.

I also like the fact that most operations return a new SFrame, leaving the original intact. However, should I be concerned about excessive memory usage? Can I force garbage collection on an earlier SFrame once I know that I am only interested in the transformed SFrame?

Thanks, Sunil.

Comments

User 14 | 12/25/2014, 9:08:14 PM

Hi,

The short answer is yes. SFrame is designed for handling much larger data than can fit in memory. For more details, I encourage you to read this <a href="http://blog.graphlab.com/data-processing-architecture-of-graphlab-create">blog post</a> by our chief architect, Yucheng Low, on the design philosophy of the SFrame.

Since an SFrame is stored on disk, you need to watch the disk space of the temporary directory (usually /var/tmp/). Integers can be stored much more compactly than strings, so choosing column types wisely can dramatically reduce both processing time and disk load. You should be able to use SFrame with 100 GB of data on a decent machine. The more memory the better, because we can cache more data in memory before flushing to disk.
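To make the disk-space advice concrete, here is a minimal sketch that checks free space on the temporary directory before starting a large job. It uses only the Python standard library (shutil.disk_usage needs Python 3.3 or later), not any GraphLab Create API. The helper names free_bytes and has_room are made up for illustration, and the 100 GB figure is simply the input size from the question; how much temporary space a given SFrame operation actually needs is not stated in this thread.

```python
import shutil
import tempfile


def free_bytes(path=None):
    """Free bytes on the filesystem holding `path` (default: the system temp dir)."""
    if path is None:
        # Assumption: the system temp dir; pass "/var/tmp" to check that location instead.
        path = tempfile.gettempdir()
    return shutil.disk_usage(path).free


def has_room(required_bytes, path=None):
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return free_bytes(path) >= required_bytes


# Placeholder budget: the ~100 GB input size mentioned in the question.
needed = 100 * 10 ** 9
print("free temp space: %.1f GB" % (free_bytes() / 1e9))
print("room for ~100 GB of temporary files:", has_room(needed))
```

If the check fails, the options are to free up space or to point the temporary directory at a larger disk before loading the data.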

Garbage collection is usually not a problem; relying on Python's garbage collection is good enough. However, if your code holds many references to SFrames you no longer need, you are welcome to delete them. For memory, we recommend 2 GB per core, since most operations and algorithms are implemented with multicore parallelism.
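As a rough illustration of the "delete unused references" advice, the sketch below shows the Python side only. Table is a hypothetical stand-in class, not an SFrame, and the weak reference is there purely to observe when the original object is reclaimed. In CPython, an object is freed as soon as its last reference is dropped; gc.collect() only matters when reference cycles are involved.

```python
import gc
import weakref


class Table(object):
    """Hypothetical stand-in for a large SFrame-like object (not a GraphLab class)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def doubled(self):
        # Like most SFrame operations, return a new object and leave self intact.
        return Table(r * 2 for r in self.rows)


raw = Table(range(5))
probe = weakref.ref(raw)   # lets us observe the object without keeping it alive
clean = raw.doubled()

del raw                    # drop the only strong reference to the original
gc.collect()               # optional here; needed only for reference cycles

print("original reclaimed:", probe() is None)
print("transformed rows:", clean.rows)
```

The same pattern applies when a variable is simply rebound, for example assigning the transformed result back to the name that held the original.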

Best, Jay