User 690 | 12/25/2014, 3:08:51 PM
Hi everybody,

In my previous question, I was told that machine learning toolkits like logistic_classifier are designed and implemented to run efficiently on datasets much larger than the machine's memory. Can I assume the same about operations on SFrame? For example, would join, groupby, and unique work gracefully when the size of the data exceeds the size of the memory? Is there anything I need to watch out for?

I am asking because I want to remove the preprocessing Pig script that feeds data to my GraphLab Create Python code. If SFrame can handle this gracefully, I can reduce my code complexity. The cluster on which I run the Pig script isn't very large (10 machines), so the data is only on the order of 100 GB. I would love to hear advice on this.

I also like the fact that most operations return a new SFrame and leave the original intact. However, should I be concerned about excessive memory usage? Can I force garbage collection on an earlier SFrame once I know that I am only interested in the transformed SFrame?

Thanks,
Sunil