How to use multiprocessing with Graphlab Create?

User 761 | 10/30/2014, 10:49:05 AM

Hey! I'm trying to parallelise cross-validation using the Python multiprocessing module.

Here is a code snippet:

    pool = Pool(processes=3)  # where 3 is the number of folds
    output = pool.map(worker, [sf, sf, sf])

Where 'sf' is an SFrame containing the data and 'worker' is a function that executes one fold of cross-validation. However, I get the following error:

    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 102, in worker
        task = get()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 376, in get
        return recv()
      File "cy_sframe.pyx", line 51, in graphlab.cython.cy_sframe.UnitySFrameProxy.__cinit__
    TypeError: __cinit__() takes at least 1 positional argument (0 given)

Could this be related to a pickling issue? Maybe SFrames can't be pickled? Any help would be greatly appreciated.
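For reference, here's a minimal round-trip test (toy SFrame, purely illustrative) that should show whether pickling is the culprit:

    import pickle
    import graphlab as gl

    sf = gl.SFrame({'x': [1, 2, 3]})
    try:
        # multiprocessing pickles arguments on the way to workers and
        # unpickles them on arrival, so a failed round trip here means
        # pool.map(worker, [sf, ...]) will fail the same way
        pickle.loads(pickle.dumps(sf))
    except Exception as e:
        print "SFrame did not survive a pickle round trip:", e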

Also, are there any better ways to go about parallelising the cross validation? How do you guys do it?

Thanks!

(Any plans to include a cross validation module in the future?)

Comments

User 91 | 10/30/2014, 4:14:38 PM

Unfortunately, we do not support cross validation currently. It is definitely something we will include in the future.

For now, there are two options. You could use model_parameter_search, which performs a grid search using a train-test split. In my experience, it works just as well as cross validation if you have a lot of data.
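If you want to roll that pattern by hand, a minimal sketch looks something like the following (the model, 'label' target, penalty grid, and data path are placeholders, and exact model names vary by GLC version; this shows the idea, not the actual model_parameter_search implementation):

    import graphlab as gl

    sf = gl.SFrame('path/to/saved_sframe')        # placeholder path
    train, valid = sf.random_split(0.8, seed=0)   # single train-test split

    best_model, best_acc = None, 0.0
    for penalty in [0.001, 0.01, 0.1, 1.0]:       # illustrative grid
        model = gl.logistic_classifier.create(train, target='label',
                                              l2_penalty=penalty, verbose=False)
        acc = model.evaluate(valid)['accuracy']
        if acc > best_acc:
            best_model, best_acc = model, acc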

If you want to implement cross validation using multiprocessing, then you won't be able to pass the SFrame into the function, because SFrames can't be pickled for now. But you can work around it by loading the SFrame from a file path.

In this example, we convert the function foo, which takes an SFrame directly, into one that takes the path of a saved SFrame. By explicitly saving and loading, you avoid pickling. This will also work out to be more efficient for you, because you only need to serialize (same idea as pickling) once, and then all your workers can load it.

    def foo(sf):
        # Perform some operations with the SFrame
        return sf

    def foo_for_multiprocessing(sf_path):
        sf = gl.SFrame(sf_path)
        # Perform some operations with the SFrame
        sf.save(sf_path)
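Putting it together with a Pool would look roughly like this (a sketch only; the gl alias, toy data, and /tmp paths are placeholders):

    import graphlab as gl
    from multiprocessing import Pool

    sf = gl.SFrame({'x': [1, 2, 3]})                   # toy data for illustration
    paths = ['/tmp/fold_%d_sframe' % i for i in range(3)]
    for p in paths:
        sf.save(p)                                     # serialize once per worker

    pool = Pool(processes=3)
    pool.map(foo_for_multiprocessing, paths)           # workers load, process,
                                                       # and save back to their
                                                       # own paths (no conflicts)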


User 761 | 10/31/2014, 6:01:07 AM

Hey srikrs! Thanks for the fast reply :D

"But you can work around it by loading the SFrame form a file path." That's exactly the workaround I was trying but I've been having trouble with that as well. Here's some code that should reproduce the problem

    from graphlab import *
    from multiprocessing import Pool

    # placeholder: in my actual script this is a real absolute path on disk
    absolute_path = path = '/tmp/please_work_sframe'

    def pleaseWork(path):
        print "Inside pleaseWork"
        sf = SFrame(path)
        print sf

    sf1 = SFrame({'4': ['8'], '15': ['16'], '23': ['42']})
    print sf1
    sf1.save(absolute_path)

    sf2 = SFrame(absolute_path)
    print sf2

    pool = Pool(processes=2)
    pool.map(pleaseWork, [path, path])

Here's my output

    [INFO] Start server at: ipc:///tmp/graphlab_server-863
           - Server binary: /Users/mihir.kale/lifeStage/virt/lib/python2.7/site-packages/graphlab/unity_server
           - Server log: /tmp/graphlab_server_1414735006.log
    [INFO] GraphLab Server Version: 1.0.1
    +----+----+---+
    | 15 | 23 | 4 |
    +----+----+---+
    | 16 | 42 | 8 |
    +----+----+---+
    [1 rows x 3 columns]

    +----+----+---+
    | 15 | 23 | 4 |
    +----+----+---+
    | 16 | 42 | 8 |
    +----+----+---+
    [1 rows x 3 columns]

    Inside pleaseWork
    Inside pleaseWork

NOTE: The program doesn't terminate; it just gets stuck there. What am I missing? I'm sure it must be something stupid. I've been stuck at this parallelising part for a while now...


User 91 | 11/1/2014, 1:03:40 AM

GraphLab is already using all the cores that you have, so doing cross validation one-by-one would probably be faster than trying to run multiple Python processes.
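For example, a plain sequential K-fold loop along these lines keeps every fold's training multi-threaded (a sketch; the model, 'label' target, accuracy metric, and data path are placeholders):

    import random
    import graphlab as gl

    sf = gl.SFrame('path/to/saved_sframe')   # placeholder path
    k = 3
    # assign each row to a random fold
    sf['fold'] = gl.SArray([random.randint(0, k - 1)
                            for _ in range(sf.num_rows())])

    scores = []
    for i in range(k):
        train = sf[sf['fold'] != i]
        test = sf[sf['fold'] == i]
        train.remove_column('fold')          # don't train on the fold id
        test.remove_column('fold')
        model = gl.logistic_classifier.create(train, target='label',
                                              verbose=False)
        scores.append(model.evaluate(test)['accuracy'])
    print "mean CV accuracy:", sum(scores) / float(k)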

Have you tried GraphLab's model_parameter_search? It comes with a built-in ability to tune the parameters of all models.


User 761 | 11/4/2014, 5:27:24 AM

Hey, I had taken a look at model_parameter_search, but from the documentation it looks like it's more of a grid-search method. For hyperparameter tuning I'm looking at a library called Spearmint, which searches the hyperparameter grid in a 'smarter' way than brute force. Moreover, I need cross-validation not just to tune the model, but also to combine the objective functions from the models trained on each individual fold into a final objective (using averaging or some other technique). But it's great to know that GraphLab already uses all the cores available! Thanks :smiley:

Some more info about Spearmint (if interested):
GitHub link: https://github.com/JasperSnoek/spearmint
The paper it's based on: http://arxiv.org/pdf/1206.2944v2.pdf


User 18 | 11/4/2014, 11:08:39 PM

Hi mihirkale815,

For cross-validation, what's important to you about getting an aggregate objective value? Are you mostly concerned with getting a confidence interval around that value?

Yes, we are aware of Spearmint. Smarter search can definitely be useful, but it also comes with a few hidden costs. One is that it is much less parallelizable, so it may save in terms of total computation cycles (because it finds the optimum with fewer model trainings/evaluations) but take longer in terms of wall-clock time (because instead of getting ten machines to do the job in parallel, you have to wait for one machine to chug through all of it). Another hidden cost is that smart search methods generally have their own parameters (hyper-hyperparameters, I guess) that sometimes need to be tuned in order for the search itself to go well. How to tune those parameters is not always obvious. Thirdly, the search method itself can require a lot of computation to figure out which configuration to try next. If model training itself is very fast, then you might actually end up taking longer than just trying out a bunch of models without trying to be smart about it.

There was a very interesting paper that talks about the benefit of random search (a cheaper and randomized variant of grid search): http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf. It gives compelling evidence that, for a large number of tasks, one doesn't necessarily need to be too smart to find a good hyperparameter configuration. Instead of searching over the entire grid, just sample 60 configurations and take the max. This gets you within 5% of optimum 95% of the time.
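The arithmetic behind the 60-sample rule: the probability that none of 60 independent draws lands in the top 5% of configurations is 0.95^60 ≈ 0.046, so at least one of them does with probability ≈ 95%. A sketch of that kind of random search (the search space, model, target column, and data path below are all placeholders):

    import random
    import graphlab as gl

    def sample_config():
        # placeholder search space; substitute your model's hyperparameters
        return {'l2_penalty': 10 ** random.uniform(-4, 1),
                'max_iterations': random.choice([10, 25, 50])}

    train, valid = gl.SFrame('path/to/saved_sframe').random_split(0.8, seed=0)

    def score(config):
        model = gl.logistic_classifier.create(train, target='label',
                                              verbose=False, **config)
        return model.evaluate(valid)['accuracy']

    # sample 60 configurations and keep the best one
    best = max([sample_config() for _ in range(60)], key=score)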

With all of that said, we definitely plan to extend GLC's model search capabilities, probably starting with random search and then looking at "smarter" search methods.

Alice


User 18 | 11/5/2014, 12:12:35 AM

Oh wait! We already have random search. It's a little buried in the documentation, so we'll make a note to make this clearer.

    max_num_models : int, optional
        The maximum number of models to create. If max_num_models is less than
        the number of possible hyperparameter combinations, then max_num_models
        of the possible combinations will be randomly picked.


User 761 | 11/6/2014, 9:09:54 AM

Hey alicez,

That's very interesting. I didn't know a randomized grid search could be so effective. I will definitely experiment with max_num_models and model_parameter_search.

[Note: My team (including me) is in love with GraphLab and has decided to use it exclusively for the entire pipeline ("inspiration to production"). I'm supposed to create a 'machine learning platform' for data scientists across the company; it's basically a web UI wrapped around GraphLab Create. I need to add custom modules for feature selection, scaling, L1 normalization, etc. (many of these may become obsolete as more features are added to GraphLab Create). The cross-validation and Spearmint features are explicit requirements of this internship project, so irrespective of anything else I have to add them so that others can experiment with them.]

Thanks for all the help and advice!


User 91 | 11/6/2014, 5:04:54 PM

Mihir,

Thanks for the feedback. We are excited to hear about your work!

There are several things, both in our pipeline and in the product, that can help with your use cases.

We have been talking internally about many of the features you have proposed. We would love to hear more about your project and use cases.

Can you shoot us an email at srikris@graphlab.com? We would love to have a longer conversation with you and your team.


User 761 | 11/7/2014, 4:59:32 AM

Cool! Will send an email soon.