Do jobs created by grid_search.create all run at the same time?

User 2568 | 3/16/2016, 9:51:45 AM

I've set up a random search that has 100 combinations. Does this create 100 jobs in parallel, all competing for resources, or does it queue them in a small number of parallel threads?

When I run job.get_status() I get {'Canceled': 0, 'Completed': 0, 'Failed': 0, 'Pending': 0, 'Running': 100}, which suggests they are all running at the same time.

Given that my server has 4 cores and 15GB of memory, I'm concerned that running all 100 at the same time will lead to excessive contention, which may slow the results or cause the job to fail.

Comments

User 1190 | 3/16/2016, 6:02:59 PM

It should not run 100 at the same time. When doing a parameter search locally, get_status() may not be precise. If you look at "top" or "activity monitor", there should not be 100 python processes running.
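
For a quick cross-check from Python, something like the following counts the python processes the OS actually sees (a rough sketch, assuming a Linux host with pgrep available; it's just the programmatic version of glancing at top):

```python
import subprocess

# Rough cross-check on get_status(): count the python processes the OS is
# actually running (pgrep exits non-zero when nothing matches).
try:
    count = subprocess.check_output(["pgrep", "-c", "python"]).strip()
    print(count.decode() + " python processes running")
except subprocess.CalledProcessError:
    print("no python processes found")
```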


User 2568 | 3/16/2016, 9:14:33 PM

After closer inspection, I can see what's happening. Using job.summary(), I can see that the parameter search sets up 4 jobs (one per core, I guess) and splits the tasks between them.

I can get more detail for the ith job using job.jobs[i], which shows the execution directory and the log files.
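
A minimal sketch of that inspection (only the calls mentioned in this thread are used, plus len(); `job` is the object returned by the parameter search):

```python
# Inspect how the parameter search split its work across batch jobs.
job.summary()                   # shows the batch jobs and how the tasks were divided
for i in range(len(job.jobs)):  # assumes job.jobs is an indexable collection
    print(job.jobs[i])          # shows the execution directory and log file paths
```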

The output of job.get_status() is misleading and should be changed, fixed or at least properly documented.

At the moment it's showing {'Canceled': 0, 'Completed': 1, 'Failed': 0, 'Pending': 0, 'Running': 99}, and it stays this way for a long time.

When I inspect the log files, I can see that job #3 had only one task and has completed. The other 99 tasks are split among jobs 0, 1 and 2. Even though many of the parameter searches (tasks) have completed, get_status() does not show that. Instead, it sums the number of tasks assigned to the running jobs, whether they are complete or not. That is what is misleading.
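
To make the miscount concrete, here is a toy illustration in plain Python (not GraphLab code) of job-level counting, which reproduces the numbers above:

```python
# Each batch job carries a job-level state plus the number of tasks assigned to it.
batches = [
    {"state": "Running",   "tasks": 33},   # jobs 0, 1, 2 hold 33 tasks each
    {"state": "Running",   "tasks": 33},
    {"state": "Running",   "tasks": 33},
    {"state": "Completed", "tasks": 1},    # job 3 held a single task and finished
]

# Job-level counting: every task inherits its batch's state, even tasks that
# have already finished inside a still-running batch.
status = {"Running": 0, "Completed": 0}
for b in batches:
    status[b["state"]] += b["tasks"]
print(status)   # -> 99 Running, 1 Completed, matching the output above
```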

Twice now, I've run this job. Top shows it executing, then after an hour or so the processes are no longer running, but job.get_results() does not return. In one case the notebook said the kernel had died. Now that I know where the logs are, I can check and find out what happened.


User 2568 | 3/16/2016, 9:33:11 PM

I had the notebook kernel crash again. I think the jobs had all completed, so I ran job.get_status(). There was a long wait, my SSH terminal became unresponsive, and then I got the "kernel has died" message.

When I go to /home/ec2-user/.graphlab/artifacts/results, the logs for each of the 4 jobs look fine.
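
For reference, a quick way to list those logs from Python (the path is the one above; the flat glob pattern is just an assumption about the directory layout):

```python
import glob
import os

# List whatever the parameter search wrote under its artifacts directory.
results_dir = "/home/ec2-user/.graphlab/artifacts/results"
for path in sorted(glob.glob(os.path.join(results_dir, "*"))):
    print(path)
```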

I'm going to rerun the job with a much smaller max_iterations and see if I can quickly reproduce the problem.


User 2568 | 3/16/2016, 10:25:00 PM

I reproduced the bug with max_iterations set to just 3. The error message is "The kernel appears to have died. It will restart automatically." It looks like the bug is in get_status(), and probably in get_results() as well, but I've not tested the latter.

I've created a separate bug report.


User 1190 | 3/17/2016, 11:28:15 PM

There are three issues being reported here:

A. Workload is not evenly distributed: 100 tasks -> 33, 33, 33, 1.
B. Status report is not accurate.
C. Out of memory.

A is easy to fix. On line 546 of graphlab/toolkits/modelparametersearch/modelparameter_search.py, change

batch_size = max(10, len(parameter_sets) / 3)

to

batch_size = max(10, math.ceil(len(parameter_sets) / 3.0))

We will include the fix in the next release.
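
For context, here is a small sketch in plain Python of why the old expression yields the 33/33/33/1 split and what the ceiling version gives instead (the split helper is hypothetical and only illustrates the arithmetic; math.ceil returns a float, hence the explicit int casts):

```python
import math

parameter_sets = list(range(100))   # stand-in for the 100 parameter combinations

# Old expression: with Python 2 integer division, 100 / 3 == 33.
batch_size_old = max(10, len(parameter_sets) / 3)
# Fixed expression: round up, then cast to int so it can be used for slicing.
batch_size_new = int(max(10, math.ceil(len(parameter_sets) / 3.0)))

def split(sets, size):
    """Chop the parameter sets into consecutive batches of at most `size`."""
    return [sets[i:i + size] for i in range(0, len(sets), size)]

print([len(b) for b in split(parameter_sets, int(batch_size_old))])  # [33, 33, 33, 1]
print([len(b) for b in split(parameter_sets, batch_size_new)])       # [34, 34, 32]
```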

B and C are not so easy to fix right now. We group the parameter sets into batches. The maximum number of batches is set to 3 (hence the divide by 3 above). This balances the degree of asynchrony against the overhead of submitting jobs when executing remotely.

Currently there is no way to get fine-grained status information for a batch of tasks.

When modelparametersearch is running locally and asynchronously, all batches run together without queuing. Within a batch, tasks appear to run sequentially. That said, the machine is training 3 models at a time, and there has to be enough memory to support that.


User 2568 | 3/18/2016, 3:50:23 AM

Jay, I've applied the fix to modelparameter_search.py. Where do I put it so it's not overwritten and lost in the next upgrade?

A: I've fixed this in my local copy. I had to cast the result to an int, otherwise there are errors later.

B: There is a relatively simple way to calculate the task status, and I've rewritten get_status() to do this. I've also added a progressbar() method; see the example here.

What is the process for getting this into a future update of GraphLab?


User 2568 | 3/18/2016, 3:52:15 AM

I can't seem to attach the updated modelparameter_search.py file.

I've put a copy on GitHub.


User 1190 | 3/19/2016, 8:34:35 PM

Thank you very much for your contribution. I will make sure the change is included in the next release!