fiber_control affinity check is failing?

User 154 | 3/17/2014, 8:54:05 PM

Hi folks; I've got a build of graphlab from github as of Feb. 21 which I've modified to compile on our big SGI UV 'blacklight'. Running the pagerank example like:

mpirun -np 16 omplace -nt 2 ./pagerank --powerlaw=10000

I am getting the following error from each MPI process: ERROR: fiber_control.cpp(launch:266): Check failed: affinity.popcount()>0 [0 > 0]

GraphLab also seems to be spawning far more threads than I told it to: 2911 threads at the time the batch manager notices and kills it. Any suggestions on why the affinity.popcount() test is failing, or why it is starting almost 3000 threads instead of 16*2=32?

Thanks, -Joel

Comments

User 154 | 3/17/2014, 9:24:44 PM

The same thing happens if I remove the 'omplace -nt 2' clause from the command line, by the way.


User 20 | 3/17/2014, 11:42:05 PM

Blacklight is the large distributed shared memory machine, isn't it? We create as many worker threads as there are CPUs, and I suspect we are detecting too many CPUs. We use the following code to get the number of CPUs:

return sysconf(_SC_NPROCESSORS_CONF);

Might it be returning the total number of CPUs in all of blacklight?

I do not know if there is a configuration option for omplace you can use, but a workaround is to modify the thread::cpucount function in pthreadtools.cpp to return a smaller number.
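
One way to check this on the machine itself is to compare the CPU count reported by sysconf with the number of CPUs the process is actually allowed to run on. The following is a minimal sketch, assuming Linux with glibc; the function names configured_cpus and allowed_cpus are hypothetical and are not part of GraphLab. It uses a dynamically sized affinity mask because a fixed cpu_set_t only covers CPU_SETSIZE CPUs, which may be too few on a very large shared memory machine.

```cpp
#include <sched.h>    // sched_getaffinity, CPU_ALLOC and friends (Linux/glibc)
#include <unistd.h>   // sysconf
#include <cerrno>
#include <cstddef>

// Number of CPUs configured in the whole system image. This is the same
// call quoted in the comment above.
long configured_cpus() {
  return sysconf(_SC_NPROCESSORS_CONF);
}

// Number of CPUs in the calling process's affinity mask, or -1 on failure.
// Hypothetical helper for illustration only.
long allowed_cpus() {
  long capacity = 1024;  // starting mask size in CPUs; doubled if too small
  for (int attempt = 0; attempt < 8; ++attempt) {
    cpu_set_t* mask = CPU_ALLOC(capacity);
    if (mask == nullptr) return -1;
    const std::size_t bytes = CPU_ALLOC_SIZE(capacity);
    CPU_ZERO_S(bytes, mask);
    if (sched_getaffinity(0, bytes, mask) == 0) {
      const long count = CPU_COUNT_S(bytes, mask);
      CPU_FREE(mask);
      return count;
    }
    const int err = errno;
    CPU_FREE(mask);
    if (err != EINVAL) return -1;  // EINVAL means the mask was too small
    capacity *= 2;
  }
  return -1;
}
```

If configured_cpus() reports the size of the whole machine while allowed_cpus() reports only the handful of cores assigned to the job, that would be consistent with the over-detection suspected above.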


User 154 | 3/18/2014, 4:32:27 AM

Ah, that's exactly what is happening! The traditional fix is to check an environment variable. Hm; each MPI child process is running this test independently, correct? So the correct value would be equal to the number of cores available per MPI process?

On a multi-core system, do you recommend running one MPI process per network node and one thread per core on that node, or one MPI process per core on the overall system?


User 20 | 3/18/2014, 5:00:16 PM

I would recommend one MPI process per machine, and one thread per core. (There is a bug with --ncpus=N with the synchronous engine, so don't use that.) If you change the code to read the number of cores from an environment variable, please send me a patch.


User 154 | 3/21/2014, 10:13:37 PM

I applied the following patch and am having success. The choice of the environment variable name THREADS_PER_WORKER is completely ad hoc.

graphlab/parallel> diff pthreadtools.orig.cpp pthreadtools.cpp
163c163,173
< return sysconf(_SC_NPROCESSORS_CONF);


char* jobsStr = getenv("THREADS_PER_WORKER");
if (jobsStr) {
  int nThreads = atoi(jobsStr);
  if ( nThreads < 2 ) return 2; 
  else return nThreads;
}
else {
  return sysconf(_SC_NPROCESSORS_CONF);      
}

Thanks! -Joel
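
For anyone who wants to try the same idea outside the GraphLab tree, here is a self-contained sketch of the patch above. The function name cpu_count_from_env and its parameter are hypothetical, chosen only for illustration. It keeps the patch's floor of 2 threads, but differs in one respect: it parses with strtol instead of atoi, so a value that is not a clean integer falls back to sysconf rather than being treated as 0.

```cpp
#include <cerrno>
#include <cstdlib>    // getenv, strtol
#include <unistd.h>   // sysconf

// Hypothetical stand-alone version of the patched CPU count lookup.
// var_name is the environment variable to consult, for example
// "THREADS_PER_WORKER" as in the patch above.
long cpu_count_from_env(const char* var_name) {
  const char* text = std::getenv(var_name);
  if (text != nullptr && *text != '\0') {
    errno = 0;
    char* end = nullptr;
    const long requested = std::strtol(text, &end, 10);
    // Accept the value only if the whole string parsed as an integer.
    if (errno == 0 && *end == '\0') {
      return requested < 2 ? 2 : requested;  // same floor of 2 as the patch
    }
  }
  // Variable unset, empty, or malformed: fall back to the system count.
  return sysconf(_SC_NPROCESSORS_CONF);
}
```

With the variable set to 2 in each MPI process's environment, cpu_count_from_env("THREADS_PER_WORKER") returns 2. Whether the variable reaches every rank depends on how the MPI launcher forwards the environment, so that part is worth checking on the target system.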


User 6 | 3/26/2014, 8:05:08 AM

Hi Joel, Based on your valuable feedback, I have merged your fix into the open source. The only change is that I renamed the environment variable to GRAPHLAB_THREADS_PER_WORKER. When you have time please pull the latest version from github and try it out.

Thanks!!!