Processes lock with celery, graphlab, and xml on Ubuntu 12.04

User 1132 | 12/29/2014, 8:56:42 PM

I'm having trouble debugging a locking issue on Ubuntu 12.04 with Python 2.7.9. The issue involves using the xml library in a celery task. The call is made in a library that uses graphlab, but the lock occurs without directly using graphlab-create.

To reproduce, start a celery worker in our django project with only one child process,

<pre class="CodeBlock"><code>./manage.py celery worker -c1</code></pre>

In this project we have a task that looks like the following,

<pre class="CodeBlock"><code>from celery import shared_task


@shared_task
def test_et():
    import xml.etree.ElementTree as ET
    print('test_et')
    fpath = '{{path-to-valid-xml-on-disk}}'
    t = ET.parse(fpath)
    print('--- t', t)
</code></pre>

When this task is imported and called directly it runs fine; however, when run as an async task it locks while trying to parse the xml. So far I've traced it to graphlab using gdb.

Find the celery worker's child pid (the "pool.processes" entry in the output),

<pre class="CodeBlock"><code>./manage.py celery inspect stats</code></pre>

Next, attach to this process with gdb, then make a call to the test_et task,

<pre class="CodeBlock"><code>./manage.py celery call app-name.tasks.test_et</code></pre>

<pre class="CodeBlock"><code>sudo gdb
(gdb) attach {{pid}}
(gdb) c
continuing.

^c
program received signal sigint, interrupt.
0x00007f7ccdec90fe in pthread_cond_timedwait@@glibc_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) info threads
  id   target id         frame
* 1    thread 0x7f7cce2e8700 (lwp 6306) "python" 0x00007f7ccdec90fe in pthread_cond_timedwait@@glibc_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) step
single stepping until exit from function pthread_cond_timedwait@@glibc_2.3.2,
which has no line number information.
0x00007f7cb68e024f in boost::future_status boost::detail::basic_future<libfault::message_reply*>::wait_for<long, boost::ratio<1l, 1l> >(boost::chrono::duration<long, boost::ratio<1l, 1l> > const&) const ()
   from /opt/{{app}}/.venv/local/lib/python2.7/site-packages/graphlab/cython/libbase_dep.so
(gdb) step
single stepping until exit from function _znk5boost6detail12basic_futureipn8libfault13message_replyee8wait_forilns_5ratioill1ell1eeeeens_13future_statuserkns_6chrono8durationit_t0_ee,
which has no line number information.
0x00007f7cb68dbbee in cppipc::comm_client::internal_call(cppipc::call_message&, cppipc::reply_message&, bool) () from /opt/{{app}}/.venv/local/lib/python2.7/site-packages/graphlab/cython/libbase_dep.so</code></pre>

Notice the calls into graphlab/cython/libbase_dep.so. After seeing this, I tested different versions of graphlab-create: 0.91 works and the task runs fine in the celery worker, but with anything >= 1.0 the task locks.

I'm not sure what the issue is here or how to go about debugging it further. It's likely some version incompatibility with the libraries on 12.04, since this works on OS X. I haven't tested this on any other Ubuntu releases. If anyone has ideas on what I can look at next, or what might help resolve this issue, it would be greatly appreciated. Thanks.
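One possible next step, offered only as a sketch: complement the C-level view from gdb with a Python-level view of where each thread is. The helper names below (`format_thread_stacks`, `install_stack_dump_handler`) are hypothetical and not part of celery or graphlab, and the choice of signal is an assumption that must not clash with signals the worker already handles. Note that Python signal handlers only run once the interpreter regains control, so if the process is stuck inside native code, as the gdb trace suggests, the handler may never fire; that by itself would indicate the block is below the Python layer.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the Python stack of every live thread."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (ident, names.get(ident, "unknown")))
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)


def install_stack_dump_handler(signum):
    """On `signum`, write all Python thread stacks to stderr."""
    def handler(received_signum, frame):
        sys.stderr.write(format_thread_stacks())
        sys.stderr.flush()
    # Must be called from the main thread of the process.
    signal.signal(signum, handler)
```

A possible usage would be to call `install_stack_dump_handler(signal.SIGUSR1)` early in the task module, trigger the task, and then send that signal to the child pid with `kill`.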

Comments

User 940 | 12/29/2014, 11:14:07 PM

Hi Paul,

Thank you for bringing this to our attention! From a distance, it's hard to identify the culprit. One key suspect is interference between the xml parser and GraphLab. If you define a simpler task (without any xml parsing), does the problem persist?

In any case, it would be helpful to be able to reproduce it on our side. Could you send us a short code sample that reproduces the problem?

Cheers! -Piotr


User 1132 | 12/30/2014, 9:53:04 PM

Simpler tasks complete just fine. I've tried to put together a sample project that recreates the issue, but haven't succeeded. It must be something subtle in our application. With some additional testing I'm finding that it works on Ubuntu 14.04. If I find a way to recreate this in the sample project, I'll let you know.
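Since the problem shows up on 12.04 but not on 14.04, one way to narrow it down might be to run the same small script on both hosts and diff the output. The sketch below uses only the standard library; the function name `collect_environment_info` is hypothetical, and it deliberately does not import graphlab, so it only covers the interpreter, libc, and the expat parser that xml.etree.ElementTree relies on.

```python
import platform
import sys
import xml.etree.ElementTree as ET
from xml.parsers import expat


def collect_environment_info():
    """Gather version details that may differ between the two hosts."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libc": "%s %s" % (libc_name, libc_version),
        "expat": expat.EXPAT_VERSION,
        "elementtree": ET.VERSION,
    }


if __name__ == "__main__":
    for key, value in sorted(collect_environment_info().items()):
        print("%s: %s" % (key, value))
```

Running it inside the same virtualenv the worker uses should give the most relevant comparison, since that is the interpreter the task actually runs under.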