Error on distributed version while running async engine with factorized=false

User 344 | 8/13/2014, 5:03:34 AM

Hi!

I have implemented an algorithm that requires edge consistency. The algorithm works up to 6000 edges and 3511, and everything runs fine. Beyond that I get several communication errors, including the one reported in a previous post: http://forum.graphlab.com/discussion/186/error-while-eusing-asynchronous-engine (I have already uncommented the parameters in the rpc file). I found that the errors occur when many messages take the "if (someone_else_running)" branch at line 971 of async_consistent_engine.hpp, in the following code:

    bool get_exclusive_access_to_vertex(const lvid_type lvid,
                                        const message_type& msg) {
      vertexlocks[lvid].lock();
      bool someone_else_running = program_running.set_bit(lvid);
      if (someone_else_running) {
        // bad. someone else is here.
        // drop it into the message array
        messages.add(lvid, msg);
        hasnext.set_bit(lvid);
      }
      vertexlocks[lvid].unlock();
      return !someone_else_running;
    }

It seems that many messages are being dropped.
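
To make that branch concrete, here is a minimal standalone sketch of the same test-and-set pattern as I understand it from the code above. It is not GraphLab code: VertexGate, try_acquire, release and the int message type are hypothetical stand-ins, and it assumes the bit-setting call returns the previous value of the bit. It only illustrates the locking logic, where a second caller for the same vertex fails to get access and its message is appended to a pending list.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the engine state; none of these names are GraphLab's.
struct VertexGate {
  std::vector<std::mutex> locks;          // one lock per vertex
  std::vector<char> running;              // 1 if a program is running on the vertex
  std::vector<std::vector<int>> pending;  // deferred messages per vertex

  explicit VertexGate(std::size_t n) : locks(n), running(n, 0), pending(n) {}

  // Returns true if the caller now owns the vertex.
  // Otherwise the message is stored in the pending list for that vertex.
  bool try_acquire(std::size_t v, int msg) {
    std::lock_guard<std::mutex> guard(locks[v]);
    bool someone_else_running = (running[v] != 0);  // previous value of the flag
    running[v] = 1;                                 // the flag is set either way
    if (someone_else_running) {
      pending[v].push_back(msg);
    }
    return !someone_else_running;
  }

  // Clears the running flag so the next caller can acquire the vertex.
  void release(std::size_t v) {
    std::lock_guard<std::mutex> guard(locks[v]);
    running[v] = 0;
  }
};
```

In this sketch, two consecutive try_acquire calls on the same vertex without a release in between give true and then false, and the second message ends up in pending for that vertex. Whether the real engine later picks those stored messages up correctly in my runs is exactly what I am unsure about.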

Do you think it is an error with the data? As I mentioned, the algorithm runs fine on a subset of the data.

Could you provide any pointers or help? I appreciate it. Thanks!

Comments

User 344 | 8/13/2014, 5:04:14 AM

I have tested the code on nodes running MPICH2.