Asynchronous Snapshot Algorithm

User 90 | 7/7/2015, 2:42:51 AM

The distributed GraphLab paper describes an asynchronous snapshot algorithm based on the Chandy-Lamport algorithm (section 4.3) that provides fault tolerance with low checkpointing overhead. However, I cannot find any implementation of this algorithm in the async engine (synchronous snapshotting is implemented in the synchronous engine). Has it been removed from the publicly available source?

Comments

User 1189 | 7/8/2015, 5:57:55 PM

It was part of a much older source that is no longer available.


User 90 | 7/8/2015, 6:04:03 PM

Thanks Yucheng. Oddly, the machines in my cluster have been failing fairly often. Does the asynchronous engine provide any fault tolerance for this? I noticed that the process simply hangs (and sometimes dies) when a machine drops out.


User 1189 | 7/8/2015, 6:25:19 PM

The backend is "MPI-like" in nature, i.e., if one machine goes down, everything goes down.


User 90 | 7/8/2015, 7:58:10 PM

Okay. Thanks Yucheng.