Asynchronous Snapshot Algorithm

User 90 | 7/7/2015, 2:42:51 AM

The distributed GraphLab paper describes an asynchronous snapshot algorithm based on the Chandy-Lamport algorithm (Section 4.3) that provides fault tolerance with low checkpointing overhead. However, I cannot find any implementation of this algorithm in the async engine (synchronous snapshotting is implemented in the synchronous engine). Has it been removed from the publicly available source?
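
For context, here is a minimal sketch of the classic marker-based Chandy-Lamport snapshot that the paper's algorithm builds on. It is a generic, single-threaded simulation over FIFO channels, not GraphLab code and not the vertex-level variant from the paper. All names (`Process`, `Network`, `start_snapshot`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import deque

MARKER = "MARKER"  # control message that separates pre- and post-snapshot traffic


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance
        self.recorded_balance = None  # local state captured by the snapshot
        self.recording = {}           # src pid -> in-flight amounts recorded
        self.closed = set()           # incoming channels whose marker has arrived


class Network:
    """Hypothetical simulation: one FIFO channel per ordered pair of processes."""

    def __init__(self, balances):
        self.procs = {pid: Process(pid, b) for pid, b in balances.items()}
        self.chan = {(s, d): deque()
                     for s in self.procs for d in self.procs if s != d}

    def send(self, src, dst, amount):
        # Application message: money leaves src and is in flight until delivered.
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def _record(self, pid):
        # Record local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        p = self.procs[pid]
        p.recorded_balance = p.balance
        for other in self.procs:
            if other != pid:
                p.recording[other] = []
                self.chan[(pid, other)].append(MARKER)

    def start_snapshot(self, pid):
        self._record(pid)

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_balance is None:
                self._record(dst)   # first marker seen: take the local snapshot
            p.closed.add(src)       # stop recording the channel it arrived on
        else:
            p.balance += msg
            # Messages arriving after the local snapshot but before the
            # channel's marker belong to the channel state.
            if p.recorded_balance is not None and src not in p.closed:
                p.recording[src].append(msg)

    def drain(self):
        # Deliver messages until every channel is empty.
        while any(self.chan.values()):
            for (src, dst), queue in self.chan.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_total(self):
        # Sum of recorded local states plus recorded in-flight messages.
        return sum(p.recorded_balance + sum(sum(v) for v in p.recording.values())
                   for p in self.procs.values())


net = Network({"A": 100, "B": 100, "C": 100})
net.send("A", "B", 10)
net.start_snapshot("A")
net.send("B", "A", 20)  # in flight while the snapshot is being taken
net.drain()
print(net.snapshot_total())
```

The point of the sketch is the consistency property: the recorded local states plus the recorded channel contents add up to the same total as the system started with, even though no process was ever paused.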


User 1189 | 7/8/2015, 5:57:55 PM

It was part of a much older source that is no longer available.

User 90 | 7/8/2015, 6:04:03 PM

Thanks, Yucheng. Oddly, the machines in my cluster have been failing more often. Does the asynchronous engine provide any fault tolerance for this? I noticed that the process simply hangs (and sometimes dies) when a machine drops out.

User 1189 | 7/8/2015, 6:25:19 PM

The backend is "MPI-like" in nature, i.e., if one machine goes down, everything goes down.

User 90 | 7/8/2015, 7:58:10 PM

Okay. Thanks, Yucheng.