Asynchronous Snapshot Algorithm
User 90 | 7/7/2015, 2:42:51 AM
The distributed GraphLab paper describes an asynchronous snapshot algorithm based on the Chandy-Lamport algorithm (section 4.3) that provides fault tolerance with low checkpointing overhead. However, I cannot find any implementation of this algorithm in the async engine (synchronous snapshotting is implemented in the synchronous engine). Has it been removed from the publicly available source?
User 1189 | 7/8/2015, 5:57:55 PM
It was part of a much older source that is no longer available.
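For reference, the marker-passing idea behind Chandy-Lamport can be sketched as a toy, single-process simulation. This is not the removed GraphLab code and says nothing about how GraphLab implemented it; the names `Process`, `Network`, `send`, `start_snapshot`, `deliver`, and `drain` are made up for illustration, and the "balance transfer" workload is just an assumed example whose total is easy to check.

```python
from collections import deque

MARKER = object()  # sentinel for the snapshot marker message


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance
        self.recorded_balance = None  # local state captured by the snapshot
        self.channel_log = {}         # sender pid -> in-flight amounts recorded
        self.recording = set()        # incoming channels still being recorded


class Network:
    """Fully connected processes joined by FIFO channels (hypothetical model)."""

    def __init__(self, balances):
        self.procs = [Process(i, b) for i, b in enumerate(balances)]
        n = len(self.procs)
        self.chan = {(s, d): deque()
                     for s in range(n) for d in range(n) if s != d}

    def send(self, src, dst, amount):
        # Application message: the amount leaves src now and is in flight.
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def _record(self, p, skip=None):
        # Record local state, start recording incoming channels (except the
        # one the first marker arrived on), then send a marker on every
        # outgoing channel.
        p.recorded_balance = p.balance
        others = [q.pid for q in self.procs if q.pid != p.pid]
        p.channel_log = {s: [] for s in others}
        p.recording = {s for s in others if s != skip}
        for d in others:
            self.chan[(p.pid, d)].append(MARKER)

    def start_snapshot(self, pid):
        self._record(self.procs[pid])

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg is MARKER:
            if p.recorded_balance is None:
                self._record(p, skip=src)   # first marker seen
            else:
                p.recording.discard(src)    # channel state is now complete
        else:
            p.balance += msg
            if src in p.recording:
                p.channel_log[src].append(msg)  # message was in flight

    def drain(self):
        # Deliver messages until every channel is empty.
        while any(self.chan.values()):
            for (s, d), q in self.chan.items():
                if q:
                    self.deliver(s, d)

    def snapshot_total(self):
        return sum(p.recorded_balance
                   + sum(sum(v) for v in p.channel_log.values())
                   for p in self.procs)


net = Network([100, 100, 100])
net.send(0, 1, 30)       # in flight when the snapshot starts
net.start_snapshot(1)
net.send(2, 1, 5)        # sent before process 2 sees a marker
net.drain()
print(net.snapshot_total())  # 300
```

The recorded local states plus the recorded in-flight messages add up to the same total as the live system, which is the consistency property the snapshot is meant to give, and no process had to pause while it was taken.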
User 90 | 7/8/2015, 6:04:03 PM
Thanks Yucheng. Oddly, the machines in my cluster have been failing fairly often. Does the asynchronous engine provide any fault tolerance for this? I noticed that the process simply hangs (and sometimes dies) when a machine drops out.
User 1189 | 7/8/2015, 6:25:19 PM
The backend is "MPI-like" in nature, i.e., if one machine goes down, everything goes down.
User 90 | 7/8/2015, 7:58:10 PM