From: Fault tolerance in computational grids: perspectives, challenges, and issues
System | Fault detection technique | Types of faults detected | Fault tolerance technique | Advantages | Disadvantages |
---|---|---|---|---|---|
Globus | Heartbeat monitor | Host failure, network failure | Resubmit the failed job | Generic failure detection | Cannot handle user-defined exceptions |
MDS-2 | GRRP | Task crash failure | Retry | Task crash failure detection through protocols | Cannot handle user-defined exceptions |
Legion | Pinging | Task failure | Checkpoint recovery | Application-level fault tolerance | Cannot distinguish task failure from network failure |
Condor-G (Townend and Xu 2003) | Polling | Host crash, network crash | Retry on same machine | Provides security, job management, and fault tolerance | Retries only on the same machine; cannot detect task crash failure |
NetSolve | Generic heartbeat mechanism | Host crash, task crash, and network failure | Retry on another available machine | Load balancing, heartbeat mechanism, retry on another machine | Does not support diverse failure recovery mechanisms |
CoG Kits (Guimaraes and de Melo 2011) | N/A | N/A | N/A | Security, resource discovery, and resource management | Failure detection is hard-coded; ignores fault tolerance |
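
The heartbeat-based detection and the "retry on another available machine" recovery listed in the table can be illustrated with a minimal Python sketch. It is not the implementation used by Globus, NetSolve, or any other system above: the names `HeartbeatMonitor`, `beat`, `failed_hosts`, and `reassign`, the timeout value, and the round-robin choice of a replacement host are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor: a host counts as failed when no heartbeat
    has arrived within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def beat(self, host: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this host.
        self.last_seen[host] = now

    def failed_hosts(self, now: float) -> set:
        # A missed heartbeat cannot tell a host crash from a network failure;
        # both simply appear as silence longer than the timeout.
        return {h for h, t in self.last_seen.items() if now - t > self.timeout}


def reassign(jobs: dict, monitor: HeartbeatMonitor, now: float) -> dict:
    """Return a new job-to-host mapping in which jobs on failed hosts are
    moved to live hosts (retry on another machine), chosen round-robin."""
    failed = monitor.failed_hosts(now)
    alive = sorted(set(monitor.last_seen) - failed)
    if not alive:
        raise RuntimeError("no live host available for resubmission")
    result = {}
    moved = 0
    for job, host in sorted(jobs.items()):
        if host in failed:
            result[job] = alive[moved % len(alive)]
            moved += 1
        else:
            result[job] = host
    return result
```

In this sketch, time is passed in explicitly rather than read from a clock, which keeps the behaviour deterministic. A real grid middleware would also have to handle late heartbeats from hosts that were only temporarily unreachable.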