Fault tolerance in computational grids: perspectives, challenges, and issues

SpringerPlus

Table 1 Comparison of fault detection and tolerance techniques used in grids along with their advantages and disadvantages

System	Fault detection technique	Types of faults detected	Fault tolerance technique	Advantages	Disadvantages
Globus Buyya and Murshed (2002), Klutke et al. (2003)	Heartbeat monitor	Host failure, network failure	Resubmit the failed job	Generic failure detection	Cannot handle user-defined exceptions
MDS-2 Buyya and Murshed (2002), Coulouris et al. (2001)	GRRP	Task crash failure	Retry	Task crash failure detection through protocols	Cannot handle user-defined exceptions
Legion Alvisi and Marzullo (1998), Hussain et al. (2006)	Pinging	Task failure	Checkpoint recovery	Application-level fault tolerance	Cannot distinguish task failure from network failure
Condor-G Townend and Xu (2003)	Polling	Host crash, network crash	Retry on same machine	Provides security, job management, and fault tolerance	Retries only on the same machine; cannot detect task crash failure
NetSolve Buyya and Murshed (2002), de Lemos (2006)	Generic heartbeat mechanism	Host crash, task crash, and network failure	Retry on another available machine	Load balancing, heartbeat mechanism, retry on another machine	Does not support diverse failure recovery mechanisms
CoG Kits Guimaraes and de Melo (2011)	N/A	N/A	N/A	Security, resource discovery, and resource management	Failure detection is hard-coded; ignores fault tolerance
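
Two of the techniques compared in Table 1, heartbeat-based failure detection and retrying a failed job on another available machine, can be illustrated with a minimal sketch. It is not taken from Globus, NetSolve, or any other system in the table; the names `HeartbeatMonitor`, `run_with_retry`, and the `timeout` and `max_attempts` parameters are hypothetical and chosen only for illustration. Timestamps are passed in explicitly so that the behaviour is deterministic.

```python
from typing import Callable, Dict, List, Tuple


class HeartbeatMonitor:
    """Hypothetical detector: a host is suspected to have failed when no
    heartbeat has been recorded for more than `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, host: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this host.
        self.last_seen[host] = now

    def failed_hosts(self, now: float) -> List[str]:
        # A missed heartbeat cannot tell a host crash from a network
        # failure; both look identical to the monitor.
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def run_with_retry(
    job: Callable[[str], str],
    hosts: List[str],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    """Run `job` on successive hosts until one succeeds (assumed policy:
    retry on another machine, at most `max_attempts` hosts)."""
    tried: List[str] = []
    for host in hosts[:max_attempts]:
        tried.append(host)
        try:
            return job(host), tried
        except Exception:
            # Any exception is treated as a failure of this host.
            continue
    raise RuntimeError(f"job failed on all hosts tried: {tried}")


def flaky_job(host: str) -> str:
    # Stand-in job used only for this example: host "a" always fails.
    if host == "a":
        raise ConnectionError("host unreachable")
    return f"done on {host}"


monitor = HeartbeatMonitor(timeout=5.0)
monitor.beat("a", now=0.0)
monitor.beat("b", now=4.0)
suspected = monitor.failed_hosts(now=6.0)
result, tried = run_with_retry(flaky_job, ["a", "b"])
```

In this sketch the monitor reports only that a heartbeat is overdue, which mirrors the generic detection listed in the table: it gives no information about why the host went silent, and the retry loop treats every exception alike rather than dispatching on user-defined ones.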