Affaan M, Ansari M (2006) Distributed fault management for computational grids. In: Fifth international conference grid and cooperative computing, GCC 2006, pp 363–368
Alvisi L, Marzullo K (1998) Message logging: pessimistic, optimistic, causal, and optimal. IEEE Trans Softw Eng 24(2):149–159
Article
Google Scholar
Ammendola R, Biagioni A, Frezza O, Cicero FL, Lonardo A, Paolucci PS, Rossetti D, Simula F, Tosoratto L, Vicini P (2015) A hierarchical watchdog mechanism for systemic fault awareness on distributed systems. Future Gener Comput Syst 53:90–99
Article
Google Scholar
Amoon M (2012) A fault-tolerant scheduling system for computational grids. Comput Electr Eng 38(2):399–412
Article
ADS
Google Scholar
Arshad N (2006) A planning-based approach to failure recovery in distributed systems. PhD thesis, University of Colorado
Avizienis A, Laprie J-C, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Dependable Secur Comput 1(1):11–33
Article
Google Scholar
Bahman arasteh, ZadahmadJafarlou M, Hosseini MJ (2012) A dynamic and reliable failure detection and failure recovery services in the grid systems. In: Park JJ, Chao HC, Obaidat MS, Kim J (eds) Computer science and convergence. Lecture notes in electrical engineering (LNEE), vol 114. Springer, Dordrecht, pp 497–509
Baker M, Buyya R, Laforenza D (2002) Grids and grid technologies for wide-area distributed computing. Softw Pract Exp 32(15):1437–1466
Article
MATH
Google Scholar
Balaji P, Buntinas D, Kimpe D (2012) Scalable computing and communications: theory and practice. In: Zomaya Y, Khan SU Wang L (eds) Fault tolerance techniques for scalable computing. John Wiley & Sons Publishing, Hoboken, New Jersey
Google Scholar
Baldoni R, H´elary J-M, Piergiovanni ST (2007) A component-based methodology to design arbitrary failure detectors for distributed protocols. In: 10th IEEE international symposium on object and component-oriented real-time distributed computing, ISORC’07, pp 51–61
Baldoni R, H´elary J-M, Piergiovanni ST (2008) A methodology to design arbitrary failure detectors for distributed protocols. J Syst Archit 54(7):619–637
Article
Google Scholar
Benjamin Khoo B, Veeravalli B (2010) Pro-active failure handling mechanisms for scheduling in grid computing environments. J Parallel Distrib Comput 70(3):189–200
Article
MATH
Google Scholar
Bhagyashree A, Pradeep D, Jayanthy N, Mounica K, Nivejaa S, Saranya Dharani P (2010) A hierarchical fault detection and recovery in a computational grid using watchdog timers. In: 2010 International conference on communication and computational intelligence (INCOCCI), pp 467–471
Bheevgade M, Patrikar RM (2008) Implementation of watch dog timer for fault tolerant computing on cluster server. World Acad Sci Eng Technol 38:265–268
Google Scholar
Bliss NT, Kepner J (2007) pMATLAB parallel MATLAB library. Int J High Perf Comput Appl 21(3):336–359
Article
Google Scholar
Bouguerra M-S, Trystram D, Wagner F (2013) Complexity analysis of checkpoint scheduling with variable costs. IEEE Trans Comput 62(6):1269–1275
Article
MathSciNet
Google Scholar
Brodie M, Rish I, Ma S, Odintsova N, Beygelzimer A (2003) Active probing strategies for problem diagnosis in distributed systems. In: IJCAI, pp 1337–1338
Buyya R, Murshed M (2002) Gridsim: a toolkit for the modelling and simulation of distributed resource management and scheduling for grid computing. Concurr Comput Pract Exp 14(13–15):1175–1220
Article
MATH
Google Scholar
Calado J, da Costa JS (2006) Fuzzy neural networks applied to fault diagnosis. In: Computational intelligence in fault diagnosis. Springer, pp 305–334
Carlsson C, Fuller R (2010) Predictive probabilistic and predictive possibilistic models used for risk assessment of SLAs in grid computing. In: Hüllermeier E, Kruse R, Hoffmann F (eds) IPMU 2010: Information processing and management of uncertainty in knowledge-based systems. Applications: 13th international conference, IPMU 2010, Dortmund, Germany, June 28–July 2, 2010. Proceedings, Communications in computer and information science (CCIS), vol 81. Springer, Heidelberg, pp 747–757
Chan PPW, Lyu MR, Malek M (2007) Reliable web services: methodology, experiment and modelling. In: IEEE International conference on web services, ICWS 2007, pp 679–686
Charoenpornwattana K, Leangsuksun C, Tikotekar A, Vall´ee GR, Scott SL, Chen X, Eckart B, He X, Engelmann C, Sun X-H et al. (2008) A neural networks approach for intelligent fault prediction in hpc environments. In: Proceedings of the high availability and performance computing workshop, Denver, Colorado
Christer Carlsson RF (2011) Risk assessment in grid computing. Possibility Decis Stud Fuzziness Soft Comput 270:145–165
Article
Google Scholar
Cotroneo D, Natella R, Russo S, Scippacercola F (2013) State-driven testing of distributed systems. In: International conference on principles of distributed systems. Springer International Publishing, pp 114–128
Coulouris G, Dollimore J, Kindberg T (2001) Distributed systems: concepts and design, 3rd edition. Paragraph 15:4
MATH
Google Scholar
Czajkowski K, Fitzgerald S, Foster I, Kesselman C (2001) Grid information services for distributed resource sharing. In: Proceedings of the 10th IEEE international symposium on high performance distributed computing, pp 181–194
Das A, De Sarkar A (2012) On fault tolerance of resources in computational grids. Int J Grid Comput Appl 3(3):1
Google Scholar
de Lemos R (2006) Adaptability and fault tolerance. In: International conference on software engineering, ICSE workshop on software engineering for adaptive and self-managing systems (SEAMS). 21–22 May 2006
Delporte-Gallet C, Fauconnier H, Freiling FC (2005) Revisiting failure detection and consensus in omission failure environments. In: Hung DV, Wirsing M (eds) International colloquium on theoretical aspects of computing. Theoretical aspects of computing–ICTAC 2005. Lecture notes in computer science (LNCS), vol 3722. Springer, Heidelberg, pp 394–408
Egwutuoha IP (2014) A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud. PhD thesis, University of Sydney
Egwutuoha IP, Levy D, Selic B, Chen S (2013) A survey of fault tolerance mechanisms and check-point/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
Article
Google Scholar
Engelmann C, Vallee GR, Naughton T, Scott SL (2009) Proactive fault tolerance using pre-emptive migration. In: 2009 17th IEEE euromicro international conference on parallel, distributed and network-based processing, pp 252–257
Foster I, Kesselman C, Tuecke S (2001) The anatomy of the grid: enabling scalable virtual organizations. Int J High Perform Comput Appl 15(3):200–222
Article
Google Scholar
Foster H, Uchitel S, Magee J, Kramer J (2003) Model-based verification of web service compositions. In: Proceedings of the 18th IEEE international conference on automated software engineering, pp 152–161
Frey J, Tannenbaum T, Livny M, Foster I, Tuecke S (2002) Condor-g: a computation management agent for multi-institutional grids. Clust Comput 5(3):237–246
Article
Google Scholar
Fugini MG, Pernici B, Ramoni F (2009) Quality analysis of composed services through fault injection. Inf Syst Front 11(3):227–239
Article
Google Scholar
Gallet M, Yigitbasi N, Javadi B, Kondo D, Iosup A, Epema D (2010) A model for space-correlated failures in large-scale distributed systems. In: D’Ambra P, Guarracino M, Talia D (eds) Euro-Par 2010, Part 1. LNCS, vol 6271. Springer, Heidelberg, pp 88–100
Ganga K, Karthik S, Paul AC (2012) A survey on fault tolerance in workflow management and scheduling. IJARCET 1(8):176
Google Scholar
Garg R, Singh AK (2011) Fault tolerance in grid computing: state of the art and open issues. Int J Comput Sci Eng Surv 2(1):88
Article
Google Scholar
Garg S, van Moorsel A, Vaidyanathan K, Trivedi KS (1998) A methodology for detection and estimation of software aging. In: Proceedings of the ninth international symposium on software reliability engineering, pp 283–292
Ghosh S, Mathur AP, Horgan JR, Li JJ, Wong WE (1997) Software fault injection testing on a distributed system–a case study. In: Proceedings of the 1st international quality week Europe, Brussels, Belgium
Gokuldev S, Valarmathi M (2013) Fault tolerant system for computational and service grid. Int J Eng Innovative Technol 2(10):236–240
Google Scholar
Grimshaw AS, Wulf WA et al (1997) The legion vision of a worldwide virtual computer. Commun ACM 40(1):39–45
Article
Google Scholar
Gu Y, Wu CQ, Liu X, Yu D (2013) Distributed throughput optimization for large-scale scientific workflows under fault-tolerance constraint. J Grid Comput 11(3):361–379
Article
Google Scholar
Guimaraes FP, de Melo ACMA (2011) User-defined adaptive fault-tolerant execution of workflows in the grid. In: 2011 IEEE 11th international conference on computer and information technology (CIT), pp 356–362
Guimaraes FP, C’elestin P, Batista DM, Rodrigues GN, de Melo ACMA, (2013) A framework for adaptive fault-tolerant execution of workflows in the grid: empirical and theoretical analysis. J Grid Comput 12(1):127–151
Google Scholar
Gullapalli S, Czajkowski K, Kesselman C, Fitzgerald S (2001) The grid notification framework. Global grid forum, Draft GWD-GIS-019
Gurer DW, Khan I, Ogier R, Keffer R (1996) An artificial intelligence approach to network fault management. SRI International, Menlo Park, CA, USA
Haider S (2007) Component based proactive fault tolerant scheduling through cross-layer design in computational grid. Master’s thesis, Federal Urdu University of Arts, Science and Technology, Islamabad, Pakistan
Haider S, Ansari NR (2012) Temperature based fault forecasting in computer clusters. In: IEEE 15th international multi topic conference (IEEE-INMIC), pp 69–77
Haider S, Imran M, Niaz I, Ullah S, Ansari M (2007) Component based proactive fault tolerant scheduling in computational grid. In: IEEE international conference on emerging technologies, (IEEE-ICET) 2007, pp 119–124
Haider S, Ansari NR, Akbar M, Perwez MR, Ghori KM (2011) Fault tolerance in distributed paradigms. In: Proceedings of fifth international conference on computer communication and management. IACSIT Press, Singapore
Hsueh MC, Tsai TK, Iyer RK (1997) Fault injection techniques and tools. Computer 30(4):75–82
Article
Google Scholar
Hussain N, Ansari M, Yasin M, Rauf A, Haider S (2006) Fault tolerance using parallel shadow image servers (psis) in grid based computing environment. In: 2006 International conference on emerging technologies, IEEE, pp 703–707
Hussain H, Malik SUR, Hameed A, Khan SU, Bickler G, Min-Allah N, Qureshi MB, Zhang L, Yongji W, Ghani N et al (2013) A survey on resource allocation in high performance distributed computing systems. Parallel Comput 39(11):709–736
Article
MathSciNet
Google Scholar
Hwang S, Kesselman C (2003) A flexible framework for fault tolerance in the grid. J Grid Comput 1(3):251–272
Article
MATH
Google Scholar
Joshi KR, Hiltunen MA, Sanders WH, Schlichting RD (2011) Probabilistic model-driven recovery in distributed systems. IEEE Trans Dependable Secur Comput 8(6):913–928
Article
Google Scholar
Katzela I (1996) Fault diagnosis in telecommunication networks. PhD thesis, Columbia University
Kepner J, Ahalt S (2004) Matlab MPI. J Parallel Distrib Comput 64(8):997–1005
Article
Google Scholar
Khan FG, Qureshi K, Nazir B (2010) Performance evaluation of fault tolerance techniques in grid computing system. Comput Electr Eng 36(6):1110–1122
Article
MATH
Google Scholar
Klutke G-A, Kiessler PC, Wortman M (2003) A critical look at the bathtub curve. IEEE Trans Reliab 52(1):125–129
Article
Google Scholar
Kondo D, Javadi B, Iosup A, Epema D (2010) The failure trace archive: enabling comparative analysis of failures in diverse distributed systems. In: 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid), pp 398–407
Latchoumy P, Khader PSA (2011) Survey on fault tolerance in grid computing. IJCSES 2:97
Article
Google Scholar
Latchoumy P, Khader PSA (2012) Fault tolerant advance reservation-based scheduling in computational grid. Eur J Sci Res 89(3):409–421
Google Scholar
Li H, Groep D, Wolters L, Templon J (2006) Job failure analysis and its implications in a large-scale production grid. In: Second IEEE international conference on e-science and grid computing, 2006, pp 27–27
Litvinova A, Engelmann C, Scott SL (2009) A proactive fault tolerance framework for high-performance computing. In: Proceedings of the 9th IASTED international conference, vol 676, p 105
Malik S, Nazir B, Qureshi K, Khan IA (2012) A reliable checkpoint storage strategy for grid. Computing 95(7):611–632
Article
Google Scholar
Massie ML, Chun BN, Culler DE (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput 30(7):817–840
Article
Google Scholar
Medeiros R, Cirne W, Brasileiro F, Sauv´e J (2003) Faults in grids: Why are they so bad and what can be done about it? In: IEEE proceedings of the fourth international workshop on grid computing, pp 18–24
Moon Y-H, Youn C-H (2015) Multi hybrid job scheduling for fault-tolerant distributed computing in policy-constrained resource networks. Comput Netw 82:81–95
Article
Google Scholar
Nagarajan AB, Mueller F, Engelmann C, Scott SL (2007) Proactive fault tolerance for hpc with xen virtualization. In: Proceedings of the 21st annual international conference on supercomputing, ACM, pp 23–32
Nazir B, Qureshi K, Manuel P (2009) Adaptive checkpointing strategy to tolerate faults in economy based grid. J Supercomput 50(1):1–18
Article
Google Scholar
Nazir B, Qureshi K, Manuel P (2012) Replication based fault tolerant job scheduling strategy for economy driven grid. J Supercomput 62(2):855–873
Article
Google Scholar
Nguyen-Tuong A (2000) Integrating fault-tolerance techniques in grid applications. PhD thesis, University of Virginia
Qin F (2012) QoS-constrained resource scheduling in grid computing. In: Proceedings of the international conference on information engineering and applications (IEA) 2012, pp 407–414. Springer
Qureshi K, Khan FG, Manuel P, Nazir B (2011) A hybrid fault tolerance technique in grid computing system. J Supercomput 56(1):106–128
Article
Google Scholar
Rajachandrasekar R, Besseron X, Panda DK (2012) Monitoring and predicting hardware failures in hpc clusters with ftb-ipmi. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops & PhD forum (IPDPSW), pp 1136–1143
Roohi Shabrin S, Devi Prasad B, Prabu D, Pallavi RS, Revathi P (2006) Memory leak detection in distributed system. In: Proceedings of world academy of science, engineering and technology, vol 16, pp 1307–6884
Schroeder B, Gibson G et al (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Dependable Secur Comput 7(4):337–350
Article
Google Scholar
Selic B (2004) Fault tolerance techniques for distributed systems. IBM developers manual
Sethi AS et al (2004) A survey of fault localization techniques in computer networks. Sci Comput Progr 53(2):165–194
Article
MathSciNet
MATH
Google Scholar
Sharma Gaurav, Martin Jos (2009) MATLAB®: a language for parallel computing. Int J Parallel Progr 37(1):3–36
Article
MATH
Google Scholar
Shestak V, Chong EK, Maciejewski AA, Siegel HJ (2012) Probabilistic resource allocation in heterogeneous distributed systems with random failures. J Parallel Distrib Comput 72(10):1186–1194
Article
MATH
Google Scholar
Sistla AP, Welch JL (1989) Efficient distributed recovery using message logging. In: Proceedings of the eighth annual ACM symposium on principles of distributed computing, pp 223–238
Siva Sathya S, Syam Babu K (2010) Survey of fault tolerant techniques for grid. Comput Sci Rev 4(2):101–120
Article
Google Scholar
Stelling P, DeMatteis C, Foster I, Kesselman C, Lee C, von Laszewski G (1999) A fault detection service for wide area distributed computations. Clust Comput 2(2):117–128
Article
Google Scholar
Sun D, Chang G, Miao C, Wang X (2013) Analyzing, modelling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments. J Supercomput 66(1):193–228
Article
Google Scholar
Tang B, He H, Fedak G (2015) HybridMR: a new approach for hybrid map reduce combining desktop grid and cloud infrastructures. Concurr Comput Pract Exp 27(6):4140–4155
Article
Google Scholar
Townend P, Xu J (2003) Fault tolerance within a grid environment. Timeout 1(S2):S3
Google Scholar
Trefethen AE, Menon VS, Chang CC, Czajkowski G, Myers C, Trefethen LN (1996) MultiMATLAB: MATLAB on multiple processors. Cornell University, Ithaca
Google Scholar
Trodhandl C, Weiss B (2008) A concept for hybrid fault injection in distributed systems. TAIC PART Publishing, Windsor
Google Scholar
Vaidyanathan K, Trivedi KS (2001) Extended classification of software faults based on aging. Duke University, Durham
Google Scholar
Valentini GL, Lassonde W, Khan SU, Min-Allah N, Madani SA, Li J, Zhang L, Wang L, Ghani N, Kolodziej J et al (2013) An overview of energy efficiency techniques in cluster computing systems. Clust Comput 16(1):3–15
Article
Google Scholar
Vallee G, Engelmann C, Tikotekar A, Naughton T, Charoenpornwattana K, Leangsuksun C, Scott SL (2008) A framework for proactive fault tolerance. In: Third international conference on availability, reliability and security, ARES 08, pp 659–664
Viktors B (2002) Fundamentals of grid computing. IBM redbooks paper, pp 1–28
Von Laszewski G, Foster I, Gawor J (2000) Cog kits: a bridge between commodity distributed computing and high-performance grids. In: Proceedings of the ACM 2000 conference on Java Grande, pp 97–106
Wei-Tek T, Ray P, Lian Y, Saimi A, Zhibin C (2003) Scenario-based web services testing with distributed agents. IEICE Trans Inf Syst 86(10):2130–2144
Google Scholar
Yu J, Buyya R (2005) A taxonomy of workflow management systems for grid computing. J Grid Comput 3(3–4):171–200
Article
Google Scholar
Zhang Y, Mandal A, Koelbel C, Cooper K (2009) Combined fault tolerance and scheduling techniques for workflow applications on computational grids. In: 9th IEEE/ACM international symposium on cluster computing and the grid, CCGRID’09, pp 244–251
Zheng Z, Lyu MR (2008) A distributed replication strategy evaluation and selection framework for fault tolerant web services. In: IEEE International conference on web services, ICWS’08, pp 145–152
Zheng Z, Lyu MR (20098) A qos-aware fault tolerant middleware for dependable service composition. In: IEEE/IFIP international conference on dependable systems & networks, DSN’09, pp 239–248
Zheng Z, Lyu MR (2010) An adaptive qos-aware fault tolerance strategy for web services. Empir Softw Eng 15(4):323–345
Article
Google Scholar
Zhou Y, Lakamraju V, Koren I, Krishna CM (2007) Software-based failure detection and recovery in programmable network interfaces. IEEE Trans Parallel Distrib Syst 18(11):1539–1550
Article
Google Scholar