Document Type

Article

Language

eng

Publication Date

4-2014

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Source Publication

IEEE Transactions on Parallel and Distributed Systems

Source ISSN

1045-9219

Abstract

While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
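
The idea of drawing correlated-failure patterns by Monte Carlo can be illustrated with a minimal Python sketch. It is not the article's procedure: it assumes a simplified PSRG representation in which each risk group is a pair (probability that the shared risk event occurs, conditional failure probability of each member CE given the event), and the names `draw_failure_pattern` and `estimate_all_survive` are hypothetical.

```python
import random

def draw_failure_pattern(psrgs, rng):
    """Draw one correlated-failure pattern (a set of failed CE indices).

    Assumed, simplified PSRG format: a list of
    (event_prob, {ce_index: conditional_fail_prob}) pairs.
    """
    failed = set()
    for event_prob, members in psrgs:
        if rng.random() < event_prob:          # shared risk event occurs
            for ce, p_fail in members.items():
                if rng.random() < p_fail:      # CE fails given the event
                    failed.add(ce)
    return failed

def estimate_all_survive(psrgs, trials=20_000, seed=0):
    """Monte Carlo estimate of the probability that no CE fails."""
    rng = random.Random(seed)
    ok = sum(1 for _ in range(trials) if not draw_failure_pattern(psrgs, rng))
    return ok / trials

if __name__ == "__main__":
    # Hypothetical example: CEs 0 and 1 share one risk (e.g. a common rack).
    psrgs = [(0.2, {0: 1.0, 1: 1.0})]
    print(estimate_all_survive(psrgs))
```

In this toy setting the two CEs always fail together, so the estimate approaches 0.8, whereas two CEs failing independently with probability 0.2 each would both survive with probability 0.64. The gap is the kind of effect a reliability model must capture once the independence assumption is dropped.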

Comments

Accepted version. IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 4 (April 2014): 1034-1043. DOI. This article is © Institute of Electrical and Electronics Engineers (IEEE). Used with permission.

Majeed M. Hayat was affiliated with University of New Mexico, Albuquerque at the time of publication.
