Electrical and Computer Engineering Faculty Research and Publications

Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation

Jorge E. Pezoa, University of New Mexico
Sagar Dhakal, Naval Research Laboratory
Majeed M. Hayat, Marquette University

Document Type

Article

Language

eng

Publication Date

10-2010

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Source Publication

IEEE Transactions on Parallel and Distributed Systems

Source ISSN

1045-9219

Abstract

In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.
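The service-reliability definition and the Monte Carlo validation mentioned in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's model: it assumes independent nodes, exponentially distributed service and failure times, no communication delays, and no load balancing, so each node must finish its own queue before it fails. Under those assumptions the reliability has the closed form $\prod_i \left(\mu_i / (\mu_i + \lambda_i)\right)^{m_i}$, which the simulation should approximate. The function names and all parameter values are hypothetical.

```python
import random


def analytic_reliability(queues, service_rates, failure_rates):
    """Closed-form reliability under the simplifying assumptions above.

    Node i serves m_i tasks at rate mu_i and fails at rate lam_i; it finishes
    before failing with probability (mu_i / (mu_i + lam_i)) ** m_i.
    """
    reliability = 1.0
    for m, mu, lam in zip(queues, service_rates, failure_rates):
        reliability *= (mu / (mu + lam)) ** m
    return reliability


def simulate_reliability(queues, service_rates, failure_rates,
                         trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity (assumed toy model)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        all_served = True
        for m, mu, lam in zip(queues, service_rates, failure_rates):
            failure_time = rng.expovariate(lam)  # time at which the node fails
            busy_time = sum(rng.expovariate(mu) for _ in range(m))
            if busy_time > failure_time:  # node failed with tasks still queued
                all_served = False
                break
        successes += all_served
    return successes / trials


if __name__ == "__main__":
    # Hypothetical two-node system: queue lengths, service rates, failure rates.
    queues = [5, 3]
    service_rates = [1.0, 2.0]
    failure_rates = [0.05, 0.1]
    print("analytic :", analytic_reliability(queues, service_rates, failure_rates))
    print("simulated:", simulate_reliability(queues, service_rates, failure_rates))
```

In this toy setting no tasks are exchanged between nodes, so it cannot show the benefit of load balancing; it only makes concrete what "serving all queued tasks before the nodes fail" means and how a simulation can be checked against an analytical value.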

Comments

Accepted version. IEEE Transactions on Parallel and Distributed Systems, Vol. 21, No. 10 (October 2010): 1531-1544. © 2010 Institute of Electrical and Electronics Engineers (IEEE). Used with permission.

Majeed M. Hayat was affiliated with University of New Mexico, Albuquerque at the time of publication.

Recommended Citation

Pezoa, Jorge E.; Dhakal, Sagar; and Hayat, Majeed M., "Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation" (2010). Electrical and Computer Engineering Faculty Research and Publications. 583.
https://epublications.marquette.edu/electric_fac/583

Included in

Computer Engineering Commons, Electrical and Computer Engineering Commons
