Document Type

Conference Proceeding



Format of Original

5 p.

Publication Date



Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Source Publication

42nd International Conference on Parallel Processing (ICPP) 2013

Source ISSN


Original Item ID

doi: 10.1109/CLUSTER.2013.6702672


Boosting the performance and energy efficiency of scientific applications running on high performance computing systems has become crucial. Software- and hardware-based solutions for improving communication performance are recognized as a significant means of achieving performance gains, and thus energy savings, for such applications. Because distributed matrix multiplication is a fundamental component of most numerical linear algebra algorithms, improving its performance and energy efficiency is of major concern. To that end, we propose a high performance communication scheme that fully exploits network bandwidth via non-blocking pipeline broadcast with a tuned chunk size. Empirically, on a 64-core cluster, the scheme achieves performance gains of up to 8.4% and energy savings of up to 6.9% compared to blocking pipeline broadcast, and performance gains of up to 6.5% and energy savings of up to 6.1% compared to binomial tree broadcast.
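The trade-off behind a tuned chunk size can be illustrated with a rough cost model. The sketch below is not the paper's model or implementation. It assumes the textbook latency-bandwidth model, in which sending n bytes costs alpha + beta * n, a chain-shaped pipeline, and hypothetical function names and parameter values. It does not model the overlap gained from non-blocking calls.

```python
import math

def pipeline_bcast_time(m, c, p, alpha, beta):
    """Estimated time for a chain pipeline broadcast (assumed model).

    m: message size in bytes, c: chunk size in bytes, p: process count,
    alpha: per-message latency (s), beta: per-byte transfer time (s/byte).
    """
    chunks = math.ceil(m / c)
    # The first chunk needs p - 1 hops to reach the end of the chain;
    # each remaining chunk arrives one step later.
    steps = (p - 1) + (chunks - 1)
    return steps * (alpha + beta * c)

def binomial_bcast_time(m, p, alpha, beta):
    """Estimated time for a binomial tree broadcast of the whole message."""
    rounds = (p - 1).bit_length()  # equals ceil(log2(p)) for p >= 1
    return rounds * (alpha + beta * m)

def tune_chunk(m, p, alpha, beta, min_chunk=1024):
    """Pick the power-of-two chunk size (or m itself) with the lowest estimate."""
    candidates = []
    c = min_chunk
    while c < m:
        candidates.append(c)
        c *= 2
    candidates.append(m)
    return min(candidates, key=lambda c: pipeline_bcast_time(m, c, p, alpha, beta))

if __name__ == "__main__":
    # Hypothetical parameters: 1 MiB message, 64 processes,
    # 10 microseconds latency, 1 ns per byte.
    m, p, alpha, beta = 1 << 20, 64, 1e-5, 1e-9
    best = tune_chunk(m, p, alpha, beta)
    print("chunk size:", best)
    print("pipeline  :", pipeline_bcast_time(m, best, p, alpha, beta))
    print("binomial  :", binomial_bcast_time(m, p, alpha, beta))
```

Under this model, small chunks pay the latency term many times, while large chunks leave most of the chain idle during the pipeline fill, which is why an intermediate chunk size tends to minimize the estimate.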


Accepted version. Published in the proceedings of the 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013: 1-5. DOI: 10.1109/CLUSTER.2013.6702672. © 2013 The Institute of Electrical and Electronics Engineers. Used with permission.
