Document Type

Conference Proceeding



Format of Original

5 p.

Publication Date



Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Source Publication

42nd International Conference on Parallel Processing (ICPP) 2013

Source ISSN


Original Item ID

doi: 10.1109/CLUSTER.2013.6702672


Boosting the performance and energy efficiency of scientific applications running on high performance computing systems has become crucial. Software- and hardware-based solutions for improving communication performance are recognized as a significant means of achieving performance gains, and thus energy savings, for such applications. Distributed matrix multiplication is a fundamental component of most numerical linear algebra algorithms, so improving its performance and energy efficiency is a major concern. To that end, we propose a high performance communication scheme that fully exploits network bandwidth via non-blocking pipeline broadcast with a tuned chunk size. Empirically, on a 64-core cluster, the scheme achieves performance gains of up to 8.4% and energy savings of up to 6.9% compared to blocking pipeline broadcast, and performance gains of up to 6.5% and energy savings of up to 6.1% compared to binomial tree broadcast.
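To illustrate why chunk size matters for a pipeline broadcast, the following is a minimal sketch based on the textbook latency-bandwidth (alpha-beta) cost model. It is not the paper's own model or implementation. The function names, the chain topology, and the values chosen for alpha (per-message latency in seconds) and beta (transfer time per byte) are all assumptions made for illustration. The model also does not distinguish blocking from non-blocking transfers; it only shows the trade-off between per-chunk latency and pipeline fill time.

```python
import math


def pipeline_bcast_time(p, msg_bytes, chunk_bytes, alpha, beta):
    """Estimated time for a chain (pipeline) broadcast among p processes.

    Assumed model: each step forwards one chunk to the next process and
    costs alpha + chunk_bytes * beta.
    """
    n_chunks = math.ceil(msg_bytes / chunk_bytes)
    # The first chunk needs p - 1 hops to reach the last process; the
    # remaining n_chunks - 1 chunks arrive one step apart behind it.
    steps = (p - 1) + (n_chunks - 1)
    return steps * (alpha + chunk_bytes * beta)


def binomial_bcast_time(p, msg_bytes, alpha, beta):
    """Estimated time for a binomial tree broadcast of the whole message."""
    rounds = math.ceil(math.log2(p))
    return rounds * (alpha + msg_bytes * beta)


def best_chunk(p, msg_bytes, alpha, beta, candidates):
    """Pick the candidate chunk size with the lowest modeled pipeline time."""
    return min(
        candidates,
        key=lambda c: pipeline_bcast_time(p, msg_bytes, c, alpha, beta),
    )


if __name__ == "__main__":
    # Hypothetical parameters: 64 processes, 8 MiB message,
    # 10 microseconds latency, 1 ns per byte (about 1 GB/s).
    p, msg, alpha, beta = 64, 8 * 1024 * 1024, 1e-5, 1e-9
    candidates = [2 ** k for k in range(10, 24)]  # 1 KiB .. 8 MiB
    chunk = best_chunk(p, msg, alpha, beta, candidates)
    print("best chunk (bytes):", chunk)
    print("pipeline time (s): ", pipeline_bcast_time(p, msg, chunk, alpha, beta))
    print("binomial time (s): ", binomial_bcast_time(p, msg, alpha, beta))
```

Under these assumed parameters, very small chunks pay the latency term too many times, while very large chunks leave most of the chain idle as the pipeline fills, so the modeled optimum lies between the two extremes. Real tuning would have to be done empirically on the target cluster.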


Accepted version. Published in the proceedings of the 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013: 1-5. DOI: 10.1109/CLUSTER.2013.6702672. © 2013 The Institute of Electrical and Electronics Engineers. Used with permission.
