Optimization of BLAS on the Cell processor

Abstract

The unique architecture of the heterogeneous multi-core Cell processor offers great potential for high-performance computing. Its features include high memory bandwidth through DMA, user-managed local stores, and a SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high-performance BLAS library. We propose techniques to partition and distribute data across SPEs so that DMA is handled efficiently. We show that suitable pre-processing of data leads to significant performance improvements when the data is unaligned. In addition, we use a combination of two kernels - a specialized high-performance kernel for the more frequently occurring cases and a generic kernel for handling boundary cases - to obtain better performance. Using these techniques for double precision, we obtain up to 70-80% of peak performance for different memory-bandwidth-bound Level 1 and 2 routines and up to 80-90% for computation-bound Level 3 routines. © 2008 Springer Berlin Heidelberg.
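
The two-kernel idea and the partitioning of data across SPEs can be illustrated with a minimal sketch in portable C. It is not the paper's implementation: the routine chosen (DAXPY), the block width `BLOCK`, and all function names are assumptions made for illustration, a serial loop stands in for the SPEs, and DMA transfers and SIMD intrinsics are not modeled. The sketch only shows how work might be split on block boundaries so that the specialized kernel covers the bulk of each chunk and the generic kernel handles the leftover elements.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4 /* hypothetical block width required by the fast kernel */

/* Specialized kernel (assumed shape): n must be a multiple of BLOCK,
 * which lets the loop be fully unrolled per block. */
static void daxpy_fast(size_t n, double alpha, const double *x, double *y)
{
    assert(n % BLOCK == 0);
    for (size_t i = 0; i < n; i += BLOCK) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
}

/* Generic kernel: accepts any length, used only for boundary elements. */
static void daxpy_generic(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Run the fast kernel on the block-aligned bulk, the generic one on the tail. */
static void daxpy_chunk(size_t n, double alpha, const double *x, double *y)
{
    size_t bulk = n - n % BLOCK;
    if (bulk > 0)
        daxpy_fast(bulk, alpha, x, y);
    if (bulk < n)
        daxpy_generic(n - bulk, alpha, x + bulk, y + bulk);
}

/* Assign worker w a contiguous range [*begin, *end) whose start is a
 * multiple of BLOCK; only the last non-empty range can hold a partial block. */
void chunk_range(size_t n, size_t workers, size_t w, size_t *begin, size_t *end)
{
    assert(workers > 0 && w < workers);
    size_t blocks = (n + BLOCK - 1) / BLOCK; /* last block may be partial */
    size_t per = blocks / workers;
    size_t extra = blocks % workers;
    size_t b0 = w * per + (w < extra ? w : extra);
    size_t b1 = b0 + per + (w < extra ? 1 : 0);
    *begin = b0 * BLOCK < n ? b0 * BLOCK : n;
    *end = b1 * BLOCK < n ? b1 * BLOCK : n;
}

/* Serial stand-in for distributing chunks to SPEs: y = alpha * x + y. */
void daxpy_partitioned(size_t n, double alpha, const double *x, double *y,
                       size_t workers)
{
    for (size_t w = 0; w < workers; ++w) {
        size_t begin, end;
        chunk_range(n, workers, w, &begin, &end);
        daxpy_chunk(end - begin, alpha, x + begin, y + begin);
    }
}
```

With `n = 11` and two workers, for example, the first worker receives elements 0 to 7 and runs only the fast kernel, while the second receives elements 8 to 10 and falls through to the generic kernel. Keeping every chunk boundary on a multiple of `BLOCK` confines the slow path to at most one worker.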

Citation (APA)

Saxena, V., Agrawal, P., Sabharwal, Y., Garg, V. K., Kuruvilla, V. A., & Gunnels, J. A. (2008). Optimization of BLAS on the Cell processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5374 LNCS, pp. 18–29). Springer Verlag. https://doi.org/10.1007/978-3-540-89894-8_6