Posted in kerneltrap.org on December 25, 2009 – 3:43am

I am following an interesting class on Languages for Scientific Computing taught by Prof. Bientinesi. A week ago he explained the theoretical peak performance of a processor, which is simply n_cores * frequency * ops_per_cycle_per_cpu. That, however, is not the performance one gets in practice.
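As a quick illustration of that formula, here is a minimal sketch. The machine in the example (4 cores, 2.5 GHz, 4 floating-point operations per cycle per core) is made up for the arithmetic and does not describe any particular CPU; the function name is my own.

```python
def theoretical_peak_gflops(n_cores: int, frequency_ghz: float, ops_per_cycle: int) -> float:
    """Theoretical peak in GFLOP/s: n_cores * frequency * ops_per_cycle_per_cpu."""
    return n_cores * frequency_ghz * ops_per_cycle


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
peak = theoretical_peak_gflops(n_cores=4, frequency_ghz=2.5, ops_per_cycle=4)
print(f"{peak:.1f} GFLOP/s")  # 40.0 GFLOP/s
```

With the frequency given in GHz, the result comes out directly in GFLOP/s.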

A processor often sits idle waiting for data to arrive: no data, no processing. So, to get high performance, the data must always be available when the processor needs it. Data, however, are stored in memory, and the type of memory that the processor can access directly holds only a very small amount of data because that memory is expensive. One of the ABCs of HPC (High-Performance Computing) is the memory pyramid: the top of the pyramid is the most expensive but scarcest type of memory, while the bottom is the cheapest but most plentiful, as depicted here. The performance a processor reaches in practice therefore depends on the nature of the algorithm it is running.

As the professor said, DGEMM (Double-precision General Matrix Multiply) from BLAS (Basic Linear Algebra Subprograms) is a very commonly used routine that keeps its data close to the processor very well. DGEMM computes C = C + A x B where $A, B, C \in \mathbb{R}^{n \times n}$ (i.e., A, B, C are square matrices). It requires $n^3$ multiplications for A x B and $n^3$ additions (i.e., $n^2 (n - 1)$ for A x B = T and $n^2$ for C + T), as well as $3 n^2$ memory reads (i.e., to read the cells of A, B and C) and $n^2$ memory writes (i.e., to write the cells of C). The cpu-to-memory activity ratio is $\frac{n^3 + n^3}{3 n^2 + n^2} = \frac{2n^3}{4n^2} = \frac{n}{2}$, which means that the bigger the matrices are, the busier the processor is, because the number of operations performed compensates for the cost of pulling the data from the lower levels of memory. DGEMM is therefore among the routines that come closest to the theoretical peak in practice.
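The counts above are easy to check numerically. The sketch below simply tallies the operations and memory accesses exactly as counted in the text (every cell read or written once); the function name and dictionary keys are my own.

```python
def dgemm_counts(n: int) -> dict:
    """Operation and memory-access counts for C = C + A x B with n x n matrices."""
    mults = n**3                      # multiplications for A x B
    adds = n**2 * (n - 1) + n**2      # additions for T = A x B, plus C + T
    reads = 3 * n**2                  # cells of A, B and C
    writes = n**2                     # cells of C
    flops = mults + adds
    mem_ops = reads + writes
    return {"flops": flops, "mem_ops": mem_ops, "ratio": flops / mem_ops}


for n in (10, 100, 1000):
    c = dgemm_counts(n)
    print(n, c["flops"], c["mem_ops"], c["ratio"])
```

For n = 10, 100 and 1000 the ratio comes out as 5, 50 and 500, i.e. n/2: the work per memory access grows linearly with the matrix size.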

As a consequence, DGEMM is used in the benchmarks that rank the world's supercomputers, so it is very important to optimize the machine instructions that implement DGEMM for each specific architecture. For example, Kazushige Goto is famous for hand-optimizing these machine instructions, as described here. Of course, one must also take the cache size into account when crafting DGEMM code.
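To give an idea of what "considering the cache size" looks like in code, here is a minimal sketch of a blocked (tiled) version of C = C + A x B. This is not Goto's implementation, only the general tiling idea: the matrices are processed in small tiles so that each tile can be reused while it is still in fast memory. The `block` parameter is a hypothetical tile size; in real code it would be tuned to the cache of the target machine, and in pure Python the loop structure is the point, not the speed.

```python
def blocked_matmul_update(C, A, B, block=2):
    """In-place C = C + A x B for square n x n lists of lists, tile by tile."""
    n = len(A)
    for ii in range(0, n, block):              # tile row of C
        for kk in range(0, n, block):          # tile along the shared dimension
            for jj in range(0, n, block):      # tile column of C
                # Multiply one tile of A by one tile of B into one tile of C.
                for i in range(ii, min(ii + block, n)):
                    row_c, row_a = C[i], A[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], B[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
print(blocked_matmul_update(C, A, B, block=1))  # [[20, 22], [43, 51]]
```

The result is the same for any tile size; only the order in which the cells are visited changes.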

To conclude, if you want to know the peak performance that your CPU can reach in practice, just run the DGEMM of an optimized BLAS.
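One rough way to try this from Python is sketched below. It assumes NumPy is installed and linked against an optimized BLAS, in which case multiplying two double-precision matrices is normally handed to the BLAS matrix-multiply routine. The helper names and the matrix size are my own choices, and the rate uses the $2n^3$ operation count from above even though the call computes A x B without the final addition to C, so treat the number as an estimate.

```python
import time


def measure_gflops(matmul, a, b, n, repeats=3):
    """Best-of-`repeats` rate in GFLOP/s for an n x n product, counting 2 * n**3 operations."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return 2 * n**3 / best / 1e9


def measure_numpy(n=2000):
    """Time a double-precision n x n product; assumes NumPy with an optimized BLAS."""
    import numpy as np  # imported here so the helper above works without NumPy

    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    return measure_gflops(np.matmul, a, b, n)
```

Calling `measure_numpy()` and comparing the result with n_cores * frequency * ops_per_cycle_per_cpu for your machine shows how close DGEMM gets to the theoretical peak.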