Compiler Optimizations


Extracting the cache size from timing experiments on simple vector operations used to be straightforward. However, modern compilers have become darn good at recognizing those kernel operations, optimizing them, and sometimes replacing the loop you coded with a high-performance one from an internal library.

Each of the following graphs shows the performance (in Mflop/sec) of three different kernels:

    saxpy:   the vector update y = a*x + y, with scalar a
    dotpxy:  the dot product of two distinct vectors x and y
    dotpxx:  the same dot product routine applied to one vector x with itself

Notice that the first two (saxpy and dotpxy) involve 2*n+1 data elements, while the third one (dotpxx) only involves n+1 data words. The same dot product function is used to compute both dotpxy and dotpxx, but the first is called with two different vectors x and y, while dotpxx is called with the same vector for both arguments. Because of this, dotpxx touches only about half as much data in memory as saxpy and dotpxy.
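
For concreteness, the kernels can be sketched in C++ roughly as follows. This is illustrative only; the actual benchmark source is not reproduced here, and the function names are just labels.

    // saxpy: y <- a*x + y.  Per element: 2 flops, 2 loads, and 1 store.
    void saxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    // dot product: per element 2 flops and 2 loads.
    // dotpxy times dotp(n, x, y) with distinct vectors;
    // dotpxx times dotp(n, x, x) with the same vector twice.
    double dotp(int n, const double *x, const double *y)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }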

The plots differ only in the compiler optimizations used, indicated in the heading above each one. Be aware that the range of the y-axis is not the same for all of the plots. All results are from a machine (hostname maya2) with 4 Gbytes of memory, a 4 Mbyte second-level cache, and a 2.4 GHz Intel Core 2 Duo processor (not the "Extreme" version). The Intel compiler version 9.1 was used on a C++ program with statically allocated arrays; identical conclusions result from a Fortran code with statically declared arrays.
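
The Mflop/sec rates in these plots come from timing many repetitions of a kernel and dividing the known flop count by the elapsed time. A minimal sketch of such a harness, assuming the dotp kernel above, the standard clock() timer, and an arbitrary repetition count (the actual harness used for these runs is not shown here):

    #include <stdio.h>
    #include <time.h>

    #define N    5000
    #define REPS 100000

    static double x[N], y[N];

    double dotp(int n, const double *a, const double *b)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        volatile double sink = 0.0;   // keep each result live so the timed loop is not removed
        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            sink += dotp(N, x, y);    // an aggressive compiler may still transform this loop --
                                      // exactly the difficulty discussed above
        double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

        // 2*N flops per call: one multiply and one add per element
        double mflops = 2.0 * N * (double)REPS / seconds / 1.0e6;
        printf("dotpxy: %.1f Mflop/sec (checksum %g)\n", mflops, sink);
        return 0;
    }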


Compiler Optimizations: -O0 on an Intel Core 2 Duo

[maya2_O0.png]

This almost matches the results of the load/store analysis: the ratios of memory references to flops are

    saxpy:              3 references per 2 flops  (load x(i) and y(i), store y(i); one multiply and one add)
    dotpxy and dotpxx:  2 references per 2 flops  (load the two operands; one multiply and one add)

so saxpy is expected to be slower. However, a dot product between two different vectors should be about half the speed of a dot product of a vector with itself, since it touches twice as much data in memory; the performance above has them almost identical. What causes the periodic build-up and drop in the performance of saxpy? Answer: I don't know. It may be related to memory page sizes being reached, or to interactions between different levels of cache; in any case it is not relevant to the questions being addressed.


Compiler Optimizations: -O1 on an Intel Core 2 Duo

[maya2_O1.png]

Now the rates are in the Gflop range. saxpy takes a nose dive before the dot products do, around n = 2000; with 8-byte double precision words, a single vector of that length occupies 8*n = 16000 bytes. Since saxpy uses two vectors, the drop occurs when the working set reaches roughly 32k bytes, which is the size of the first-level data cache. The largest vector size of n = 5000 involves only 8*n = 40k bytes per vector, far too small to exceed the second-level cache size.
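
As a quick check of that crossover, assuming 8-byte doubles, the 32 Kbyte first-level data cache mentioned above, and the 4 Mbyte second-level cache from the machine description:

    #include <stdio.h>

    int main(void)
    {
        const long L1_BYTES = 32L * 1024;        /* first-level data cache (from the text) */
        const long L2_BYTES = 4L * 1024 * 1024;  /* second-level cache (from the machine description) */

        /* saxpy streams two vectors of n doubles each */
        long n_L1 = L1_BYTES / (2 * sizeof(double));   /* ~2048: where saxpy falls out of L1 */
        long n_L2 = L2_BYTES / (2 * sizeof(double));   /* ~262144: where it would fall out of L2 */

        printf("saxpy working set exceeds L1 around n = %ld\n", n_L1);
        printf("saxpy working set exceeds L2 around n = %ld\n", n_L2);
        return 0;
    }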


Compiler Optimizations: -O2 on an Intel Core 2 Duo

[maya2_O2.png]


Compiler Optimizations: -O3 on an Intel Core 2 Duo

[maya2_O3.png]

Optimization levels -O2 and -O3 give roughly the same performance as -O1.


Compiler Optimizations: -sse3 on an Intel Core 2 Duo

[maya2_sse3.png]

Using the on-chip vector instruction set SSE3 gives another big boost to performance, from 1.6 Gflop/sec to about 4.5 Gflop/sec. The SSE3 instruction set provides hardware pipelining and vectorization, operating on pairs of double precision words with a single instruction.
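
To give a flavor of what the compiler generates, here is a hand-written sketch of saxpy using SSE intrinsics, which process two doubles per instruction. (The double-precision arithmetic shown is actually part of SSE2; SSE3 adds further instructions, such as horizontal adds, that help with reductions like the dot product.) This is purely illustrative; the benchmarks above rely on the compiler's own vectorizer rather than hand-written intrinsics.

    #include <emmintrin.h>   /* SSE intrinsics: 128-bit registers holding 2 doubles */

    /* saxpy with explicit 2-wide SIMD; n is assumed even and the arrays
       are assumed 16-byte aligned (required by the aligned load/store). */
    void saxpy_sse(int n, double a, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(a);              /* broadcast a into both lanes */
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_load_pd(&x[i]);      /* load x[i], x[i+1] */
            __m128d vy = _mm_load_pd(&y[i]);      /* load y[i], y[i+1] */
            vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
            _mm_store_pd(&y[i], vy);              /* store y[i], y[i+1] */
        }
    }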


Compiler Optimizations: -mkl on an Intel Core 2 Duo

[maya2_mkl.png]

Strangely enough, Intel's MKL library does not do as well as simply compiling with -O3. The Math Kernel Library (MKL) provides high-performance versions of several vector and matrix operations. MKL should be an example of what the BLAS were intended for: letting vendors provide high-performance implementations, without the user needing to change code from one machine to another.
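
For comparison, calling the library versions of these kernels through the standard C BLAS interface (which MKL supplies) might look like the following; cblas_daxpy and cblas_ddot are the standard double-precision BLAS entries, though the exact calls used in these experiments are not shown here.

    #include <stdio.h>
    #include <cblas.h>        /* MKL also exposes this interface via its own headers */

    #define N 5000

    static double x[N], y[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* y <- 3.0*x + y, the BLAS version of saxpy (double precision: daxpy) */
        cblas_daxpy(N, 3.0, x, 1, y, 1);

        /* dot products: dotpxy and dotpxx differ only in the argument choices */
        double dxy = cblas_ddot(N, x, 1, y, 1);
        double dxx = cblas_ddot(N, x, 1, x, 1);

        printf("dotpxy = %g, dotpxx = %g\n", dxy, dxx);
        return 0;
    }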

The actual values vary (especially the downward spikes that appear in the graphs above), but the overall conclusions are the same. Not only does the relative performance of the three vector operations change with the optimization used, but the range of performance changes as well. The scales extend from a low of 280 Mflop/sec with no optimization (-O0) up to about 4.6 Gflop/sec for the SSE3 and MKL library versions. Since the machine runs at 2.4 GHz, a single core completing two floating point operations per cycle could get at most 2 × 2.4 ≈ 4.8 Gflop/sec. So the last two runs achieved 4.6/4.8 ≈ 96% of the theoretical peak performance.