Each of the following graphs shows the performance (in Mflop/sec) of three different kernels:
The plots differ only in the compiler optimizations used, indicated in the heading above each one. Be aware that the range of the y-axis is not the same for all of the plots. All results are from a machine (hostname maya2) with 4 Gbytes of memory, a 4 Mbyte cache, and a 2.4 GHz Intel Core 2 Duo processor (not the "extreme" version, which has additional hyperthreading capabilities). The Intel 9.1 compiler was used on a C++ program with statically allocated arrays. Identical conclusions result when using a Fortran code with statically declared arrays.
This almost matches the results of the load/store analysis: the ratios of memory references to flops are
Now the rates are in the Gflop range. saxpy takes a nose dive before the dot products do, around n = 2000, where a vector of length n occupies 8*n = 16000 bytes. Since saxpy uses two vectors, the drop occurs at roughly 32 Kbytes, which is the size of the first-level data cache. The largest vector size of n = 5000 involves 8*n = 40 Kbytes, too small to exceed the second-level cache size of 4 Mbytes.
Optimization levels -O2 and -O3 are roughly the same as -O1.
Using the on-chip vector instruction set SSE3 gives another big boost to performance, from 1.6 Gflop/sec to about 4.5 Gflop/sec. The SSE3 instruction set provides hardware pipelining and vectorization.
Strangely enough, Intel's MKL library does not do as well as simply compiling with -O3. The Math Kernel Library (MKL) is a collection of high-performance versions of several vector and matrix operations. MKL should be an example of what the BLAS were intended for: letting vendors provide high-performance implementations, without the user needing to change code from one machine to another.
The actual values vary (especially with the spikes downward that appear in the graphs above), but the overall conclusions are the same. Not only does the relative performance of the three vector operations change with the optimization used, but so does the range of performance. Those scales extend from a low of 280 Mflop/sec with no optimization (-O0) up to about 4.6 Gflop/sec for the SSE3 and MKL library versions. Since the machine runs at 2.4 GHz and one core can complete two flops per cycle, it could get at most 2 × 2.4 = 4.8 Gflop/sec. So the last two runs achieved 4.6/4.8 ≈ 96% of the theoretical peak performance.