A simple puzzle to start with in P573

Last week I ran a simple "vector update" operation

y = y + α*x

where α is a double-precision scalar and x, y are vectors of doubles of length n = 100 million. This kind of vector update is common in scientific computing, easily accounting for over 10% of the floating-point operations done. The vector lengths may look excessive, but researchers now routinely work with systems 2-3 orders of magnitude larger.
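As a minimal sketch of the operation and of how a computational rate is derived from a timing, the Python below updates y in place and counts one multiply and one add per element, so 2n flops in total. The function names `daxpy` and `measured_flop_rate` are illustrative choices, not the code that was actually run, and the demo uses a much smaller n than the 100 million above. An interpreted loop like this is far slower than compiled code, so only the counting and timing pattern carries over.

```python
import time
from array import array


def daxpy(alpha, x, y):
    """Update y in place: y[i] = y[i] + alpha * x[i]."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


def measured_flop_rate(n, seconds):
    """Flops per second for a length-n update: one multiply and one add per element."""
    return 2.0 * n / seconds


if __name__ == "__main__":
    n = 1_000_000                 # assumption: reduced from 100 million to keep the demo short
    alpha = 2.0
    x = array("d", [1.0]) * n     # contiguous vectors of doubles
    y = array("d", [3.0]) * n

    start = time.perf_counter()
    daxpy(alpha, x, y)
    elapsed = time.perf_counter() - start

    print(f"n = {n}, time = {elapsed:.4f} s, "
          f"rate = {measured_flop_rate(n, elapsed):.3e} flops/sec")
```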

The computer I used has

The last item is true for every general-purpose CPU made in the last 15 years, and it means that in theory the peak performance in terms of floating-point operations (called "flops" from now on) is

Using just one core, the maximum computational rate on my 3.2 GHz computer should be

For n = 100,000,000 = 100 million, the actual measured computational rate was So my code is running at of the possible peak performance. That is abysmal; why did it happen?
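The comparison against peak is simple arithmetic, sketched here under stated assumptions. The 3.2 GHz clock rate comes from the text. The flops-per-cycle figure and the measured rate in the demo are hypothetical placeholders, not the values for the machine described, and the function names are illustrative.

```python
def peak_flop_rate(clock_hz, flops_per_cycle, cores=1):
    """Theoretical peak: clock rate times flops per cycle times number of cores."""
    return clock_hz * flops_per_cycle * cores


def fraction_of_peak(measured_rate, peak_rate):
    """Measured rate as a fraction of the theoretical peak."""
    return measured_rate / peak_rate


if __name__ == "__main__":
    clock_hz = 3.2e9           # from the text: a 3.2 GHz machine
    flops_per_cycle = 2        # hypothetical placeholder, not the actual machine's value
    measured = 1.0e8           # hypothetical placeholder for a measured rate

    peak = peak_flop_rate(clock_hz, flops_per_cycle, cores=1)
    print(f"one-core peak:    {peak:.3e} flops/sec")
    print(f"fraction of peak: {fraction_of_peak(measured, peak):.2%}")
```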


Some possible reasons why the vector update code is slow:

  1. Not enough memory

  2. Timing perturbations from other jobs

  3. Timing resolution and/or overhead of calling the timer

  4. The code was wrong.

  5. Professors can't write halfway decent code

  6. The relatively low performance is intrinsic in the operation itself, and will happen on any computer in any language using any standard compiler and with any reasonable implementation.


The last item is what causes the abysmal performance here. It seems impossible to establish this, since it is a claim about all existing and future computers, but it is actually pretty easy. Among other things, in this course you will develop the skills and tools to identify when poor performance is intrinsic to the computational job and, if it isn't, how to go about improving things.
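Items 2 and 3 in the list above can be checked directly. The sketch below estimates the timer's resolution and per-call overhead, then takes the minimum over several repeated timings, since the minimum is the measurement least disturbed by other jobs. It uses Python's `time.perf_counter`. The helper names are illustrative, and the overhead figure is only approximate because it also includes the cost of the loop itself.

```python
import time


def timer_resolution(samples=1000):
    """Smallest nonzero step seen between consecutive timer readings."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:               # spin until the clock visibly advances
            t1 = time.perf_counter()
        smallest = min(smallest, t1 - t0)
    return smallest


def timer_overhead(calls=100_000):
    """Approximate average cost of one timer call (loop overhead included)."""
    start = time.perf_counter()
    for _ in range(calls):
        time.perf_counter()
    return (time.perf_counter() - start) / calls


def best_time(func, repeats=5):
    """Minimum elapsed time over several runs of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"timer resolution ~ {timer_resolution():.2e} s")
    print(f"timer overhead   ~ {timer_overhead():.2e} s per call")
    print(f"best of 5 runs   = {best_time(lambda: sum(range(1_000_000))):.4f} s")
```

If the timed operation runs many orders of magnitude longer than both the resolution and the overhead, the timer can be ruled out as the explanation.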