Timing Simple Operations

Mostly we will be timing big, meaty chunks of computation, if nothing else by repeating operations many times within a timing block. For such long runs the problem is often not the timer's resolution, but the risk of overflowing the counter where the timer's data is held. Programs that take days to run are more the rule than the exception, and the parts you are really interested in are the ones that take the most time (duh).

Nevertheless it is instructive to figure out how to time really small, fast things like simple operations. In some cases of developing high performance code you will want to know whether to replace the multiplication 2*x with x+x, which in turn means knowing the relative costs of a multiply and an add. [In fact, doing the "operation reduction" of 2*x into x+x has not been worth doing for many, many years now. But myth and legend lay a heavy weight on the area of scientific computing, so you'll still see some people doing it unnecessarily.]

A simple operation like floating point addition will invariably take much less time than the clock resolution and overhead. If not, update your computer from the pre-1990 dinosaur you're apparently using. Repeating the operation in a loop helps, but the loop itself does many other things whose cost is significant compared to the simple operation.

For example, the loop index must be initialized, incremented, and tested on each iteration. Furthermore, there is a "branch" instruction at the heart of every loop (typically buried in the assembly language your code generates) - and branches can be far more expensive than a floating point add.
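
To get a feel for how coarse the timer is compared to a single operation, here is a minimal Python sketch that estimates the smallest tick the timer reports and the average cost of calling it. The function name timer_overhead is hypothetical, and time.perf_counter merely stands in for whatever mytime() is on your system. The mean cost also includes the overhead of the Python loop itself, so treat it as an upper bound.

```python
import time


def timer_overhead(samples=100_000):
    """Return (smallest nonzero tick, mean cost per timer call), both in seconds."""
    smallest_tick = float("inf")
    start = time.perf_counter()
    previous = start
    for _ in range(samples):
        now = time.perf_counter()
        delta = now - previous
        # Record the smallest step the timer was actually seen to take.
        if 0.0 < delta < smallest_tick:
            smallest_tick = delta
        previous = now
    # Upper bound: this average includes the loop bookkeeping too.
    mean_cost = (previous - start) / samples
    return smallest_tick, mean_cost


if __name__ == "__main__":
    tick, cost = timer_overhead()
    reported = time.get_clock_info("perf_counter").resolution
    print(f"reported resolution : {reported:.3e} s")
    print(f"smallest tick seen  : {tick:.3e} s")
    print(f"mean cost per call  : {cost:.3e} s")
```

Compare those numbers with the roughly nanosecond cost of one floating point add and it is clear why a single operation cannot be timed directly.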

The key idea is to time two loops with the same large number of repetitions, each with the loop body containing the same statement replicated several times. If the two bodies differ only in how many copies of the statement they contain, then subtracting one loop's time from the other's will also subtract off the overhead time of setting up and managing the loop - branches and all. Here is the technique for addition:

   time_1 = mytime()
   for k = 1, ... , repetitions
       x = x + y 
         .
         .
         .
       x = x + y
   End for
   time_2 = mytime() 
   for k = 1, ... , repetitions
       x = x + y 
         .
         .
         .
       x = x + y 
   End for
   time_3 = mytime() 
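where the first loop body has the operation done k1 times and the second has it done k2 times, with k2 > k1. Then subtract the first loop's time from the second's,
        (time_3 - time_2) - (time_2 - time_1)
to get the cost of doing
        repetitions*(k2 - k1)
of the simple operation. Dividing the time difference by that count gives the cost of one operation.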
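
The technique can be sketched in Python as follows. This is only an illustration under stated assumptions: make_loop and time_add are hypothetical names, time.perf_counter stands in for mytime(), and the defaults for repetitions, k1 and k2 are arbitrary choices, not recommendations. In an interpreted language the "cost of an add" also includes the interpreter's per-statement dispatch, so the absolute number says little about the hardware. The differencing idea is the same, though.

```python
import time


def make_loop(k):
    """Build a function whose loop body holds k copies of 'x = x + y'."""
    body = "\n".join("        x = x + y" for _ in range(k))
    src = (
        "def loop(repetitions, x, y):\n"
        "    for _ in range(repetitions):\n"
        f"{body}\n"
        "    return x\n"
    )
    namespace = {}
    exec(src, namespace)  # generate the replicated body instead of typing it out
    return namespace["loop"]


def time_add(repetitions=100_000, k1=4, k2=16, x=0.0, y=1e-9):
    """Estimate seconds per 'x = x + y' by differencing two loop timings."""
    loop1, loop2 = make_loop(k1), make_loop(k2)
    time_1 = time.perf_counter()
    loop1(repetitions, x, y)
    time_2 = time.perf_counter()
    loop2(repetitions, x, y)
    time_3 = time.perf_counter()
    # Loop setup, index updates and branches appear in both timings and cancel.
    extra_ops = repetitions * (k2 - k1)
    return ((time_3 - time_2) - (time_2 - time_1)) / extra_ops


if __name__ == "__main__":
    print(f"estimated cost per add: {time_add():.3e} s")
```

With too few repetitions the result is dominated by noise and can even come out negative, which is one reason the questions below matter.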

Problems you need to handle when doing this:

  1. When making multiple timings of the above loops, what is the appropriate quantity to use: min, max, or average time?
  2. What values of repetitions, k1 and k2 should be used?
  3. What about potential overflow, since the operation is being repeated many times? Note that if we start with x and y fairly large above, this is a risk.
  4. You must also guard against underflow, which is even more insidious. If you are repeatedly scaling a number by 0.9, after a while it will be smaller than what can be represented by an IEEE normalized number. Most compilations are set up by default to just slip into denormalized numbers without warning you - and on many systems denormalized number arithmetic is handled by software or a slow microcode path rather than the fast hardware path, so it can be far slower than the operation you meant to time.
  5. What about pipelining (for those who know what that is)? Note that the sample code above will prevent pipelining (on purpose), by making x appear on both sides of the assignment.
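
To see how quickly the underflow problem in item 4 bites, here is a small Python sketch that counts how many multiplications by 0.9 it takes for a double precision value starting at 1.0 to drop below the smallest normalized number. The function name steps_until_subnormal is hypothetical, and the comparison assumes Python floats are IEEE doubles, which is the case on essentially all current platforms.

```python
import sys


def steps_until_subnormal(x=1.0, scale=0.9):
    """Count multiplications by `scale` until x falls below the smallest normalized double."""
    smallest_normal = sys.float_info.min  # smallest positive normalized double
    steps = 0
    while x >= smallest_normal:
        x *= scale
        steps += 1
    return steps, x


if __name__ == "__main__":
    steps, value = steps_until_subnormal()
    print(f"denormalized after {steps} multiplications, x = {value!r}")
    # Compare with the operation count of a typical timing run (example values).
    repetitions, k = 100_000, 16
    print(f"a timing run performs {repetitions * k} multiplications")
```

A few thousand multiplications is a tiny fraction of a timing run, so almost all of the run would be spent on denormalized (and eventually zero) values unless the operands are chosen to keep the result in range.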

Another point: this page was originally written in 1992, but has been updated several times. Over that time, CPU hardware has gotten much better at handling the branching that occurs in any loop, using speculative execution, branch prediction, etc. So the claim above that a branch can be far more expensive than a floating point add may well be completely wrong. Anyone want to develop some code that can experimentally determine this? And no, I won't trust vendor documentation or claims - it needs to be demonstrable by crafting and running some test independent of the vendor.