Timing Simple Operations

Mostly we will be timing big, meaty chunks of computation, if nothing else by repeating operations many times within a timing block. Sometimes the problem is not clock resolution, but the possibility of overflowing the counter in which the timer's value is held. Programs that take days to run are more the rule than the exception, and the parts you are really interested in are the ones that take the most time (well, duh).

Nevertheless it is instructive to figure out how to time really small, fast things like simple operations. In some cases when developing high performance code you will want to know whether it pays to trade the multiplication 2*x for the addition x+x, which in turn means knowing the relative costs of a multiply and an add. [In practice, this "operation reduction" of 2*x into x+x has not been worth doing for many, many years now. But myth and legend lay a heavy weight on the area of scientific computing, often propagated by CS professors who really should retire and yell at kids to stay off their lawn. However, you'll still see it done in some codes because it was useful at the time they were written.]

A simple operation like a floating point addition will invariably take much less time than the clock resolution and overhead. If not, update your computer from the pre-1990 dinosaur you're apparently using.
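How big are that resolution and overhead? You can measure them directly. Here is a minimal sketch in C, assuming a POSIX system where clock_gettime() with CLOCK_MONOTONIC is available; it reads the clock back-to-back many times and reports the smallest nonzero difference between consecutive readings, a rough combined estimate of the clock's resolution and the cost of calling it:

   #include <stdio.h>
   #include <time.h>

   /* Read the clock twice in quick succession, many times, keeping the
      smallest nonzero gap seen. That gap bounds what this timer can
      resolve, including the cost of the call itself. */
   int main(void)
   {
       struct timespec a, b;
       double smallest = 1.0e9;    /* seconds; start absurdly large */

       for (int i = 0; i < 1000000; i++) {
           clock_gettime(CLOCK_MONOTONIC, &a);
           clock_gettime(CLOCK_MONOTONIC, &b);
           double gap = (b.tv_sec - a.tv_sec) + 1.0e-9*(b.tv_nsec - a.tv_nsec);
           if (gap > 0.0 && gap < smallest)
               smallest = gap;
       }
       printf("smallest measurable gap: %g seconds\n", smallest);
       return 0;
   }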

Timing several small operations in a loop has its own problem: many other things are going on in the loop, each of significant cost compared to the simple operation itself. For example, the loop index must be initialized, and then incremented and tested on every iteration. Furthermore, there is a "branch" instruction at the heart of every loop (typically buried in the assembly language your compiler generates) - and a branch can be far more expensive than a floating point add.
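To see that this loop machinery has measurable cost, time a loop whose body does essentially nothing. A sketch under the same POSIX assumption; the volatile variable is there only to keep an optimizing compiler from deleting the loop outright:

   #include <stdio.h>
   #include <time.h>

   /* Time a loop with a trivial body; what remains is mostly the cost of
      the increment, the test, and the branch on each iteration. */
   int main(void)
   {
       const long n = 100000000L;   /* 1e8 iterations */
       volatile long sink = 0;      /* defeats dead-code elimination */
       struct timespec t1, t2;

       clock_gettime(CLOCK_MONOTONIC, &t1);
       for (long k = 0; k < n; k++)
           sink = k;                /* trivial body */
       clock_gettime(CLOCK_MONOTONIC, &t2);

       double secs = (t2.tv_sec - t1.tv_sec) + 1.0e-9*(t2.tv_nsec - t1.tv_nsec);
       printf("about %g seconds per iteration of loop overhead\n", secs/n);
       return 0;
   }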

The key idea is to time two loops, each with lots of repetitions, where each loop body contains the same statement replicated several times. If the two loops differ only in the number of replications within the body, then subtracting their times will also subtract off the overhead of setting up and managing the loop - branches and all. Here is the technique for addition:

   time_1 = get_current_time()

   for k = 1, ... , repetitions
       x = x + y       (k1 copies of this statement)
         .
         .
         .
       x = x + y
   end for

   time_2 = get_current_time() 

   for k = 1, ... , repetitions
       x = x + y       (k2 copies of this statement)
         .
         .
         .
       x = x + y 
   end for

   time_3 = get_current_time() 
where the first loop body has the operation replicated k1 times and the second has it replicated k2 times, with k2 > k1. Subtracting the two loop timings leaves the time for the extra operations in the second loop, of which there are
        number_ops = repetitions*(k2 - k1).
The cost (in time) of a single operation can then be found with some simple arithmetic on time_1, time_2, and time_3; you should derive the formula yourself to make sure you understand the methodology.
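Once you have derived the formula, you can check it against running code. Here is one possible realization in C - a sketch, not the definitive implementation - again assuming POSIX clock_gettime(); the values of REPETITIONS and y are arbitrary choices for illustration, and k1 = 8 and k2 = 16 copies are written out by hand. The volatile on x keeps the compiler from collapsing the replicated statements, and the code should be compiled without optimization (see the problems list below).

   #include <stdio.h>
   #include <time.h>

   #define REPETITIONS 10000000L   /* arbitrary; large enough to swamp clock resolution */

   static double get_current_time(void)   /* wall-clock seconds */
   {
       struct timespec t;
       clock_gettime(CLOCK_MONOTONIC, &t);
       return t.tv_sec + 1.0e-9*t.tv_nsec;
   }

   int main(void)
   {
       const int k1 = 8, k2 = 16;
       volatile double x = 0.0;    /* volatile: prevent optimizing the adds away */
       double y = 1.0e-8;          /* small enough that x cannot overflow */

       double time_1 = get_current_time();
       for (long k = 0; k < REPETITIONS; k++) {    /* body: k1 = 8 copies */
           x = x + y; x = x + y; x = x + y; x = x + y;
           x = x + y; x = x + y; x = x + y; x = x + y;
       }
       double time_2 = get_current_time();
       for (long k = 0; k < REPETITIONS; k++) {    /* body: k2 = 16 copies */
           x = x + y; x = x + y; x = x + y; x = x + y;
           x = x + y; x = x + y; x = x + y; x = x + y;
           x = x + y; x = x + y; x = x + y; x = x + y;
           x = x + y; x = x + y; x = x + y; x = x + y;
       }
       double time_3 = get_current_time();

       /* The loop overhead cancels in the difference of the two timings. */
       double t_add = ((time_3 - time_2) - (time_2 - time_1))
                      / ((double)REPETITIONS * (k2 - k1));
       printf("time per floating point add: about %g seconds\n", t_add);
       return 0;
   }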

Problems to consider when doing this include:

  1. When making multiple timings of the above loops, what is the appropriate quantity to use: min, max, or average time?
  2. What values of repetitions, k1, and k2 should be used? I use k1 = 8 and k2 = 16, which seem to work well.
  3. What about potential overflow, since the operation is being repeated many times? If the initial values of x and y in the loops above are fairly large, this is a real risk.
  4. Underflow is even more insidious than overflow. If a number is repeatedly scaled by 0.9, after a while it will be smaller than what can be represented by an IEEE normalized number. Most compilers are set up by default to just slip into denormalized numbers without warning you - and denormalized number arithmetic is handled by software, not hardware. A small demonstration of this appears after this list.
  5. What about pipelining (for those who know what that is)? The sample code above prevents pipelining on purpose: because x appears on both sides of the assignment, each statement depends on the result of the previous one. Still, it is necessary to turn off all compiler optimizations when compiling code to time simple operations.
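Item 4 is easy to demonstrate for yourself. Here is a minimal sketch using the C99 fpclassify() macro, which reports whether a double is normal, subnormal (denormalized), or zero:

   #include <stdio.h>
   #include <math.h>

   /* Repeatedly scale by 0.9; the value silently slides from normal to
      denormalized (subnormal) numbers and finally to zero - no warning. */
   int main(void)
   {
       double x = 1.0;
       long steps = 0;

       while (fpclassify(x) == FP_NORMAL) {
           x *= 0.9;
           steps++;
       }
       printf("denormalized after %ld multiplies: x = %g\n", steps, x);

       while (fpclassify(x) == FP_SUBNORMAL) {
           x *= 0.9;
           steps++;
       }
       printf("underflowed to zero after %ld multiplies\n", steps);
       return 0;
   }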

The material on this page was originally written in 1992, but has been updated several times. Over that time, CPU hardware and run-time systems have gotten much better at handling the branching that occurs in any loop, using speculative execution, branch prediction, etc. So the claim about a branch taking far longer than updating the loop counter may well be completely wrong by now. The timing techniques covered here give you what you need to write a small toolkit that experimentally determines those tradeoff costs. Don't trust vendor documentation or claims - it is far more reliable to write and run a small test code.