Nevertheless, it is instructive to figure out how to time really small, fast things like simple operations. When developing high-performance code you will sometimes want to know whether to replace the multiplication 2*x with the addition x+x, which in turn means knowing the relative costs of a multiply and an add. [In fact, this "operation reduction" of 2*x into x+x has not been worth doing for many, many years now. But myth and legend lay a heavy weight on the area of scientific computing, so you'll still see some people doing it unnecessarily.]
A simple operation like floating point addition will invariably take much less time than the clock resolution and the overhead of reading the clock. If not, upgrade your computer from the pre-1990 dinosaur you're apparently using. Timing several small operations in a loop has its own problem: many other things are going on in the loop, and their cost is significant compared to that of the simple operation.
For example, the loop index must be initialized once, then incremented and tested on each iteration. Furthermore, there is a "branch" instruction at the heart of every loop (typically buried in the assembly language generated from your code) - and branches can be far more expensive than a floating point add.
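To get a feel for how coarse the clock is compared with a single add, the following minimal sketch estimates the clock's smallest observable tick and the cost of reading it. It assumes Python's time.perf_counter() as a stand-in for the timer; the function names timer_overhead and smallest_tick are made up for this illustration. The overhead figure includes the cost of the measuring loop itself, so treat it as an upper bound.

```python
import time


def timer_overhead(samples=100_000):
    """Average time per back-to-back clock read (includes loop overhead)."""
    start = time.perf_counter()
    for _ in range(samples):
        time.perf_counter()
    stop = time.perf_counter()
    return (stop - start) / samples


def smallest_tick(samples=100_000):
    """Smallest nonzero difference seen between consecutive clock reads."""
    smallest = float("inf")
    previous = time.perf_counter()
    for _ in range(samples):
        current = time.perf_counter()
        if current != previous:
            smallest = min(smallest, current - previous)
            previous = current
    return smallest


if __name__ == "__main__":
    info = time.get_clock_info("perf_counter")
    print("reported resolution (s):", info.resolution)
    print("smallest observed tick (s):", smallest_tick())
    print("approx. cost per clock read (s):", timer_overhead())
```

Both numbers are typically far larger than the time for one floating point add, which is why a single add cannot be timed directly.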
The key idea is to time two loops with the same large number of repetitions, each with a loop body containing the same statement replicated several times. If the two bodies differ in the number of copies of the statement, then subtracting one loop's time from the other's will also subtract off the overhead time of setting up and managing the loop - branches and all. Here is the technique for addition:
    time_1 = mytime()
    for k = 1, ... , repetitions
        x = x + y
        .
        .
        .
        x = x + y
    End for
    time_2 = mytime()
    for k = 1, ... , repetitions
        x = x + y
        .
        .
        .
        x = x + y
    End for
    time_3 = mytime()
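where the first loop body contains k1 copies of the operation and the second contains k2 copies, with k2 > k1. (The loop index k is unrelated to k1 and k2.) Then subtract the first loop's elapsed time from the second's,
    (time_3 - time_2) - (time_2 - time_1)
to get the cost of doing
    repetitions*(k2 - k1)
of the simple operation. Dividing the time difference by that count gives the cost of a single operation.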
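The differencing technique above can be sketched in Python as follows. This is a minimal illustration, not a tuned benchmark: time.perf_counter() stands in for mytime(), the values k1 = 2 and k2 = 6 and the names per_operation_cost and time_addition are assumptions chosen for the example, and in an interpreted language the "operation" being measured is the whole statement x = x + y, interpreter work included.

```python
import time

K1 = 2  # copies of the statement in the first loop body (assumed value)
K2 = 6  # copies of the statement in the second loop body (assumed value)


def per_operation_cost(time_1, time_2, time_3, repetitions, k1, k2):
    """Estimate the cost of one operation; loop overhead cancels in the difference."""
    extra_time = (time_3 - time_2) - (time_2 - time_1)
    return extra_time / (repetitions * (k2 - k1))


def time_addition(repetitions):
    """Time two loops whose bodies differ only in how many adds they contain."""
    x, y = 0.0, 1.0e-9
    time_1 = time.perf_counter()
    for _ in range(repetitions):  # body holds K1 = 2 copies
        x = x + y
        x = x + y
    time_2 = time.perf_counter()
    for _ in range(repetitions):  # body holds K2 = 6 copies
        x = x + y
        x = x + y
        x = x + y
        x = x + y
        x = x + y
        x = x + y
    time_3 = time.perf_counter()
    cost = per_operation_cost(time_1, time_2, time_3, repetitions, K1, K2)
    return cost, x  # x is returned so the additions have a visible result


if __name__ == "__main__":
    cost, x = time_addition(200_000)
    print("estimated cost of one 'x = x + y' (s):", cost, " final x:", x)
```

With too few repetitions the estimate is dominated by timing noise and can even come out negative, so in practice the repetition count is raised until repeated runs agree.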
Problems you need to handle when doing this:
Another point: this page was originally written in 1992, but has been updated several times. Over that time, CPU hardware has gotten much better at handling the branching that occurs in any loop, using speculative execution, branch prediction, etc. So the claim that a branch takes far longer than updating the loop counter, etc., may well be completely wrong. Anyone want to develop some code that can experimentally determine this? And no, I won't trust vendor documentation or claims - it needs to be demonstrable by crafting and running some test independent of the vendor.