A simple puzzle to start with in P573
Last week I ran a simple "vector update" operation
y = y + α*x
where α is a double precision scalar and x, y are vectors of doubles
of length n = 100 million. This kind of vector update operation is
common in scientific computing, easily accounting for over 10% of the
floating point operations done. The vector lengths may look excessive, but
nowadays researchers are routinely working with systems 2-3 orders of
magnitude larger.
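For concreteness, here is a minimal sketch of the vector update in Python. It is not the code used for the run described here (the language of that code is not stated); the names axpy and axpy_flops are hypothetical, and counting one multiply plus one add per element, so 2n flops per update, is an assumption consistent with the operation as written.

```python
def axpy(alpha, x, y):
    """Update y in place: y[i] = y[i] + alpha * x[i] for every i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(y)):
        y[i] += alpha * x[i]  # one multiply and one add per element
    return y


def axpy_flops(n):
    """Flops in one update of length n, assuming 2 flops per element."""
    return 2 * n


# Tiny illustration; the run described in the text used n = 100 million.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
axpy(0.5, x, y)
print(y)                    # [10.5, 21.0, 31.5]
print(axpy_flops(len(y)))   # 6
```

Under that counting, a single update with n = 100 million performs 200 million flops.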
The computer I used has
- Quad-core Intel processor
- 3.2 GHz CPU
- 4 GB memory
- "Fused multiply-add" floating point unit, which does one 64-bit
floating point add and one multiply per clock cycle
The last item is true for every general-purpose CPU made in the last 15 years,
and it means that in theory the peak performance in terms of floating
point operations (called "flops" from now on) is
- 2 * (GHz rating of the CPU) * 10^9 flops per second per core.
Using just one core, the maximum computational rate on my 3.2 GHz computer
should be
- 2 flop/cycle * 3.2e9 cycle/sec = 6.4e9 flop/sec = 6.4 Gflop/sec
For n = 100,000,000 = 100 million, the actual measured computational rate was about
- 370e6 flop/sec = 370 Mflop/sec
So my code is running at
- 370e6/6.4e9 = 0.057813 ≈ 1/17
of the possible peak performance. That is abysmal; why did it happen?
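The arithmetic above can be checked with a short Python sketch. The function names peak_flops and fraction_of_peak are hypothetical; the inputs (3.2 GHz, 2 flops per cycle, 370e6 flop/sec measured) are the figures quoted in the text.

```python
def peak_flops(clock_hz, flops_per_cycle=2):
    """Theoretical peak rate of one core, in flops per second."""
    return flops_per_cycle * clock_hz


def fraction_of_peak(measured_flops, peak):
    """Measured rate as a fraction of the theoretical peak."""
    return measured_flops / peak


peak = peak_flops(3.2e9)               # one core at 3.2 GHz
frac = fraction_of_peak(370e6, peak)   # measured rate quoted in the text

print(f"peak     = {peak / 1e9:.1f} Gflop/sec")   # peak     = 6.4 Gflop/sec
print(f"fraction = {frac:.4f}")                   # fraction = 0.0578
print(f"1/frac   = {1 / frac:.1f}")               # 1/frac   = 17.3
```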
Some possible reasons why the vector update code is slow:
- Not enough memory
  - During the run, the system had 1.1 Gbytes free, over 27% of the 4 GB of memory
- Timing perturbations from other jobs
  - Load averages via the "uptime" command:
    before starting, all 0.0
    afterwards, 1.09, 0.73, 0.51
  - Vector update job was allocated 99.9% of the processor.
    Using the "time" command shows the job had
    - 217.002u 1.692s 3:38.73 99.9% 0+0k 0+0io 0pf+0w
    Reading this admittedly cryptic message also shows there was
    no I/O and no page faults.
  - Network was shut down, no window manager (console mode), no journaling
    file system, no monitoring jobs running. I.e., about as
    pristine an environment as possible.
  - Only one user was logged in. Me.
  - The only I/O was to write out the time and rate, and that was done outside
    of the timing block
- Timing resolution and/or overhead of calling the timer
  - Ran 307 repetitions inside a single timing block, which took
    about 216.79 seconds between calls to the timer/clock. Could use
    a wristwatch to measure that long a time interval. Could
    probably have used an hourglass or water clock.
- Bad choice of programming language and compiler options
- The code was wrong.
  - Always a possibility. Although the operation takes only a few lines of code, the
    ability of humans to screw up is unlimited.
- Professors can't write halfway decent code
  - The ability of professors to screw up is even more unlimited than that of
    other humans.
- The relatively low performance is intrinsic in the operation
  itself, and will happen on any computer in any language using
  any standard compiler and with any reasonable implementation.
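Two of the checks in the list above can be sketched in code: timing many repetitions inside one timing block, and reading the CPU share off the "time" output. This is a minimal Python sketch, not the code used for the run. The names measure_rate and cpu_utilization are hypothetical, counting 2 flops per element is an assumption, and the field layout of the "time" line (user seconds, system seconds, elapsed time, percent CPU) is assumed to be the csh/tcsh default.

```python
import time


def measure_rate(n, reps, alpha=2.0):
    """Time `reps` vector updates of length n inside one timing block.

    Returns (elapsed_seconds, flops_per_second), assuming 2 flops per element.
    """
    x = [1.0] * n
    y = [0.0] * n
    start = time.perf_counter()          # timer is called only twice
    for _ in range(reps):
        for i in range(n):
            y[i] += alpha * x[i]
    elapsed = time.perf_counter() - start
    # Any printing happens after this point, outside the timing block.
    return elapsed, 2.0 * n * reps / elapsed


def cpu_utilization(time_line):
    """Return (user + system) / elapsed from a csh-style `time` line.

    Assumes the first three fields are user seconds ("...u"),
    system seconds ("...s"), and elapsed time as [h:]m:s.
    """
    fields = time_line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    elapsed = 0.0
    for part in fields[2].split(":"):    # handles m:s and h:m:s
        elapsed = 60.0 * elapsed + float(part)
    return (user + system) / elapsed


elapsed, rate = measure_rate(n=2000, reps=5)   # tiny sizes, for illustration
print(f"elapsed = {elapsed:.6f} s, rate = {rate:.3e} flop/sec")

line = "217.002u 1.692s 3:38.73 99.9% 0+0k 0+0io 0pf+0w"
print(f"CPU share = {cpu_utilization(line):.4f}")   # CPU share = 0.9998
```

The rate printed for the tiny Python run says nothing about the machine's real speed; the point is only the shape of the measurement. The CPU share computed from the quoted line is consistent with the 99.9% that "time" reports.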
The last reason in the list, that the low performance is intrinsic to the
operation, is what causes the abysmal performance here.
It seems impossible to establish this, since it is a
claim about all existing and future computers, but it is
actually pretty easy. Among other things, in this course you will
develop the skills and tools to identify when poor performance
is intrinsic to the computational job and, if it isn't,
how to go about improving things.
- Started: 28 Aug 2009, 10:26:58
- Last Modified: Mon 31 Aug 2009, 11:21 AM