Pipelining
Pipelining is pretty much what it sounds like, although a better image
is probably an "assembly line", where an object moves along a conveyor
belt through stations, and something is done at each station. At any one
time several objects are being worked on, with different operations being
performed on them at each stage.
Instruction Pipelining
Most computer operations consist of multiple stages. E.g., addition requires
comparing exponents, shifting mantissas, adding mantissas, then normalizing
the result. Pipelining helps keep the hardware components executing those
stages simultaneously busy. First consider instruction pipelining.
-
Traditional model of computer instruction cycle:
-
Instruction fetched
-
Instruction decoded
-
Operands for instruction fetched from memory
-
Instruction executed
-
Results written back to memory
-
Machines have fundamental clock cycle; each stage takes at least one clock
cycle (usually on order of tens of nanoseconds).
-
Speed this up using an assembly line (instruction pipelining):
Instruction Pipelining vs. Time
Problem with instruction pipelining is that a "conditional branch" prevents
knowing
which instruction is next. For example:
for k = 1:n
y(k) = y(k) + a*x(k)
end for
s = 3
is turned into sequence of instructions
k = 1
(*) load reg1 with x(k)
load reg2 with y(k)
reg2 = a*reg1 + reg2
store reg2 into y(k)
k = k + 1
if (k < n or k = n) go to (*)
s = 3
Here reg1, reg2 are registers. Which instruction will follow the "if" statement?
That is unknown until after the test is evaluated, introducing a "bubble"
into the pipeline.
One common technique in computational science to help enhance pipelining
is loop unrolling. This turns the above code into
for k = 1:mod(n,4)
y(k) = a*x(k) + y(k)
end for
for k = mod(n,4)+1:n in steps of 4
y(k) = a*x(k) + y(k)
y(k+1) = a*x(k+1) + y(k+1)
y(k+2) = a*x(k+2) + y(k+2)
y(k+3) = a*x(k+3) + y(k+3)
end for
s = 3
What does this do?
-
Increases amount of work that can be done before branch encountered.
-
Decreases the number of branches by factor of 4
-
Decreases the loop control arithmetic
-
Exposes computations; allows the compiler to make better utilization of
registers.
-
Requires the clean-up loop. I put it in the front to help explain
why this is sometimes called a "preconditioning" loop.
When is loop unrolling going to be a loser in terms of efficiency? Why
on earth doesn't the optimizing compiler do this for you? After all, this
is a simple mindless rearrangment of instructions which don't depend on
one another.
Another example: the inner product of vectors x and y.
s = 0.0
do k = 1, n
s = s + x(k)*y(k)
end do
Straightforward unrolling turns the code into
s = 0.0
do k = 1, mod(n,4)
s = s + x(k)*y(k)
end do
do k = mod(n,4)+1, n, 4
s = s + x(k )*y(k )
. + x(k+1)*y(k+1)
. + x(k+2)*y(k+2)
. + x(k+3)*y(k+3)
end do
This reduces loop overhead, but still won't execute rapidly. The problem
is the dependency between iterations: an iteration cannot begin before
the earlier one is completely finished and s is written, since the value
of s used is not known until then. So try again:
s = 0.0
s1 = 0.0
s2 = 0.0
s3 = 0.0
do k = 1, mod(n,4)
s = s + x(k)*y(k)
end do
do k = mod(n,4)+1, n, 4
s = s + x(k )*y(k )
s1 = s1 + x(k+1)*y(k+1)
s2 = s2 + x(k+2)*y(k+2)
s3 = s3 + x(k+3)*y(k+3)
end do
s = s + s1 + s2 + s3
This runs faster, but there is is something "wrong" with this that wasn't
wrong with previous example. In particular, an optimizing compiler cannot
make this change without risking changing the results of your computations.
What is the problem here? More importantly,
does pipelining really help much?
Effects of Unrolling the Dotproduct Operation
Above, the computational rate in Mflops per second is plotted against
the length of the vector on which a daxpy (vector update) operation is
performed. Note that unrolling really did improve the performance,
but in order to show such effects I had to compile the code with absolutely
no optimizations allowed - which is why both the rolled and unrolled versions
have such abysmal performance on a machine with theoretical peak performance
of 50 Mflops/sec!
So why bother with this if compilers take care of it so well?
Pipelining can occur anywhere, and not just in small loops. For example, a code
for solving linear systems of equations with a tridiagonal coefficient
matrix stores the three diagonals in vectors c(), d(), and e(). During
the solve, pivoting involves swapping entries in the vectors, which in
one sedimentary basin modeling package is done as:
t = c(kp1)
c(kp1) = c(k)
c(k) = t
t = d(kp1)
d(kp1) = d(k)
d(k) = t
t = e(kp1)
e(kp1) = e(k)
e(k) = t
The problem is that the temporary variable t is used in three
consecutive blocks. Instead,
use three temporaries:
t1 = c(k+1)
c(k+1) = c(k)
c(k) = t1
t2 = d(k+1)
d(k+1) = d(k)
d(k) = t2
t3 = e(k+1)
e(k+1) = e(k)
e(k) = t3
This alone made the code run 30% faster. You should consider pipelining
effects, both within loops and elsewhere in your code, when trying to develop
high-performance kernels. The question is how much effort to put into that,
and
an analytical model
will help answer that.
Memory and Flop Pipelining
Instruction pipelining has several potential problems.
-
The stages take greatly differing amounts of time; operand fetch requires
memory access which may be several orders of magnitude larger.
-
Different stages may need the same resources simultaneously; e.g., address
calculation and operation execution may both need the add unit.
-
Interrupts.
Fortunately, in scientific computing you mostly have to just deal with
pipelining for repeated operations such as memory accesses and floating
point operations. These operations usually are not as difficult to pipeline
as instruction cycle. As an example consider a floating point add:
-
Consider z = x+y, with x = 2p x t1 and y = 2q
x t2
-
ti is the mantissa and p and q are the exponents
-
Generally the mantissa is normalized so that the leading bit is 1, and
the binary point precedes it (this is required by the IEEE standard covered
elsewhwere).
-
Typical stages in add:
-
C) Compare exponents
-
S) Shift mantissa
-
A) Add mantissas
-
N) Normalize result
Let's illustrate this with 4-bit floating point arithmetic:
C:
x = 20 x 0.1011
y = 21 x 0.1100
S:
x = 21 x 0.0101
y = 21 x 0.1100
A:
z = 21 x 1.0001
N:
z = 22 x 0.1000
Note that here "normalizing" the result means making a "1" bit the first
singificant bit (as the IEEE floating point representation requires).
However, here that 1 bit has been explicitly written.
Now if you have a sequence of n adds like this to do
z1 = x1 + y1
z2 = x2 + y2
. . .
. . .
. . .
zn = xn + yn
You get the following timing chart:
Cycle C S A N Out
1 x1,y1
2 x2,y2 x1,y1
3 x3,y3 x2,y2 x1,y1
4 x4,y4 x3,y3 x2,y2 x1,y1
5 x5,y5 x4,y4 x3,y3 x2,y2 z1
6 x6,y6 x5,y5 x4,y4 x3,y3 z2
7 x7,y7 x6,y6 x5,y5 x4,y4 z3
. . .
. . .
. . .
n xn,yn xn-1,yn-1 xn-2,yn-2 xn-3,yn-3 zn-4
n+1 xn,yn xn-1,yn-1 xn-2,yn-2 zn-3
n+2 xn,yn xn-1,yn-1 zn-2
n+3 xn,yn zn-1
n+4 zn
Some notes:
-
There is a period of time when the pipeline gets "filled", during which
no results appear. If the pipe has s stages, nothing comes out the first
s-1 clock cycles.
-
Similarly, the pipe has to be "drained" when the last operands are fed
in. This typically means putting in zero values to "push" the final result
out the pipe.
-
Same optimization techniques work here as for instruction pipelining.
-
Memory pipelining involves a sequence of memory addresses which can be
stacked up as requests to the memory system.
-
Most workstations now have floating point add and multiply pipes of depth
two or more. Here "depth" is the number of stages.
-
Pipelines are automatically set up by compilation. For operations like
inner product where order of operations may change, usually an "aggressive"
optimization option is required to get this to happen.
Pipelining Models
A timing model for a pipeline is easily developed, but at the cost of (temporarily)
ignoring what we already know about memory hierarchies. But one headache
at a time, right? Also, we will only model pipelines for operations (not
instruction pipelining) to make life easier. So we are dealing with vector
pipelining.
Starting up a pipeline costs time, and that needs to be accounted for.
Let
-
tau = clock period
-
n = vector length, assumed to be the number of operations.
-
s = number of segments in pipe
-
S = set up time
. This is the time it takes to reserve the pipeline and
if the pipelines are chained to coordinate with an upstream pipeline. On
most desktop machines S = 0, but on large-scale vector machines it is
nontrivial
Then the time to compute without pipelining on a vector of length n
-
ts = sequential time = n*s*tau
-
The time required with pipelining is
tp = piped time = setup + (time for first result) + one cycle
* (number remaining ops), so
-
tp = tau*S + tau*s + tau*(n-1) = t0 + tau*n, where
t0 = the startup time. More formally,
t0 = tau*S + tau*s -tau*(-1) so the model is linear in n.
This lumps together the constants; it is slightly easier to compute with n
rather than n-1.
-
The rate of sequential execution is rs = (num ops)/(sequential
time) = n/ts = 1/(tau*s).
-
The asymptotic rate of pipelined execution is
rinf = lim{n --> inf} n/tp = 1/tau
Let n½ = number of operations which could be done at
the asymptotic rate during a time equal to startup time. Then n½
= t0/tau (why?). So the time to carry out the complete process
on a vector of length n with pipelining is
tp(n) = (n + n½)/rinf,
a linear function in n. What is the x-intercept of this linear function?
Try graphing it, identifying all the quantities like intercepts and slope.
The rate of computation for a vector of length n is
rn = n/tp = (nrinf)/(n+n½)
= (rinf)/(1 + rho),
where rho = n½/n. So n½ is the
vector length for which half the asymptotic rate is achieved, explaining
its subscript.
How would you compute n½ and rinf for a
given machine? What problem might be encountered? Why did we only model
the start-up time, but not the time it takes to drain a pipeline?
Computing n½ and rinf
The model is
tp(n) = (n + n½)/rinf = r-1inf
n½ + r-1 inf n,
so measure times t1, t2, ... , tk
for k values of n: n1, n2, ... , nk. Each
``observation'' takes the form
ti = r-1 inf n½
+ r-1 inf ni,
which we can write as a linear system Ax = b, where A is the k × 2 matrix
|n1 1 |
|n2 1 |
A = |n3 1 |
|. . |
|. . |
|. . |
|nk 1 |
Here b is the k × 1 vector b = (t1, t2, ... , tk)T,
and x is the 2 × 1 vector ( r-1 inf r-1 inf
n½)T.
Since there are only two unknowns and k > 2 observations,
this is a least squares problem: find the value of x that minimizes the
two-norm of Ax -b. Matlab has an easy way of doing this: x = A\b. Then
rinf and n½ can be extracted from the
two components in the solution x.
- Started: 17 Aug 1995
- Updated: Fri Sep 14 08:50:58 EST 2001
- Updated: Sun 09 Sep 2007
- Last Modified: Wed 16 Sep 2009, 08:32 PM