Pipelining


Pipelining is pretty much what it sounds like, although a better image is probably an "assembly line", where an object moves along a conveyor belt through stations, and something is done at each station. At any one time several objects are being worked on, with different operations being performed on them at each stage.

Instruction Pipelining

Most computer operations consist of multiple stages. E.g., addition requires comparing exponents, shifting mantissas, adding mantissas, then normalizing the result. Pipelining helps keep the hardware components executing those stages simultaneously busy. First consider instruction pipelining.

Instruction Pipelining vs. Time

[pipeline.gif]

Problem with instruction pipelining is that a "conditional branch" prevents knowing which instruction is next. For example:

      for k = 1:n
         y(k) = y(k) + a*x(k)
      end for
      s = 3
is turned into sequence of instructions
     k = 1
(*)  load reg1 with x(k)
     load reg2 with y(k)
     reg2 = a*reg1 + reg2
     store reg2 into y(k)
     k = k + 1
     if (k < n or k = n) go to (*)
     s = 3
Here reg1, reg2 are registers. Which instruction will follow the "if" statement? That is unknown until after the test is evaluated, introducing a "bubble" into the pipeline.

One common technique in computational science to help enhance pipelining is loop unrolling.  This turns the above code into

      for k = 1:mod(n,4)
         y(k)   = a*x(k)   + y(k)
      end for

      for k = mod(n,4)+1:n in steps of 4
         y(k)   = a*x(k)   + y(k)
         y(k+1) = a*x(k+1) + y(k+1)
         y(k+2) = a*x(k+2) + y(k+2)
         y(k+3) = a*x(k+3) + y(k+3)
      end for
      s = 3
What does this do? When is loop unrolling going to be a loser in terms of efficiency? Why on earth doesn't the optimizing compiler do this for you? After all, this is a simple mindless rearrangment of instructions which don't depend on one another.

Another example: the inner product of vectors x and y.

      s = 0.0
      do k = 1, n
         s = s + x(k)*y(k)
      end do
Straightforward unrolling turns the code into
      s = 0.0
      do k = 1, mod(n,4)
         s  = s + x(k)*y(k)
      end do
      do k = mod(n,4)+1, n, 4
         s  = s + x(k  )*y(k  )
     .          + x(k+1)*y(k+1)
     .          + x(k+2)*y(k+2)
     .          + x(k+3)*y(k+3)
      end do
This reduces loop overhead, but still won't execute rapidly. The problem is the dependency between iterations: an iteration cannot begin before the earlier one is completely finished and s is written, since the value of s used is not known until then. So try again:
      s  = 0.0
      s1 = 0.0
      s2 = 0.0
      s3 = 0.0
      do k = 1, mod(n,4)
         s  = s + x(k)*y(k)
      end do
      do k = mod(n,4)+1, n, 4
         s   =  s + x(k  )*y(k  )
         s1  = s1 + x(k+1)*y(k+1)
         s2  = s2 + x(k+2)*y(k+2)
         s3  = s3 + x(k+3)*y(k+3)
      end do
      s = s + s1 + s2 + s3
This runs faster, but there is is something "wrong" with this that wasn't wrong with previous example. In particular, an optimizing compiler cannot make this change without risking changing the results of your computations. What is the problem here? More importantly, does pipelining really help much?

Effects of Unrolling the Dotproduct Operation

[unrolling.png]

Above, the computational rate in Mflops per second is plotted against the length of the vector on which a daxpy (vector update) operation is performed.  Note that unrolling really did improve the performance, but in order to show such effects I had to compile the code with absolutely no optimizations allowed - which is why both the rolled and unrolled versions have such abysmal performance on a machine with theoretical peak performance of 50 Mflops/sec!

So why bother with this if compilers take care of it so well?  Pipelining can occur anywhere, and not just in small loops. For example, a code for solving linear systems of equations with a tridiagonal coefficient matrix stores the three diagonals in vectors c(), d(), and e(). During the solve, pivoting involves swapping entries in the vectors, which in one sedimentary basin modeling package is done as:

    t      = c(kp1)
    c(kp1) = c(k)
    c(k)   = t

    t      = d(kp1)
    d(kp1) = d(k)
    d(k)   = t

    t      = e(kp1)
    e(kp1) = e(k)
    e(k)   = t
The problem is that the temporary variable t is used in three consecutive blocks. Instead, use three temporaries:
    t1     = c(k+1)
    c(k+1) = c(k)
    c(k)   = t1

    t2     = d(k+1)
    d(k+1) = d(k)
    d(k)   = t2

    t3     = e(k+1)
    e(k+1) = e(k)
    e(k)   = t3
This alone made the code run 30% faster. You should consider pipelining effects, both within loops and elsewhere in your code, when trying to develop high-performance kernels. The question is how much effort to put into that, and an analytical model will help answer that.

Memory and Flop Pipelining

Instruction pipelining has several potential problems. Fortunately, in scientific computing you mostly have to just deal with pipelining for repeated operations such as memory accesses and floating point operations. These operations usually are not as difficult to pipeline as instruction cycle. As an example consider a floating point add: Let's illustrate this with 4-bit floating point arithmetic:
  C:
       x  =  20 x 0.1011
       y  =  21 x 0.1100
  S:
       x  =  21 x 0.0101
       y  =  21 x 0.1100
  A:
       z  =  21 x 1.0001
  N:
       z  =  22 x 0.1000
Note that here "normalizing" the result means making a "1" bit the first singificant bit (as the IEEE floating point representation requires). However, here that 1 bit has been explicitly written.

Now if you have a sequence of n adds like this to do

    z1 = x1 + y1
    z2 = x2 + y2
    .    .    .
    .    .    .
    .    .    .
    zn = xn + yn
You get the following timing chart:
Cycle   C          S          A          N        Out          
1       x1,y1  
2       x2,y2    x1,y1  
3       x3,y3    x2,y2       x1,y1 
4       x4,y4    x3,y3       x2,y2      x1,y1   
5       x5,y5    x4,y4       x3,y3      x2,y2       z1  
6       x6,y6    x5,y5       x4,y4      x3,y3       z2  
7       x7,y7    x6,y6       x5,y5      x4,y4       z3  
   .    .                                          .         
   .    .                                          .         
   .    .                                          .         
n       xn,yn   xn-1,yn-1    xn-2,yn-2  xn-3,yn-3    zn-4  
n+1               xn,yn      xn-1,yn-1  xn-2,yn-2   zn-3  
n+2                            xn,yn    xn-1,yn-1   zn-2  
n+3                                       xn,yn     zn-1  
n+4                                                 zn
Some notes:

Pipelining Models

A timing model for a pipeline is easily developed, but at the cost of (temporarily) ignoring what we already know about memory hierarchies. But one headache at a time, right? Also, we will only model pipelines for operations (not instruction pipelining) to make life easier. So we are dealing with vector pipelining.

Starting up a pipeline costs time, and that needs to be accounted for. Let

Then the time to compute without pipelining on a vector of length n Let n½ = number of operations which could be done at the asymptotic rate during a time equal to startup time. Then n½ = t0/tau (why?). So the time to carry out the complete process on a vector of length n with pipelining is

tp(n) = (n + n½)/rinf,

a linear function in n. What is the x-intercept of this linear function? Try graphing it, identifying all the quantities like intercepts and slope.

The rate of computation for a vector of length n is

rn = n/tp = (nrinf)/(n+n½) = (rinf)/(1 + rho),

where rho = n½/n. So n½ is the vector length for which half the asymptotic rate is achieved, explaining its subscript.

How would you compute n½ and rinf for a given machine? What problem might be encountered? Why did we only model the start-up time, but not the time it takes to drain a pipeline?


Computing n½ and rinf

The model is
tp(n) = (n + n½)/rinf = r-1inf n½ + r-1 inf n,

so measure times t1, t2, ... , tk for k values of n: n1, n2, ... , nk. Each ``observation'' takes the form

ti = r-1 inf n½ + r-1 inf ni,

which we can write as a linear system Ax = b, where A is the k × 2 matrix

     |n1         1 |
     |n2         1 |
 A = |n3         1 |
     |.          . |
     |.          . |
     |.          . |
     |nk         1 |

Here b is the k × 1 vector b = (t1, t2, ... , tk)T, and x is the 2 × 1 vector ( r-1 inf       r-1 inf n½)T.

Since there are only two unknowns and k > 2 observations, this is a least squares problem: find the value of x that minimizes the two-norm of Ax -b. Matlab has an easy way of doing this: x = A\b. Then rinf and n½ can be extracted from the two components in the solution x.