Pipelining

Pipelining is pretty much what it sounds like, although a better mental image is probably an assembly line, where objects move along a conveyor belt through a series of stations. At any one time several objects are being worked on simultaneously, with a different operation being performed at each station.


Instruction Pipelining

Most computer operations consist of multiple stages. E.g., floating point addition requires comparing exponents, shifting mantissas, adding the mantissas, and then normalizing the result. Pipelining keeps the hardware components for those stages simultaneously busy. First consider instruction pipelining.


Instruction Pipelining vs. Time

[pipeline.gif]

One problem with instruction pipelining is that a "conditional branch" prevents knowing which instruction is next. For example:

      for k = 1:n
         y(k) = y(k) + a*x(k)
      end for
      s = 3
is turned into the sequence of instructions
     k = 1
(*)  load reg1 with x(k)
     load reg2 with y(k)
     reg2 = a*reg1 + reg2
     store reg2 into y(k)
     k = k + 1
     if (k <= n) go to (*)
     s = 3
Here reg1 and reg2 designate registers. Which instruction will follow the "if" statement, the load at (*) or the assignment s = 3? That is unknown until after the test is evaluated, introducing a "bubble" into the pipeline.


Unrolling

One common technique for enhancing pipelining is loop unrolling. These examples use an unrolling depth of 4, but in practice deeper unrolling is often more effective; determining the optimal unrolling depth usually requires experimentation and some knowledge of the target CPU's resources. Unrolling turns the saxpy code

      for k = 1:n
         y(k) = y(k) + a*x(k)
      end for
into two loops:
      for k = 1:mod(n,4)
         y(k)   = a*x(k)   + y(k)
      end for

      for k = mod(n,4)+1:4:n (the middle 4 in the triplet means "in steps of 4")
         y(k)   = a*x(k)   + y(k)
         y(k+1) = a*x(k+1) + y(k+1)
         y(k+2) = a*x(k+2) + y(k+2)
         y(k+3) = a*x(k+3) + y(k+3)
      end for
How does this help? The four statements in the unrolled loop body are independent of one another, so they can flow through the pipeline together, and the loop overhead (incrementing k and testing it) is paid once per four updates instead of once per update. When is loop unrolling going to be a loser in terms of efficiency? And why on earth doesn't the optimizing compiler do this for you? After all, this is a simple, mindless rearrangement of instructions which don't depend on one another.
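For concreteness, here is the unrolled saxpy written out as runnable Matlab (the function name saxpy_unrolled is made up for this sketch; Matlab's own vector arithmetic would of course be faster than any explicit loop):

      function y = saxpy_unrolled(a, x, y)
      % y = y + a*x with the loop unrolled to depth 4, mirroring the pseudocode above
      n = length(x);
      for k = 1:mod(n,4)                  % cleanup loop for the leftover n mod 4 entries
         y(k)   = a*x(k)   + y(k);
      end
      for k = mod(n,4)+1:4:n              % main loop, in steps of 4
         y(k)   = a*x(k)   + y(k);
         y(k+1) = a*x(k+1) + y(k+1);
         y(k+2) = a*x(k+2) + y(k+2);
         y(k+3) = a*x(k+3) + y(k+3);
      end
      end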

As another example, consider the dotproduct of vectors x and y:

      s = 0.0
      for k = 1:n
         s = s + x(k)*y(k)
      end for
Straightforward unrolling turns the code into
      s = 0.0
      for k = 1:mod(n,4)
         s  = s + x(k)*y(k)
      end for
      for k = mod(n,4)+1:4:n
         s  = s + x(k  )*y(k  )
                + x(k+1)*y(k+1)
                + x(k+2)*y(k+2)
                + x(k+3)*y(k+3)
      end for
This reduces loop overhead, but still won't execute rapidly. The problem is the dependency between iterations: an iteration cannot begin before the earlier one is completely finished, because the variable s must be written on iteration k before it can be read on iteration k+4. We can remove that dependency by using four independent partial sums and adding them together after the loop finishes:
      s  = 0.0
      s1 = 0.0
      s2 = 0.0
      s3 = 0.0
      for k = 1:mod(n,4)
         s  = s + x(k)*y(k)
      end for
      for k = mod(n,4)+1:4:n
         s   =  s + x(k  )*y(k  )
         s1  = s1 + x(k+1)*y(k+1)
         s2  = s2 + x(k+2)*y(k+2)
         s3  = s3 + x(k+3)*y(k+3)
      end for
      s = s + s1 + s2 + s3
This runs faster, but there is something "wrong" with it that was not encountered with the saxpy operation: an optimizing compiler cannot make this change without risking a change in the results of your computations. What is the problem?
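A hint: floating point addition is not associative, so regrouping a sum can change the computed result. A one-line Matlab experiment shows this:

      (0.1 + 0.2) + 0.3  -  (0.1 + (0.2 + 0.3))   % nonzero (about 1.1e-16) in IEEE double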


Experimental Effects of Loop Unrolling on Dotproduct and Saxpy

The important question is, does pipelining really help much? The answer is "yes", but for simple cases like saxpy or dotproduct you should let a compiler handle the details of optimization using unrolling.
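To give the flavor of how such rates can be measured, here is a minimal Matlab timing sketch (the vector length n and trial count are arbitrary choices; the experiments reported below used compiled code, so Matlab's absolute numbers will differ, but the methodology is the same):

      n = 100000;  trials = 100;                  % n chosen divisible by 4
      a = 1.1;  x = rand(n,1);  y = rand(n,1);

      tic                                         % standard saxpy loop
      for t = 1:trials
         for k = 1:n
            y(k) = a*x(k) + y(k);
         end
      end
      t_rolled = toc;

      tic                                         % saxpy unrolled to depth 4
      for t = 1:trials
         for k = 1:4:n
            y(k)   = a*x(k)   + y(k);
            y(k+1) = a*x(k+1) + y(k+1);
            y(k+2) = a*x(k+2) + y(k+2);
            y(k+3) = a*x(k+3) + y(k+3);
         end
      end
      t_unrolled = toc;

      % saxpy does 2 flops per element, so the rates in Mflop/s are
      rates = 2*n*trials ./ ([t_rolled, t_unrolled] * 1.0e6)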

In the next graph the computational rate in Mflop/s (millions of floating point operations per second) is plotted against the length of the vector on which the dotproduct and saxpy (vector update) operations are performed. The curves whose labels carry the prefix "un" come from the unrolled versions of the corresponding loops.

[unrolling.png]

That does provide a good boost, but consider the effects of compiler optimization, measured here on a quad-core Intel i7 machine. The next four plots show the ratio of the computational rate of the unrolled loops to that of the standard loops for three kernels:

  1. dotpxx, the dotproduct of a vector with itself: α = x^T x
  2. dotp, the dotproduct of two different vectors: α = x^T y
  3. saxpy, the update of a vector by a scalar times another vector: y = y + α*x
For each of the next four plots, a horizontal line is drawn at y = 1; ratios above that line indicate that the unrolled code is faster.

Compiler Optimization -O0
[unroll0.png]

Compiler Optimization -O1
[unroll1.png]

Compiler Optimization -O2
[unroll2.png]

Compiler Optimization -O3
[unroll3.png]

As soon as optimization level -O2 is applied, the hand-unrolled code is always slower than the straightforward loops without unrolling. The largest advantage for unrolling, a factor of about 2.5 at -O0, matches the general rule that floating point units are pipelined to a depth of 2-3. In any case, the conclusion is: for simple vector operations, let the GNU compiler (or Intel's, or Portland Group's, or IBM's) do the heavy lifting and handle this optimization for you.


So Why Bother?

Unrolling does improve the performance, but why bother if compilers take care of it so well? The answer is that compilers cannot reliably recognize opportunities to improve pipelining, especially when data dependencies intervene. Those opportunities can occur anywhere, not just in loops implementing basic vector operations. For example, a code for solving linear systems of equations with a tridiagonal coefficient matrix stores the three nonzero diagonals in arrays c( ), d( ), and e( ). During the solve, "pivoting" involves swapping entries in those vectors, which in one package is done as:

    t      = c(kp1)
    c(kp1) = c(k)
    c(k)   = t

    t      = d(kp1)
    d(kp1) = d(k)
    d(k)   = t

    t      = e(kp1)
    e(kp1) = e(k)
    e(k)   = t
[The index variable kp1 is k+1.] The performance problem is that the same temporary variable t is used in three consecutive blocks, so each swap must complete before the next one can start. Instead, use three temporaries:
    t1     = c(kp1)
    c(kp1) = c(k)
    c(k)   = t1

    t2     = d(kp1)
    d(kp1) = d(k)
    d(k)   = t2

    t3     = e(kp1)
    e(kp1) = e(k)
    e(k)   = t3
This change made the overall solver run 30% faster. Always consider pipelining effects, both within loops and elsewhere, when developing high-performance kernels. The question is how much effort to put into that, and the analytic model of pipelining developed below will help answer it.


Memory and Flop Pipelining

Instruction pipelining has several potential problems. Fortunately, in scientific computing most pipelining is of repeated operations such as memory accesses and floating point operations, and those are not as difficult to pipeline as a general instruction cycle. As an example, consider floating point addition, which proceeds in four stages: compare exponents (C), shift mantissas (S), add mantissas (A), and normalize the result (N). To save typing, the processor is assumed to have 4-bit floating point units:
  C:
       x  =  2^0 x 0.1011
       y  =  2^1 x 0.1100
  S:
       x  =  2^1 x 0.0101
       y  =  2^1 x 0.1100
  A:
       z  =  2^1 x 1.0001
  N:
       z  =  2^2 x 0.1000
"Normalizing" the result means making a "1" bit the first significant bit as the IEEE floating point representation requires. In that case the leading 1 bit does not need to be explicitly stored. However, here the leading 1 bit has been explicitly written.

For a sequence of n additions like this, the operations are

    z1 = x1 + y1
    z2 = x2 + y2
    .    .    .
    .    .    .
    .    .    .
    zn = xn + yn
This generates the following timing chart:
Cycle   C          S          A          N          Out
1       x1,y1
2       x2,y2      x1,y1
3       x3,y3      x2,y2      x1,y1
4       x4,y4      x3,y3      x2,y2      x1,y1
5       x5,y5      x4,y4      x3,y3      x2,y2      z1
6       x6,y6      x5,y5      x4,y4      x3,y3      z2
7       x7,y7      x6,y6      x5,y5      x4,y4      z3
.       .          .          .          .          .
n       xn,yn      xn-1,yn-1  xn-2,yn-2  xn-3,yn-3  zn-4
n+1                xn,yn      xn-1,yn-1  xn-2,yn-2  zn-3
n+2                           xn,yn      xn-1,yn-1  zn-2
n+3                                      xn,yn      zn-1
n+4                                                 zn
Important features to notice:

  1. The pipeline needs 4 cycles to fill: the first result z1 does not emerge until cycle 5.
  2. Once the pipeline is full, one result is delivered every cycle.
  3. All n additions finish after n + 4 cycles, instead of the 4n cycles needed if each addition had to complete before the next could begin.
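A toy Matlab simulation of this chart, included only as an illustration, pushes operand pairs through the four stages one cycle at a time and confirms the n + 4 cycle count:

      n = 8;                             % number of additions
      s = 4;                             % stages: C, S, A, N
      slots = zeros(1, s);               % slots(j) = index of the pair in stage j (0 = bubble)
      cycle = 0;  done = 0;  next = 1;
      while done < n
         cycle = cycle + 1;
         if slots(s) > 0                 % the pair leaving stage N delivers a result
            done = done + 1;
         end
         slots(2:s) = slots(1:s-1);      % every pair advances one stage
         if next <= n
            slots(1) = next;  next = next + 1;
         else
            slots(1) = 0;                % no pairs left: a bubble enters the pipe
         end
      end
      fprintf('%d additions finished after %d cycles\n', n, cycle)   % prints n + 4 = 12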


Modeling Pipelining

A timing model for a pipeline is easily developed, but at the cost of (temporarily) ignoring any memory hierarchies. One headache at a time, right? Also, only pipelines for operations (not instruction pipelining) will be addressed, to make life easier. This means the model is for vector pipelining only.

Starting up a pipeline costs time, and that needs to be accounted for. Let t0 = the startup time of the pipeline, measured in units of the time to produce one result once the pipeline is full, and let r∞ = the asymptotic rate of computation.

Now compare the timings with and without pipelining. Let n½ = the number of operations that could be done at the asymptotic rate during a time equal to the startup time. Then n½ = t0 (why?). So the time to carry out the complete process on a vector of length n with pipelining is

tp(n) = (n + n½)/r∞ ,

which is again a linear function in n . What is the x-intercept of this linear function? Try graphing it, identifying all the usual linear parameters like intercepts and slope.

r∞ is the asymptotic rate of computation for an infinitely long vector. The actual rate of computation for a vector of length n is

rn = n/tp = (n r∞)/(n + n½) = r∞/(1 + ρ) ,

where ρ = n½/n. That means n = n½ is the vector length that achieves half of the asymptotic rate, explaining why it has ½ as a subscript. The model gives the rate of computation versus vector length. Since machines can vary in their peak speed, let's normalize the graph by plotting (rn/r∞) versus (n/n½)

[pics/rinf.png]

Some values have been marked: when (n/n½) = 1, then (rn/r∞) = 0.5 -- which just restates the definition of n½. To get 90% of the asymptotic peak performance, the vector length must satisfy (n/n½) = 9, or equivalently, n = 9·n½.

[The Matlab script that created the image above is an example of how to add bells and whistles to a plot in Matlab.]
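A minimal version of that plot, without the bells and whistles, might look like:

      nratio = linspace(0.1, 10, 200);       % n/n½ on the horizontal axis
      rho    = 1 ./ nratio;                  % since rho = n½/n
      rate   = 1 ./ (1 + rho);               % rn/r∞ = 1/(1 + rho)
      plot(nratio, rate), grid on
      xlabel('n / n_{1/2}'), ylabel('r_n / r_\infty')
      hold on, plot([1 9], [0.5 0.9], 'ro')  % the two marked points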

Questions: How can n½ and r∞ be computed for a given machine? What problem might be encountered? Why was the start-up time modeled, but not the time it takes to drain a pipeline?


Computing n½ and r∞ from timings

The analytic model is
tp(n) = (n + n½)/r∞ = (1/r∞) n½ + (1/r∞) n ,

so time the vector operation and get t1, t2, ... , tk for k corresponding values of n: n1, n2, ... , nk. Each "observation" takes the form

ti = (1/r∞) n½ + (1/r∞) ni ,

which gives a linear system Ax = b, where A is the k × 2 matrix

     |n1         1 |
     |n2         1 |
 A = |n3         1 |
     |.          . |
     |.          . |
     |.          . |
     |nk         1 |
The observations can (and should) have repetitions: many timings can be made for a single vector size like n = 100 and added as rows in A and timings in b. Here b is the k × 1 vector b = (t1, t2, ... , tk)^T, and x is the 2 × 1 vector x = (1/r∞ , n½/r∞)^T.

Since there are only two unknowns and usually k >> 2 observations, this is a least squares problem: find the value of x that minimizes the two-norm of Ax − b. Matlab has an easy way of doing this: x = A\b. Then r∞ and n½ can be extracted from the two components in the solution x.
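As a sketch in Matlab (assuming the measured lengths and times are already stored in vectors nvals and tvals, names made up here):

      A = [nvals(:), ones(numel(nvals),1)];   % k x 2 matrix of observations
      x = A \ tvals(:);                       % least squares fit
      r_inf  = 1 / x(1);                      % since x(1) = 1/r∞
      n_half = x(2) * r_inf;                  % since x(2) = n½/r∞
      fprintf('r_inf = %g, n_half = %g\n', r_inf, n_half)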