Pipelining

Pipelining is pretty much what it sounds like, although a better mental image is probably an assembly line, where objects move along a conveyor belt through a series of stations. At any one time several objects are being worked on simultaneously, with a different operation being performed at each station.


Instruction Pipelining

Most computer operations consist of multiple stages. E.g., floating point addition requires comparing exponents, shifting mantissas, adding the mantissas, and then normalizing the result. Pipelining keeps the hardware components for those stages simultaneously busy. First consider instruction pipelining.


Instruction Pipelining vs. Time

[pipeline.gif]

One problem with instruction pipelining is that a "conditional branch" prevents knowing which instruction is next. For example:

      for k = 1:n
         y(k) = y(k) + a*x(k)
      end for
      s = 3
is turned into the sequence of instructions
     k = 1
(*)  load reg1 with x(k)
     load reg2 with y(k)
     reg2 = a*reg1 + reg2
     store reg2 into y(k)
     k = k + 1
     if (k <= n) go to (*)
     s = 3
Here reg1 and reg2 designate registers. Which instruction will follow the "if" statement, the load at (*) or the assignment s = 3? That is unknown until after the test is evaluated, introducing a "bubble" into the pipeline.


Unrolling

One common technique for enhancing pipelining is loop unrolling. These examples use an unrolling depth of 4, but in practice deeper unrolling is often more effective; determining the optimal unrolling depth usually requires experimentation and some knowledge of the target CPU's resources. Unrolling turns the saxpy code

      for k = 1:n
         y(k) = y(k) + a*x(k)
      end for
into two loops:
      for k = 1:mod(n,4)
         y(k)   = a*x(k)   + y(k)
      end for

      for k = mod(n,4)+1:4:n (the middle 4 in the triplet means "in steps of 4")
         y(k)   = a*x(k)   + y(k)
         y(k+1) = a*x(k+1) + y(k+1)
         y(k+2) = a*x(k+2) + y(k+2)
         y(k+3) = a*x(k+3) + y(k+3)
      end for
How does this help? The four statements in the unrolled loop body are independent of one another, so they can flow through the pipeline together, and the loop overhead (incrementing k and testing it) is paid once per four updates instead of once per update. When is loop unrolling going to be a loser in terms of efficiency? And why on earth doesn't the optimizing compiler do this for you? After all, this is a simple, mindless rearrangement of instructions which don't depend on one another.
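For concreteness, here is the unrolled saxpy written out as runnable Matlab (the function name saxpy_unrolled is made up for this sketch; Matlab's own vector arithmetic would of course be faster than any explicit loop):

      function y = saxpy_unrolled(a, x, y)
      % y = y + a*x with the loop unrolled to depth 4, mirroring the pseudocode above
      n = length(x);
      for k = 1:mod(n,4)                  % cleanup loop for the leftover n mod 4 entries
         y(k)   = a*x(k)   + y(k);
      end
      for k = mod(n,4)+1:4:n              % main loop, in steps of 4
         y(k)   = a*x(k)   + y(k);
         y(k+1) = a*x(k+1) + y(k+1);
         y(k+2) = a*x(k+2) + y(k+2);
         y(k+3) = a*x(k+3) + y(k+3);
      end
      end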

As another example, consider the dotproduct of vectors x and y:

      s = 0.0
      for k = 1:n
         s = s + x(k)*y(k)
      end for
Straightforward unrolling turns the code into
      s = 0.0
      for k = 1:mod(n,4)
         s  = s + x(k)*y(k)
      end for
      for k = mod(n,4)+1:4:n
         s  = s + x(k  )*y(k  )
                + x(k+1)*y(k+1)
                + x(k+2)*y(k+2)
                + x(k+3)*y(k+3)
      end for
This reduces loop overhead, but still won't execute rapidly. The problem is the dependency between iterations: an iteration cannot begin before the earlier one is completely finished, because the variable s must be written on iteration k before it can be read on iteration k+4. We can remove that dependency by using four independent partial sums and adding them together after the loop finishes:
      s  = 0.0
      s1 = 0.0
      s2 = 0.0
      s3 = 0.0
      for k = 1:mod(n,4)
         s  = s + x(k)*y(k)
      end for
      for k = mod(n,4)+1:4:n
         s   =  s + x(k  )*y(k  )
         s1  = s1 + x(k+1)*y(k+1)
         s2  = s2 + x(k+2)*y(k+2)
         s3  = s3 + x(k+3)*y(k+3)
      end for
      s = s + s1 + s2 + s3
This runs faster, but there is something "wrong" with it that was not encountered with the saxpy operation: an optimizing compiler cannot make this change without risking a change in the results of your computations. What is the problem?
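A hint: floating point addition is not associative, so regrouping a sum can change the computed result. A one-line Matlab experiment shows this:

      (0.1 + 0.2) + 0.3  -  (0.1 + (0.2 + 0.3))   % nonzero (about 1.1e-16) in IEEE double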


Experimental Effects of Loop Unrolling on Dotproduct and Saxpy

The important question is, does pipelining really help much? The answer is "yes", but for simple cases like saxpy or dotproduct you should let a compiler handle the details of optimization using unrolling.
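To give the flavor of how such rates can be measured, here is a minimal Matlab timing sketch (the vector length n and trial count are arbitrary choices; the experiments reported below used compiled code, so Matlab's absolute numbers will differ, but the methodology is the same):

      n = 100000;  trials = 100;                  % n chosen divisible by 4
      a = 1.1;  x = rand(n,1);  y = rand(n,1);

      tic                                         % standard saxpy loop
      for t = 1:trials
         for k = 1:n
            y(k) = a*x(k) + y(k);
         end
      end
      t_rolled = toc;

      tic                                         % saxpy unrolled to depth 4
      for t = 1:trials
         for k = 1:4:n
            y(k)   = a*x(k)   + y(k);
            y(k+1) = a*x(k+1) + y(k+1);
            y(k+2) = a*x(k+2) + y(k+2);
            y(k+3) = a*x(k+3) + y(k+3);
         end
      end
      t_unrolled = toc;

      % saxpy does 2 flops per element, so the rates in Mflop/s are
      rates = 2*n*trials ./ ([t_rolled, t_unrolled] * 1.0e6)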

In the next graph the computational rate in Mflop/s (millions of floating point operations per second) is plotted against the length of the vector on which the dotproduct and saxpy (vector update) operations are performed. The curves whose labels carry the prefix "un" come from the unrolled versions of the corresponding loops.

[unrolling.png]

That does provide a good boost, but consider the effects of compiler optimization, measured here on a quad-core Intel i7 machine. The next four plots show the ratio of the computational rate of the unrolled loops to that of the standard loops for three kernels:

  1. dotpxx, the dotproduct of a vector with itself: α = x^T x
  2. dotp, the dotproduct of two different vectors: α = x^T y
  3. saxpy, the update of a vector by a scalar times another vector: y = y + α*x
For each of the next four plots, a horizontal line is drawn at y = 1; ratios above that line indicate that the unrolled code is faster.

Compiler Optimization -O0
[unroll0.png]

Compiler Optimization -O1
[unroll1.png]

Compiler Optimization -O2
[unroll2.png]

Compiler Optimization -O3
[unroll3.png]

As soon as optimization level -O2 is applied, the hand-unrolled code is always slower than the straightforward loops without unrolling. The largest advantage for unrolling, a factor of about 2.5 at -O0, matches the general rule that floating point units are pipelined to a depth of 2-3. In any case, the conclusion is: for simple vector operations, let the GNU compiler (or Intel's, or Portland Group's, or IBM's) do the heavy lifting and handle this optimization for you.


So Why Bother?

Unrolling does improve the performance, but why bother if compilers take care of it so well? The answer is that compilers cannot reliably recognize opportunities to improve pipelining, especially when data dependencies intervene. Those opportunities can occur anywhere, not just in loops implementing basic vector operations. For example, a code for solving linear systems of equations with a tridiagonal coefficient matrix stores the three nonzero diagonals in arrays c( ), d( ), and e( ). During the solve, "pivoting" involves swapping entries in those vectors, which in one package is done as:

    t      = c(kp1)
    c(kp1) = c(k)
    c(k)   = t

    t      = d(kp1)
    d(kp1) = d(k)
    d(k)   = t

    t      = e(kp1)
    e(kp1) = e(k)
    e(k)   = t
[The index variable kp1 is k+1.] The performance problem is that the same temporary variable t is used in three consecutive blocks, so each swap must complete before the next one can start. Instead, use three temporaries:
    t1     = c(kp1)
    c(kp1) = c(k)
    c(k)   = t1

    t2     = d(kp1)
    d(kp1) = d(k)
    d(k)   = t2

    t3     = e(kp1)
    e(kp1) = e(k)
    e(k)   = t3
This change made the overall solver run 30% faster. Always consider pipelining effects, both within loops and elsewhere, when developing high-performance kernels. The question is how much effort to put into that, and the analytic model of pipelining developed below will help answer it.


Memory and Flop Pipelining

Instruction pipelining has several potential problems. Fortunately, in scientific computing most pipelining is of repeated operations such as memory accesses and floating point operations, and those are not as difficult to pipeline as a general instruction cycle. As an example, consider floating point addition, which proceeds in four stages: compare exponents (C), shift mantissas (S), add mantissas (A), and normalize the result (N). To save typing, the processor is assumed to have 4-bit floating point units:
  C:
       x  =  2^0 x 0.1011
       y  =  2^1 x 0.1100
  S:
       x  =  2^1 x 0.0101
       y  =  2^1 x 0.1100
  A:
       z  =  2^1 x 1.0001
  N:
       z  =  2^2 x 0.1000
"Normalizing" the result means making a "1" bit the first significant bit as the IEEE floating point representation requires. In that case the leading 1 bit does not need to be explicitly stored. However, here the leading 1 bit has been explicitly written.

For a sequence of n additions like this, the operations are

    z1 = x1 + y1
    z2 = x2 + y2
    .    .    .
    .    .    .
    .    .    .
    zn = xn + yn
This generates the following timing chart:
Cycle   C          S          A          N          Out
1       x1,y1
2       x2,y2      x1,y1
3       x3,y3      x2,y2      x1,y1
4       x4,y4      x3,y3      x2,y2      x1,y1
5       x5,y5      x4,y4      x3,y3      x2,y2      z1
6       x6,y6      x5,y5      x4,y4      x3,y3      z2
7       x7,y7      x6,y6      x5,y5      x4,y4      z3
.       .          .          .          .          .
n       xn,yn      xn-1,yn-1  xn-2,yn-2  xn-3,yn-3  zn-4
n+1                xn,yn      xn-1,yn-1  xn-2,yn-2  zn-3
n+2                           xn,yn      xn-1,yn-1  zn-2
n+3                                      xn,yn      zn-1
n+4                                                 zn
Important features to notice:

  1. The pipeline needs 4 cycles to fill: the first result z1 does not emerge until cycle 5.
  2. Once the pipeline is full, one result is delivered every cycle.
  3. All n additions finish after n + 4 cycles, instead of the 4n cycles needed if each addition had to complete before the next could begin.
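A toy Matlab simulation of this chart, included only as an illustration, pushes operand pairs through the four stages one cycle at a time and confirms the n + 4 cycle count:

      n = 8;                             % number of additions
      s = 4;                             % stages: C, S, A, N
      slots = zeros(1, s);               % slots(j) = index of the pair in stage j (0 = bubble)
      cycle = 0;  done = 0;  next = 1;
      while done < n
         cycle = cycle + 1;
         if slots(s) > 0                 % the pair leaving stage N delivers a result
            done = done + 1;
         end
         slots(2:s) = slots(1:s-1);      % every pair advances one stage
         if next <= n
            slots(1) = next;  next = next + 1;
         else
            slots(1) = 0;                % no pairs left: a bubble enters the pipe
         end
      end
      fprintf('%d additions finished after %d cycles\n', n, cycle)   % prints n + 4 = 12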


Modeling Pipelining

A timing model for a pipeline is easily developed, but at the cost of (temporarily) ignoring any memory hierarchies. One headache at a time, right? Also, only pipelines for operations (not instruction pipelining) will be addressed, to make life easier. This means the model is for vector pipelining only.

Starting up a pipeline costs time, and that needs to be accounted for. Let t0 = the startup time of the pipeline, measured in units of the time to produce one result once the pipeline is full, and let r∞ = the asymptotic rate of computation.

Now compare the timings with and without pipelining. Let n½ = the number of operations that could be done at the asymptotic rate during a time equal to the startup time. Then n½ = t0 (why?). So the time to carry out the complete process on a vector of length n with pipelining is

tp(n) = (n + n½)/r∞ ,

which is again a linear function in n . What is the x-intercept of this linear function? Try graphing it, identifying all the usual linear parameters like intercepts and slope.

r∞ is the asymptotic rate of computation for an infinitely long vector. The actual rate of computation for a vector of length n is

rn = n/tp = (n r∞)/(n + n½) = r∞/(1 + ρ) ,

where ρ = n½/n. That means n = n½ is the vector length that achieves half of the asymptotic rate, explaining why it has ½ as a subscript. The model gives the rate of computation versus vector length. Since machines can vary in their peak speed, let's normalize the graph by plotting (rn/r∞) versus (n/n½)

[pics/rinf.png]

Some values have been marked: when (n/n½) = 1, then (rn/r∞) = 0.5 -- which just restates the definition of n½. To get 90% of the asymptotic peak performance, the vector length must satisfy (n/n½) = 9, or equivalently, n = 9·n½.

[The Matlab script that created the image above is an example of how to add bells and whistles to a plot in Matlab.]
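A minimal version of that plot, without the bells and whistles, might look like:

      nratio = linspace(0.1, 10, 200);       % n/n½ on the horizontal axis
      rho    = 1 ./ nratio;                  % since rho = n½/n
      rate   = 1 ./ (1 + rho);               % rn/r∞ = 1/(1 + rho)
      plot(nratio, rate), grid on
      xlabel('n / n_{1/2}'), ylabel('r_n / r_\infty')
      hold on, plot([1 9], [0.5 0.9], 'ro')  % the two marked points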

Questions: How can n½ and r∞ be computed for a given machine? What problem might be encountered? Why was the start-up time modeled, but not the time it takes to drain a pipeline?


Computing n½ and r∞ from timings

The analytic model is
tp(n) = (n + n½)/r∞ = (1/r∞) n½ + (1/r∞) n ,

so time the vector operation and get t1, t2, ... , tk for k corresponding values of n: n1, n2, ... , nk. Each "observation" takes the form

ti = (1/r∞) n½ + (1/r∞) ni ,

which gives a linear system Ax = b, where A is the k × 2 matrix

     |n1         1 |
     |n2         1 |
 A = |n3         1 |
     |.          . |
     |.          . |
     |.          . |
     |nk         1 |
The observations can (and should) have repetitions: many timings can be made for a single vector size like n = 100 and added as rows in A and timings in b. Here b is the k × 1 vector b = (t1, t2, ... , tk)^T, and x is the 2 × 1 vector x = (1/r∞ , n½/r∞)^T.

Since there are only two unknowns and usually k >> 2 observations, this is a least squares problem: find the value of x that minimizes the two-norm of Ax − b. Matlab has an easy way of doing this: x = A\b. Then r∞ and n½ can be extracted from the two components in the solution x.
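As a sketch in Matlab (assuming the measured lengths and times are already stored in vectors nvals and tvals, names made up here):

      A = [nvals(:), ones(numel(nvals),1)];   % k x 2 matrix of observations
      x = A \ tvals(:);                       % least squares fit
      r_inf  = 1 / x(1);                      % since x(1) = 1/r∞
      n_half = x(2) * r_inf;                  % since x(2) = n½/r∞
      fprintf('r_inf = %g, n_half = %g\n', r_inf, n_half)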