Final, Computer Science P573


Perform a load/store analysis for a lower triangular solve, both for the single and multiple right-hand sides cases.


Single right-hand side case

The single right-hand side case is to solve
L*u = v
for the n × 1 vector u, where L is a lower triangular n × n matrix and v is a given n × 1 vector. Since L is lower triangular, it only requires storing (n*(n+1))/2 doubles, while u and v each require n doubles. That makes getting the minimal possible number of memory references easy.

One pseudo-code algorithm for solving such a lower triangular system is



    for k = 1:n
        u(k) = v(k)
        for i = 1:k-1
            u(k) = u(k) - L(k,i)*u(i)
        end for
        u(k) = u(k)/L(k,k)
    end for


and that can be used to get the count of flops. From there, finding the memory reference to flop ratio and its asymptotic value is easy. Do so, then answer the following:
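As a concrete point of reference (not part of the required analysis), here is a minimal NumPy sketch of the pseudo-code above; the function name and test matrix are just illustrative choices:

```python
import numpy as np

def lower_triangular_solve(L, v):
    """Solve L*u = v by forward substitution, following the
    row-oriented pseudo-code above. L is n x n lower triangular."""
    n = len(v)
    u = np.empty(n)
    for k in range(n):
        u[k] = v[k]
        for i in range(k):              # accumulate the dot product with row k of L
            u[k] -= L[k, i] * u[i]
        u[k] /= L[k, k]                 # one division per row
    return u
```

Counting the flops in the two loops of this version gives the operation count needed for the ratio.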

  1. A lower triangular solve with one right-hand side is a level-2 BLAS operation (why?). So how does it compare to the other two such creatures you've seen so far, matrix-vector multiply and rank-1 update? "Compare" means in terms of which should run faster/slower.
  2. The algorithm above can actually use a single array for both u and v, which is how it is usually done. Does overwriting v this way change the load/store analysis?
  3. What level-1 BLAS operation is implemented by the innermost loop in the code above? Does the load/store analysis imply that implementing it as a sequence of those level-1 BLAS operations is near-optimal, or that there is a better way?
  4. The inner and outer loops can be interchanged in the algorithm given above, provided the division by the diagonal elements L(k,k) is handled properly. In that case, what vector operation does the innermost loop consist of?
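For question 4, it may help to experiment with the loop-interchanged version. A sketch (again just illustrative, assuming the division is moved ahead of the update):

```python
import numpy as np

def lower_triangular_solve_col(L, v):
    """Column-oriented forward substitution: the loops of the earlier
    pseudo-code are interchanged, and the division by L(k,k) comes first."""
    n = len(v)
    u = v.astype(float)                 # overwrite-style: u starts as a copy of v
    for k in range(n):
        u[k] /= L[k, k]
        u[k+1:] -= L[k+1:, k] * u[k]    # update remaining entries using column k of L
    return u
```

Look at what the vectorized update line is doing to identify the level-1 BLAS operation involved.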


Multiple right-hand side case

The major use (in terms of computational time and work) for triangular solves is where the vectors u and v are replaced by general n × p matrices U and V, where generally p grows proportionally with n. Sometimes this is also called the block triangular solve.

An algorithm for doing this is simply to do the single right-hand side case p times, once for each column of V. So this means invoking the pseudo-code above for j = 1:p, solving

L*u_j = v_j
where u_j is the j-th column of U and v_j is the j-th column of V. At least doing so is suitable for getting a flop count for the multiple right-hand side version of triangular solve, so ... do that.
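A self-contained sketch of that column-by-column approach (the function name is illustrative; each column gets its own independent forward substitution):

```python
import numpy as np

def block_lower_triangular_solve(L, V):
    """Multiple right-hand side solve L*U = V, done one column of V
    at a time as a sequence of single right-hand side solves."""
    n, p = V.shape
    U = V.astype(float)                 # overwrite-style copy of V
    for j in range(p):                  # loop over the p right-hand sides
        for k in range(n):              # forward substitution on column j
            for i in range(k):
                U[k, j] -= L[k, i] * U[i, j]
            U[k, j] /= L[k, k]
    return U
```

Note that L is traversed in full once per right-hand side here, which is worth keeping in mind for the load/store questions below.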
  1. What level of BLAS is this operation?
  2. What is the load-store ratio for the multiple right-hand side triangular solve? Keep in mind that we are assuming p grows proportionally with n, so for the asymptotic case letting n → ∞ suffices to also drive p → ∞.
  3. What does this imply about the performance of the multiple right-hand side lower triangular solve?
  4. What does your load/store analysis imply about implementing the multiple right-hand side version as indicated above, as a sequence of single right-hand side solves?


More!

At this point you should be able to make some tentative conjectures (or leaps of faith) about the following:
  1. What if anything about the load/store analysis changes for solving an upper triangular system U*x = y? What about a multiple right-hand side upper triangular system?
  2. In LU factorization for solving a linear system of equations, the lower triangular matrix L will be "unit lower triangular", meaning the diagonal elements are all 1.0. In that case they are not stored explicitly and are never referenced. That reduces the amount of memory needed both theoretically and in practice. Does that change the asymptotic load/store analysis results?
For this last set, make an educated guess by taking a step back and considering what your previous computation of memory references and flops for triangular solves showed. You should be able to carry out the details, but it's a good idea to form an idea of what those details should show in advance.

Handin

Don't waste time trying to type in and format your answers in MS Word, Latex, Lotus Notes, or Incan quipu. It's OK if you have lots of time and just want to learn how to typeset math in Latex (or tie knots in llama wool strings), but unless that's something you need to do for research papers or a dissertation, it's a waste of your time. You can just hand write it and then hand it to me, put it into my mailbox on the first floor of Luddy Hall, or send it by campus mail.

As always, if you work with or talk to someone other than me about this, just make a note of it on your handin.