When using either of the two versions (rank-1 update and matrix-vector product) of LU factorization, the division by A(k,k) in the scaling operation can cause problems if A(k,k) is small in absolute value. Pivoting means bringing a larger element into that position by swapping rows or columns in the matrix. Partial pivoting means doing only row swaps; LU factorization is "usually stable" when partial pivoting is used. However, there are classes of problems (some two point boundary value problems in ODE's is one example) for which it can fail. For the rank-1 version this gives
for k = 1:n p = index of max element in |A(k:n,k)| piv(k) = p swap rows k and p of A: A(k,:) <---> A(p,:) A(k+1:n,k) = A(k+1:n,k)/A(k,k) A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)* A(k,k+1:n) end forNote that the first step picks out the largest element in the k-th column among the remaining unreduced part of the matrix. Also, the loop index runs to n, so that piv(n) will be defined - of course, that is just n (why?). Also, note that we swap the entire rows k and p of the array A; this will also swap the corresponding parts of L and U since we are overwriting A with those factors. It is another magic feature of LU factorization that this makes the factors come out "in the right order" as well.
Next, here is a matrix-vector product formulation with pivoting, but stated for an m by n matrix B, with m >= n.
for k = 1:n-1 B(k:m,k) = B(k:m,k) - B(k:m,1:k-1)*B(1:k-1,k) p = index of max element in |B(k:m,k)| piv(k) = p swap rows k and p of B: B(k,:) <---> B(p,:) B(k,k+1:n) = B(k,k+1:n) - B(k,1:k-1)*B(1:k-1,k+1:n) B(k+1:m,k) = B(k+1:m,k)/B(k,k) end for B(n:m,n) = B(n:m,n) - B(n:m,1:n-1)*B(1:n-1,n) B(n+1:m,n) = B(n+1:m,n)/B(n,n)Note the last two steps are new ingredients - they consist of applying the remaining updates to the rest of the matrix. Also note that we can do another pivot step in case B(n,n) is small.
Moving from the rank-1 update version of LU factorization to a matrix-vector version was done to get a better BLAS-2 kernel (in terms of load/store analysis) as the workhorse. But since the best BLAS kernel is matrix-matrix multiplication, we should try to get a BLAS-3 version. This can be done by blocking.
Blocked versions of LU factorization can be derived in the same way as was done to get the matrix-vector product version. Partition the matrix into block columns, each of width v. We can find v columns of L and v rows of U on each "iteration". Suppose that k columns/rows have already been found. The picture is

where the unknowns are shown in red and the known values in blue. This is the same as the last diagram, but with block columns replacing the unknown row/column to find at this step. In more detail showing the sizes and shapes of the systems,

Sequencing the known/unknowns in order gives:


The precise algorithm statement is left as an unassigned exercise. You should implement your resulting algorithm in Matlab to verify its correctness.
The reason for using a matrix-vector product earlier was that it gave better computational kernels. However, when using a blocked version the rank-1 update becomes a rank-v update and so becomes matrix-matrix multiplication. That kernel is the best amongst the BLAS in terms of its ratio of memory references to flops, so we can safely use it in a high-performance version of LU factorization, instead of the version above based on the blocked matrix-vector version. This also gives you a good review question to make sure you have understood things so far: write out the block version of the matrix*vector based decomposition.
So partition the matrix into block columns, each of width v. The last block column, however, may be of smaller width if v does not divide n evenly. For that reason the following algorithm sets up vb as the actual block size of the current block column.
Another point: as with the block matrix-vector (which is really a matrix-matrix version as well) shown above, the step following the scaling of subdiagonal entries consists of multiplying the following rows of the unreduced parts of A using those scaled entries. When we go to a block version, the equivalent is to apply a lower triangular L solve on the corresponding block row, as happened above. That is indicated below by the Matlab tril(*,-1) operator returning the strictly lower triangular part of its argument. That is added to the identity matrix since L is unit lower triangular. The triangular solve is applied with overwriting to the entire block row.
Note that the second step is simply a call to LU factorization with pivoting, applied to the rectangular matrix A(k:n,k:k+b-1). This is why we stated the matrix-vector product form of LU factorization above for a rectangular matrix - so it could be used as a function here. Step four is two calls to the BLAS routine dswap(), step 5 a call to dtrsm(), and step 6 a call to dgemm().
Before diving in and trying to implement the above algorithm, you should first draw the corresponding block pictures as was done above for the other versions.
Finally, all the above does is to produce the LU factorization. A complete solver will need to also do the triangular solves (trivially done with a BLAS library: two calls to dtrsv() ). And what do we do if there is a zero or very small pivot entry encountered? Quit, keep going, go to full pivoting? In terms of implementing a scientific computing library, you should return to the user enough information so that the user can make that difficult decision - not your code.