When A is stored as a dense 2D n×n array, any potential fill-in space is already provided. For sparse matrices, the data structure must be modified unless the storage and row/column indices are pre-allocated in advance. So one feature of a data structure for a sparse matrix is how easily new entries can be added to existing rows.
Now consider applying LU factorization without pivoting to the matrix A. At the first step, A(1,1) is used to zero out A(2,1) by adding a multiple of row 1 to row 2. Doing this causes new nonzeros to be introduced in the rest of row 2, in entries A(2,3:n) shown by a red F below:
The new nonzeros introduced are indicated with a red "F", for fill-in. Next, A(1,1) is used to zero out A(3,1), introducing new nonzeros in entries A(3,2) and A(3,4:n):
Continuing this process, by the time A(1,1) has been used to zero out A(n,1), the entire array has all nonzeros in it:
Don't forget that the first column is still nonzero, since it is used to store the multipliers used in zeroing out the corresponding entries. So in this case, a sparse matrix requiring only 3n-2 entries to be stored, has LU factors that require storing n2 entries.
Renumbering the equations this way moves rows in the matrix - in this case, it interchanges the first and last rows. Next, renumber the unknowns so that the first unknown x(1) becomes the last unknown x(n), and the last unknown becomes the first. Renumbering the unknowns this way causes the columns of A to be reordered, in this case interchanging the first and last columns. The new nonzero pattern becomes
Now perform LU factorization without pivoting on this matrix - what is the resulting nonzero sparsity pattern for the L and U factors? How much fill-in occurs?
A large amount of research has gone into matrix reordering methods - originally to reduce the amount of fill-in, and now to include goals such as getting a sparsity pattern amenable to parallel computation. Matrix reordering corresponds to the labeling of nodes in a related graph from 1 to n. Different ways of labeling the graph nodes leads to different matrix sparsity patterns, and most of the algorithms are formulated in graph algorithm terminology.
For a regular N x N mesh in 2D, the cost becomes O(N3). Note O(N4) needed when treating A as a full (dense) array. Also, no reordering of PDE unknowns will give a smaller order of storage.
For a 3D uniform mesh, the cost is O(N5). This is the crucial reason 3D is such a challenge to scientific computing. For a 1 Gybte memory machine, the effective limit for 2D problems is a 11000 x 11000 mesh; for 3D it is a 500 x 500 x 500 mesh. That means we can resolve features 22 times more accurately on a 2D mesh than on a 3D one. In practice, modern scientific computing methods don't use uniform meshes, and much research concentrates mesh points in areas of "interest", where gradients are large and the physics of the problem is rapidly changing. However, the dimensionality problem still means 3D is more difficult. Sidenote: a great deal of research also has been done on out of core methods for doing LU factorization and solves on large sparse systems of equations. These methods use sophisticated staging methods to handle problems larger than will fit into memory at one time. However, they are still much slower than problems that (mostly) fit into memory at one time.
The above estimates are for square/cubic domains. Practical problems often involve more complicated geometries for which reordering unknowns may or may not provide reasonable direct solutions. Rule of thumb for direct systems: direct methods are best on most modern workstations if order of A is n <= 12000; for larger machines n <= 20000. Those number will change dramatically as algorithms, implementations, and memory systems improve, however, so treat them as rough guidelines circa 2012. Sparse direct solvers can handle some hugh systems readily, yet choke on systems an order of magnitude smaller. It depends on the fill-in encountered, which is not predictable without first carrying out "symbolic factorization", as was done on the original coefficient matrix above. We just mimicked the operations of LU factorization without actually carrying out any floating point arithmetic, to see where new nonzeros would appear. In general that requires some complex (integer) computation and graph algorithms.
One rule does hold: You should never write a direct solver for linear systems. Use existing packages, because they will invariably do a better job than you can. Some to consider using are:
In general, nobody is really interested in solving linear systems: they want to solve PDE's, optimization problems, circuit simulation, etc. Many software packages for those problems have specialized linear solvers that take advantage of special characteristics from the application. This is particularly the case for iterative methods, and designing effective preconditioners for them.
Some other general scientific computing software to consider using are: