Memory Hierarchy and Data Locality


Generally, the limiting factor in numerical computations is not the processor rate but the ability to move data to and from the memory system. This is because modern applications need large memory systems, and a larger memory system takes longer to supply a datum than a small one does. This mismatch is known as the "memory wall" or processor-memory bottleneck.

Some specialized hardware handles this memory bottleneck in several ways.

Multiple interleaved memory banks divide the memory into banks, with successive words in memory spread across the banks. If, for example, there are 4 memory banks, then while memory bank 1 is supplying word 1, memory bank i can start accessing word i, for i = 2, 3, 4. Word 5 is located in bank 1, and so forth. If multiple memory buses are used, this could make memory look 4x faster. Problem: what if we need to access every fourth word (or use any stride that is a multiple of 4 words), as occurs in accessing a matrix by columns in C when the number of columns is a multiple of 4? Then all the operands come from one bank, and you are back down to 25% of the needed rate. Slight digression: just about all laptop and desktop systems you can buy do have multiple memory banks, but all of them use a single bus, nowadays driven at 800 MHz rather than 133 MHz. So the situation is not improved much, since it is the bus that is the bottleneck.
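The bank conflict described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: words and banks are numbered from 1 as in the text, interleaving is a plain modulo mapping, and the helper names bank_of and banks_touched are made up for this example.

```python
def bank_of(word, num_banks=4):
    """Bank (1-based) holding a given word (1-based) under simple interleaving."""
    return (word - 1) % num_banks + 1


def banks_touched(stride, count=8, num_banks=4):
    """Banks visited when reading `count` words, starting at word 1, with a fixed stride."""
    return [bank_of(1 + k * stride, num_banks) for k in range(count)]


unit_stride = banks_touched(stride=1)  # consecutive words rotate through all banks
stride_four = banks_touched(stride=4)  # every fourth word lands in the same bank

print("stride 1:", unit_stride)
print("stride 4:", stride_four)
```

With unit stride the references cycle through all four banks, so the banks can overlap their work. With a stride of 4 every reference goes to bank 1, and the other three banks sit idle.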

Vector machines such as the Cray, NEC and Fujitsu supercomputers typically have many memory banks, on the order of hundreds. Avoiding bank conflicts, where consecutive memory references fall in the same memory bank, is crucial for performance on those machines. However (unfortunately, according to many application scientists) vector machines are not readily available in the U.S., since the only domestic manufacturer (Cray) has just recently re-entered the vector processor market with the Cray X1. [For amusement, you may also want to look at the ancient history Cray 1 system.] Using memory banks is now well-embedded, and performance hits from bank conflicts are minimal on non-vector machines. Sidenote: all modern processors are pipelined, for memory access, instruction execution, and operations. "Vector machine" here refers to deeply pipelined machines, sometimes with no cache.


Caches:

A more generally useful technique is to use a memory hierarchy, which is done on every modern general purpose processor (including some vector machines).

Typical Workstation Memory Hierarchy

[hier.gif]

The picture on modern HPC (high-performance computing) machines is much more complicated: as well as multiple levels of cache, some parts of main memory may be located further away than others. However, the ideas behind using a single cache are the same basic ones that underlie using the memory hierarchy in general: once you have gone to the expense of moving a datum to a faster part of the hierarchy, do all you can with it before it has to move back to a slower part. This is one form of data re-use.


Cache Modeling

Computers have an intrinsic clock that sends out signals to all components, keeping them synchronized. Each time interval is called a clock cycle.

Cache misses can occur in three ways:
  1. This is the first time the CPU has asked for any data from its cache block. These cache misses are called compulsory, cold-start, or first reference misses. They can also occur when your job gets swapped out; the data in the cache on your return then typically belongs to someone else's job, and your program will take time to reload its own data as it references it.
  2. The item was in the cache earlier but later it was replaced by other data because the cache has insufficient capacity. These cache misses are capacity misses.
  3. The item was in the cache earlier but later it was replaced by other data mapped into the same block or set. These misses are called conflict or collision misses.
An example of a cache hit:

[cache.gif]

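Suppose that the first three hexadecimal digits of an address form the tag for a cache line, and that the fourth and last hex digit specifies where in the line that item is found. A reference to address 0132 above is a hit since its tag (013) appears. A reference to address 6317 in the above picture is a cache miss (tag not in cache). When this happens, the line containing the requested address must be brought in from a slower level of the hierarchy, and if the cache is full, a line already in it must be moved out to make room.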

Cache line replacement policies: Which line gets moved out when another one needs to move into the cache? Common policies are least recently used (LRU), random, and first-in first-out (FIFO). Which replacement policy is likely to be "best"? Since caches form their own hierarchy, with two or three levels of cache between the processor and main memory, it is common to have mixed strategies within them. However, most machines tend to use LRU for their highest-level (and largest) cache.
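To make the LRU (least recently used) policy concrete, here is a minimal sketch of a toy cache that stores only tags. It assumes the addressing scheme used in the example above (the last hex digit is the offset within a line, the rest is the tag) and a fully associative cache of just two lines. The class name LRUCache and the address trace are invented for illustration and do not describe any particular machine.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache that tracks line tags and evicts the least recently used."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # tag -> None, ordered from oldest to newest use

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        tag = address >> 4  # drop the last hex digit (offset within the line)
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = None
        return False


cache = LRUCache(num_lines=2)
trace = [0x0132, 0x0138, 0x6317, 0x0132, 0x2220, 0x6317]
results = [cache.access(address) for address in trace]
print(results)
```

In this trace the second reference hits because 0138 shares tag 013 with 0132. The final reference to 6317 misses because its line was the least recently used when 2220 arrived, so it had already been evicted.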

Effective access time to memory depends on the cache hit ratio ch, which is the ratio of number of references to data in cache to the total number of memory references.

teff = ch * tcache + (1-ch) * tmain

Typically, tmain = 10*tcache. Even though this is a linear model, notice that small changes in ch can have a big effect on overall performance. Consider going from 0.99 to 0.98 to 0.89:
teff = 0.99 tcache + 0.01 tmain = 1.09 tcache
teff = 0.98 tcache + 0.02 tmain = 1.18 tcache
teff = 0.89 tcache + 0.11 tmain = 1.99 tcache
A drop of one percentage point in the cache hit ratio raises the effective access time by about 8% (from 1.09 to 1.18 tcache). A drop of 10 percentage points nearly doubles the effective access time (from 1.09 to 1.99 tcache). This can make a 320 MHz processor look like it is running at 160 MHz.
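The arithmetic above is easy to reproduce. The sketch below evaluates the linear model with tcache normalized to 1 and tmain = 10*tcache, as assumed in the text. The function name effective_access_time is just a label chosen for this example.

```python
def effective_access_time(hit_ratio, t_cache=1.0, t_main=10.0):
    """Linear model: teff = ch * tcache + (1 - ch) * tmain."""
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_main


for ch in (0.99, 0.98, 0.89):
    print(f"ch = {ch:.2f}   teff = {effective_access_time(ch):.2f} tcache")
```

Changing t_main to a larger multiple of t_cache shows how much more sensitive teff becomes to the hit ratio as the gap between cache and main memory widens.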

Cache associativity: Where does a line go when it is moved to the cache?
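One common answer to this question is direct mapping, where each memory line can go to exactly one slot in the cache. The sketch below shows how two addresses can then collide and cause the conflict misses described earlier. The line size and number of lines are assumed values chosen for illustration, and direct_mapped_slot is a hypothetical helper.

```python
LINE_SIZE = 16   # bytes per line, matching a one-hex-digit offset (assumed)
NUM_LINES = 256  # number of lines in the toy cache (assumed)


def direct_mapped_slot(address):
    """Slot a direct-mapped cache would use for the line containing `address`."""
    line_number = address // LINE_SIZE
    return line_number % NUM_LINES


a = 0x0132
b = a + LINE_SIZE * NUM_LINES  # exactly one cache size further on
c = a + LINE_SIZE              # the next line in memory

for address in (a, b, c):
    print(f"address {address:#06x} -> slot {direct_mapped_slot(address)}")
```

Addresses a and b map to the same slot, so alternating between them evicts one to load the other even though the rest of the cache is empty. A set-associative cache reduces this effect by letting each line choose among several slots.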

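Update policy: When does a datum move from cache to memory? The two usual choices are write-through (every store to the cache is also sent to memory immediately) and write-back (a modified line is written to memory only when it is moved out of the cache). Which of those two is likely to be faster? Why have the other one?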

Other issues with caches:


Effects of Cache: Performance of Three Kernel Operations

Consider some examples of cache effects. First, Mflop/s versus vector size for the operations
[maya2_mkl.png]

What is the size of the cache in the test plotted above? Why does one of the operations seem to have high performance for longer vector lengths than the other two?

The above plot allows us to extract the effective cache size. With modern compilers, changing optimization settings can give vastly different results, but the general rule still holds: minimize the number of memory references.

Second example of cache effects: matrix accesses for matrix-matrix multiply. The operation is C = C + A*B, where A, B, and C are n x n matrices. The standard way of writing the algorithm is

  for i = 1:n
     for j = 1:n
        temp = C(i,j)
        for k = 1:n
           temp = temp + A(i,k)*B(k,j)
        end 
        C(i,j) =  temp 
     end
  end
Trace through the memory access pattern in executing this algorithm, assuming matrices are stored column-by-column (as they are in Matlab and Fortran). C programmers: don't get cocky - you have the same number of cache misses in C ... they just appear in a different order!
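As a starting point for that trace, the sketch below computes how far apart in memory successive references to A(i,k) and B(k,j) are as k advances in the innermost loop. It assumes column-by-column storage with 1-based indices, as in the pseudocode. The function names are invented for this example, and n = 1000 is an arbitrary choice.

```python
def column_major_offset(i, j, n):
    """Offset (in elements) of entry (i, j), 1-based, in an n x n column-major array."""
    return (i - 1) + (j - 1) * n


def inner_loop_strides(n, i=1, j=1):
    """Strides between successive A(i,k) and B(k,j) references as k goes from 1 to n."""
    a_offsets = [column_major_offset(i, k, n) for k in range(1, n + 1)]
    b_offsets = [column_major_offset(k, j, n) for k in range(1, n + 1)]
    a_strides = {later - earlier for earlier, later in zip(a_offsets, a_offsets[1:])}
    b_strides = {later - earlier for earlier, later in zip(b_offsets, b_offsets[1:])}
    return a_strides, b_strides


a_strides, b_strides = inner_loop_strides(n=1000)
print("stride through A:", a_strides)
print("stride through B:", b_strides)
```

B is read with unit stride, so each cache line brought in is fully used. A is read with stride n, so for large n each reference to A is likely to touch a different cache line. With row-by-row storage, as in C, the roles swap: A has unit stride and B has stride n.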

Getting good data locality is a matter of analyzing the loads and stores needed by your program. That will be illustrated with some simple linear algebra operations later.

