Memory Hierarchy and Data Locality


Generally, the limiting factor in numerical computations is not the processor's rate, but the ability to get data to and from the computational units. This happens at multiple levels: from the CPU to caches, from caches to the main memory system, from main memory to hard drives, between computers connected by a network, all the way to data collection systems requiring manual data entry. The level emphasized in P573 is between a cache and main memory, but the principles readily extend to other mismatches in data access time.


Memory sizes

Modern applications require large amounts of memory, but larger memory systems take longer to supply a datum than small ones do. For example, a datum in a small first-level cache can be delivered in a few clock cycles, while one that must come from large main memory can take a hundred or more.

This mismatch is called the "memory wall" or the processor-memory bottleneck.


Memory systems

Specialized hardware handles this data access bottleneck in several ways.

Interleaved memory banks divide memory into multiple banks, with successive words in memory spread across the banks. If, e.g., there are 4 memory banks, then while bank 1 is supplying word 1, bank i can start accessing word i, for i = 2, 3, 4. Word 5 is located back in bank 1, and so forth. If multiple memory buses are used, this can make memory look 4x faster. This is the concept underlying "quad channel memory" in PCs, and the reason memory DIMMs usually have to be installed in pairs with matching timings.

In a 4-bank system, what if every fourth word in a memory stream is required, as occurs in accessing a matrix by columns (in C) when the length of each row (the number of columns) is a multiple of 4? Then all the operands come from one bank, and performance is back down to a quarter (25%) of the needed rate. [Actually, even lower speeds are common for a 4-bank system, because a single bank typically takes longer to provide a second datum after the first is provided.]
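A minimal sketch of this mapping (word-addressed memory, 0-based word and bank numbers, and the 4-bank count are all simplifying assumptions):

  /* Sketch of 4-way interleaved bank assignment: consecutive words go to
     consecutive banks, so word w lives in bank w mod 4.  Stride-1 access
     keeps all 4 banks busy; stride-4 access (every fourth word, as in the
     matrix example above) sends every reference to the same bank. */
  #include <stdio.h>

  #define NBANKS 4

  void show_banks(const char *label, int start, int stride, int count)
  {
      printf("%s:", label);
      for (int i = 0; i < count; i++) {
          int word = start + i * stride;
          printf("  word %2d -> bank %d", word, word % NBANKS);
      }
      printf("\n");
  }

  int main(void)
  {
      show_banks("stride 1", 0, 1, 8);   /* banks 0,1,2,3,0,1,2,3: all banks busy */
      show_banks("stride 4", 0, 4, 8);   /* bank 0 every time: a bank conflict    */
      return 0;
  }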

Vector machines such as Cray, NEC, and Fujitsu supercomputers typically have many memory banks, on the order of hundreds. Avoiding bank conflicts, which occur when consecutive memory references fall in the same bank, is crucial for performance on those machines. However (unfortunately, according to many application scientists), vector machines are not as readily available in the U.S. as they once were. The Wikipedia article on the Cray X1 vector processor has more information about relatively modern Cray vector machines. (A simpler system to consider when learning about vector processing is the ancient Cray-1.)

Memory banks are now a standard, well-embedded technique, and the performance hit from bank conflicts is relatively minor on non-vector machines. Side note: all modern processors are pipelined, for memory access, instruction execution, and arithmetic operations. "Vector machine" here refers to deeply pipelined machines, sometimes with no cache.


Caches

A more generally useful technique is a memory hierarchy, which is implemented on every modern general-purpose processor, including some vector machines. The basic ideas are to put a small, fast memory (the cache) between the processor and main memory, to move data between memory and cache in contiguous blocks (cache lines), and to keep recently used data in the cache so it can be re-used without going back to main memory.


A Simplified Workstation Memory Hierarchy

[hier.gif]

The picture on modern HPC (high-performance computing) machines is much more complicated: as well as multiple levels of cache, there may be some main memory located further away from a processor than other parts of main memory. However, the ideas behind using a single cache are the same basic ones that underlie using a memory hierarchy in general: once you have gone to the expense of moving a datum to a faster part of the hierarchy, do all you can with it before it has to move back to a slower part. This is one form of data re-use.


Cache analysis

Computers have an intrinsic clock that sends out signals to all components, keeping them in synchronization. Each time interval is called a clock cycle. This clock is not the same as the one you can invoke from a C++ or Fortran code. Instead it is a heartbeat with period equal to the inverse of the gigahertz (GHz) rating of the CPU; e.g., 3.2 GHz corresponds to a clock cycle time of 1/(3.2e9) seconds, about 0.31 nanoseconds. Cache misses can occur in three ways (a small sketch illustrating the first and third kinds follows the list):
  1. The first time the CPU asks for data from a given cache block, that block cannot yet be in the cache. These cache misses are called compulsory, cold-start, or first-reference misses. They also occur after your job gets swapped out; the data in the cache when your job resumes typically belong to someone else's job, and your program takes time to reload its own data as it references it.
  2. The item was in the cache earlier but later it was replaced by other data because the cache has insufficient capacity. These cache misses are capacity misses.
  3. The item was in the cache earlier but later it was replaced by other data mapped into the same block or set. These misses are called conflict or collision misses.
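A small illustration of the first and third kinds of miss, using a made-up direct-mapped cache and address trace:

  /* Sketch of compulsory and conflict misses in a small direct-mapped cache:
     4 lines of 16 bytes each, so an address maps to line (address/16) mod 4.
     The cache size and the address trace are made up.  Alternating between
     addresses 0x000 and 0x040, which map to the same line, misses on every
     reference even though 3 of the 4 lines stay empty: the first miss to
     each address is compulsory, the rest are conflict misses. */
  #include <stdio.h>

  #define NLINES    4
  #define LINESIZE 16

  long resident[NLINES];      /* which memory block each cache line holds */
  int  valid[NLINES];

  int access_cache(long addr)
  {
      long block = addr / LINESIZE;     /* which memory block is referenced */
      int  slot  = block % NLINES;      /* direct-mapped: only one possible slot */
      if (valid[slot] && resident[slot] == block)
          return 1;                     /* hit */
      resident[slot] = block;           /* miss: load the block, evicting the old one */
      valid[slot] = 1;
      return 0;
  }

  int main(void)
  {
      long trace[] = { 0x000, 0x040, 0x000, 0x040, 0x000, 0x040 };
      int n = (int)(sizeof trace / sizeof trace[0]), hits = 0;
      for (int i = 0; i < n; i++)
          hits += access_cache(trace[i]);
      printf("%d hits out of %d references\n", hits, n);   /* prints 0 hits */
      return 0;
  }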
An example of a cache hit:

[cache.gif]

Suppose that the first three hexadecimal digits of an address form the tag for a cache line, and that the fourth and last hex digit specifies where in the line that item is found. A reference to address 0132 above is a hit, since its tag (013) appears in the cache. A reference to address 6317 in the above picture is a cache miss (its tag is not in the cache). When this happens, the line containing address 6317 must be brought in from main memory, displacing a line already in the cache.
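The lookup just described can be sketched in a few lines of C; the resident tags below are invented so that the two example addresses behave as in the text:

  /* Sketch of the tag/offset lookup for 4-hex-digit addresses: the first
     three hex digits are the tag, and the last digit is the word's position
     within a 16-word cache line.  The resident tags are made up, chosen so
     that 0x0132 hits (tag 0x013 is resident) and 0x6317 misses (0x631 is not). */
  #include <stdio.h>

  #define NLINES 4

  const unsigned resident_tags[NLINES] = { 0x013, 0x0A2, 0x555, 0x7F0 };

  int is_hit(unsigned address)
  {
      unsigned tag = address >> 4;      /* first three hex digits */
      for (int i = 0; i < NLINES; i++)
          if (resident_tags[i] == tag)
              return 1;
      return 0;
  }

  int main(void)
  {
      unsigned probes[] = { 0x0132, 0x6317 };
      for (int i = 0; i < 2; i++)
          printf("address %04X: %s\n", probes[i], is_hit(probes[i]) ? "hit" : "miss");
      return 0;
  }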

Cache line replacement policies: which line gets moved out when another one needs to move into the cache? Common policies include least recently used (LRU), first-in/first-out (FIFO), and random replacement. Which replacement policy is likely to be "best"? Since caches form their own hierarchy, with two or three levels of cache between the processor and main memory, mixed strategies are common. However, most machines tend to use LRU for their highest-level (and largest) cache.
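As a rough illustration of LRU replacement (not how any real cache is built, of course), here is a tiny fully associative cache that evicts the least recently used line; the 4-line size and the reference trace are made up:

  /* Tiny fully associative cache with LRU replacement.  Each line keeps the
     time of its last use; on a miss with no empty line, the line with the
     oldest last-use time is evicted. */
  #include <stdio.h>

  #define NLINES 4

  long tags[NLINES];
  long last_use[NLINES];
  int  valid[NLINES];
  long now = 0;

  int access_cache(long tag)
  {
      now++;
      for (int i = 0; i < NLINES; i++) {
          if (valid[i] && tags[i] == tag) {   /* hit: refresh this line's age */
              last_use[i] = now;
              return 1;
          }
      }
      /* miss: use an empty line if one exists, otherwise evict the LRU line */
      int victim = 0;
      for (int i = 0; i < NLINES; i++) {
          if (!valid[i]) { victim = i; break; }
          if (last_use[i] < last_use[victim]) victim = i;
      }
      tags[victim] = tag;
      valid[victim] = 1;
      last_use[victim] = now;
      return 0;
  }

  int main(void)
  {
      long trace[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3 };   /* made-up line tags */
      int n = (int)(sizeof trace / sizeof trace[0]), hits = 0;
      for (int i = 0; i < n; i++)
          hits += access_cache(trace[i]);
      printf("%d hits out of %d references\n", hits, n);
      return 0;
  }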


A model for memory access times

The effective access time t_eff of the memory system depends on the cache hit ratio c_h, the ratio of the number of references satisfied from the cache to the total number of memory references. The basic model for effective access time is:

t_eff = c_h * t_cache + (1 - c_h) * t_memory

Suppose for the moment that t_memory = 10 t_cache. Even though this is a linear model, notice that small changes in c_h have a big effect on overall performance. Going from a cache hit ratio of 0.99 to 0.98 to 0.89:

t_eff = 0.99 t_cache + 0.01 t_memory = 1.09 t_cache
t_eff = 0.98 t_cache + 0.02 t_memory = 1.18 t_cache
t_eff = 0.89 t_cache + 0.11 t_memory = 1.99 t_cache

A drop of 1% in the cache hit ratio causes roughly a 9% increase in the effective access time. A drop of 10% nearly doubles the effective access time. This can make a 3.2 GHz processor look like it is running at 1.6 GHz.

In reality, the ratio of t_memory to t_cache is more likely to be 50 than 10. In that case a cache hit ratio of 99% already increases the effective access time by about 50% (t_eff = 1.49 t_cache). A cache hit ratio of 67% (about what a naive implementation of a matrix-matrix multiply operation would get) gives t_eff = 17.2 t_cache, making that 3.2 GHz processor seem like one running at 186 MHz. A 200 MHz processor was the fastest one available ... in 1995.
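These numbers are easy to reproduce. A short program evaluating the model, with times measured in units of t_cache and using the hit ratios and memory/cache ratios from the text:

  /* Evaluate t_eff = c_h*t_cache + (1 - c_h)*t_memory for a few hit ratios,
     with t_cache = 1 so all times are in units of t_cache. */
  #include <stdio.h>

  int main(void)
  {
      double hit_ratios[]     = { 0.99, 0.98, 0.89, 0.67 };
      double mem_over_cache[] = { 10.0, 50.0 };

      for (int r = 0; r < 2; r++) {
          double t_memory = mem_over_cache[r];           /* in units of t_cache */
          printf("t_memory = %.0f t_cache:\n", t_memory);
          for (int i = 0; i < 4; i++) {
              double c_h   = hit_ratios[i];
              double t_eff = c_h * 1.0 + (1.0 - c_h) * t_memory;
              printf("  c_h = %.2f  ->  t_eff = %5.2f t_cache\n", c_h, t_eff);
          }
      }
      return 0;
  }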


More about cache policies

Cache associativity: Where in the cache does a line go when it is moved to the cache?

Update policy: When does a datum get moved from cache back to main memory? The two usual choices are write-through (memory is updated on every store) and write-back (memory is updated only when the modified line is evicted from the cache). Which of those two is likely to be faster? Why have both of them?

Other issues with caches:


Effects of Cache: Performance of Three Kernel Operations

Cache effects can be seen by timing vector operations for varying vector lengths. Because the time increases with the length, plotting the rate (in Mflop/s) versus vector length makes those effects more easily visible. The plot below shows the rates for the three kernel operations.
[maya2_O3.png]

The plot above allows extracting the effective cache size. What is the cache size of the machine that generated the data? Why do two of the operations maintain high performance for longer vector lengths than the other one? With modern compilers, changing optimization settings can give vastly different results, but the general rule still holds: minimize the number of memory references.
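Plots like the one above come from timing loops of roughly the following form. This is only a sketch: the kernels actually plotted are not listed here, so it uses the vector update y = y + a*x as a stand-in, and clock() is a crude timer.

  /* For each vector length n, repeat the kernel enough times to get a
     measurable interval, then report the rate in Mflop/s (2 floating-point
     operations per element per pass).  Printing y[0] keeps the compiler
     from optimizing the loop away. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
      const double a = 1.000001;
      for (long n = 1000; n <= 4000000; n *= 2) {
          double *x = malloc(n * sizeof *x);
          double *y = malloc(n * sizeof *y);
          for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

          long reps = 40000000L / n + 1;     /* keep total work roughly constant */
          clock_t start = clock();
          for (long r = 0; r < reps; r++)
              for (long i = 0; i < n; i++)
                  y[i] = y[i] + a * x[i];
          double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

          double mflops = 2.0 * (double)n * (double)reps / seconds / 1.0e6;
          printf("n = %8ld  %8.1f Mflop/s   (y[0] = %g)\n", n, mflops, y[0]);

          free(x);
          free(y);
      }
      return 0;
  }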

Second example of cache effects: matrix accesses for matrix-matrix multiply. The operation is C = C + A*B, where A, B, and C are n x n matrices contained in 2-d arrays. The standard way of writing the algorithm is

  for i = 1:n
     for j = 1:n
        temp = C(i,j)
        for k = 1:n
           temp = temp + A(i,k)*B(k,j)
        end 
        C(i,j) =  temp 
     end
  end
Trace through the memory access pattern in executing this algorithm, assuming matrices are stored column-by-column (as they are in Matlab and Fortran). C++ programmers: don't get cocky - you have the same number of cache misses in C++, they just appear in a different order.
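For comparison, here is the same i-j-k loop written in C, where 2-d arrays are stored row-by-row; the value n = 4 and the initialization are made up just so the sketch runs. The inner loop now walks A(i,:) with stride 1 but B(:,j) with stride n, the mirror image of the column-major case traced above.

  /* i-j-k matrix-matrix multiply in C (row-major storage). */
  #include <stdio.h>

  #define N 4

  int main(void)
  {
      static double A[N][N], B[N][N], C[N][N];

      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              A[i][j] = i + 1;          /* simple made-up values */
              B[i][j] = j + 1;
              C[i][j] = 0.0;
          }

      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double temp = C[i][j];
              for (int k = 0; k < N; k++)
                  temp += A[i][k] * B[k][j];   /* B[k][j]: stride-N access */
              C[i][j] = temp;
          }

      printf("C[0][0] = %g\n", C[0][0]);
      return 0;
  }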

Getting good data locality is a matter of analyzing the loads and stores needed by your program. That will be further illustrated with some simple linear algebra operations, but the principles apply well beyond linear algebra.

Historically, the relative performance of the vector operations has not changed. Here is a graph of results from an SGI Power Challenge circa 1994:

[cachesize.gif]

At the time, the Power Challenge was considered a "supercomputer", and indeed the performance rates are breathtaking: about 10% of what a workstation provided in 2011, eleven computer generations later (assuming 18 months = one generation).


Next: pipelining