Memory Hierarchy and Data Locality
Generally, the limiting factor in numerical computations is not the
processor rate, but the ability to get data to and from computational units.
This happens at multiple levels: from the CPU to caches, from caches to main
memory, from main memory to hard drives, between computers connected
by a network, all the way to data collection systems requiring manual data
entry. The level emphasized in P573 is between a cache and main memory, but
the principles are readily extended to other mismatches in data access time.
Memory sizes
Modern applications require large amounts of memory, but larger memory
systems take longer to supply a datum than small ones
do. As an example, suppose
-
A machine runs at 3.4 Ghertz and has a theoretical peak computation
rate of 6.8 Gflops/sec. [The extra factor of 2 comes from the common case of a
floating point unit having a "fused multiply-add" unit, capable
of carrying out one add and a multiply in one clock cycle.]
-
The "vector update operation", otherwise known as saxpy (or daxpy), is
usually implemented via:
for k = 1:n
    y(k) = y(k) + a*x(k)
end for
This has to read two operands x(k) and y(k) per loop iteration, and write one
operand y(k). The scalar a will be loaded once before the loop starts
and generally held in a register for the lifetime of the loop.
With three memory accesses for every two floating point operations, the amount of data
needed to maintain the full theoretical computation rate is
(3/2) * 6.8 = 10.2 Gwords/sec = 81.6 Gbytes per second.
A memory system might have a bus width of 8 bytes (conveniently, the
size of a double precision word), and a typical bus speed would be 1667 Mhertz.
That means the memory system can supply one double precision word every 1/1667 microsecond,
or 1.667 words every nanosecond.
The daxpy operation above requires 10.2 Gwords/sec, but the memory system supplies data
at a rate of only 1.667 Gwords/sec, about 16% of the needed rate.
These particular numbers will change with time,
but the underlying, fundamental mismatch likely will still remain.
This mismatch is called the "memory wall" or the processor-memory bottleneck.
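For concreteness, here is the same loop written as a plain C function (a minimal sketch; the function name and interface are illustrative, not a library routine):
/* daxpy: y <- y + a*x, the kernel analyzed above.
   Each iteration does 2 flops (one multiply and one add, possibly fused)
   and makes 3 memory references: load x[k], load y[k], store y[k].
   At 6.8 Gflops/sec that requires about 10.2 Gwords/sec of memory traffic. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int k = 0; k < n; k++)
        y[k] = y[k] + a * x[k];
}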
Memory systems
Specialized hardware handles this data access bottleneck
in several ways.
Multiple interleaved memory banks divide the memory
into banks, and successive words in memory are spread across the banks. If,
e.g., there are 4 memory banks, then while memory bank 1 is supplying word 1,
memory bank i can start accessing word i, for i = 2, 3, 4. Word 5 is located
back in bank 1, and so forth. If multiple memory buses are used, this can make
memory look 4x faster.
This is the concept underlying "quad channel memory" in PCs, and the reason
why memory DIMMs usually have to be installed in pairs with the same timings.
In a 4-bank system,
what if only every fourth word in a memory stream is required, as occurs in accessing a
matrix by columns in C (row-major storage), where the number
of columns is a multiple of 4? Then all the operands come from one bank, and
performance is back down to the single-bank rate computed above. [Actually, even lower speeds
are common, because a single bank typically takes longer to provide a
second datum after the first is provided.]
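A rough sketch of the bank mapping, assuming word-interleaved addressing so that word w lives in bank w mod 4 (a simplification of real hardware; the array size is made up):
/* Word-interleaved banks: word w lives in bank (w % NBANKS).
   Walking down a column of an n x n row-major array touches words
   j, j+n, j+2n, ...; if n is a multiple of NBANKS, every reference
   lands in the same bank. */
#include <stdio.h>

#define NBANKS 4

int main(void)
{
    int n = 8;                       /* column stride: a multiple of NBANKS */
    for (int i = 0; i < 6; i++) {
        long word = (long)i * n;     /* word address of A[i][0] */
        printf("A[%d][0] -> word %ld -> bank %ld\n", i, word, word % NBANKS);
    }
    return 0;
}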
Vector machines such as Cray, NEC, and Fujitsu supercomputers typically
have many memory banks,
on the order of hundreds. Avoiding bank conflicts,
which occur when consecutive memory references fall in the same memory bank,
is crucial for performance on those machines. However (unfortunately,
according to many application scientists), vector machines are not as readily
available in the U.S. as they once were. The Wikipedia article on the
Cray X1
vector processor is a good source for more information about relatively modern Cray vector machines.
(A simpler system to consider when learning about vector processing is the
ancient Cray 1 system.)
Using memory banks is now well-embedded, and performance hits from bank
conflicts are relatively minimal on non-vector machines. Sidenote: all modern processors
are pipelined, for memory access, instruction execution, and operations.
"Vector machine" here refers to deeply pipelined machines, sometimes with
no cache.
Caches:
A more generally useful technique is to use a memory hierarchy,
which is implemented on every modern general purpose processor, including
some vector machines. The ideas are illustrated by the following diagram.
[Figure: A Simplified Workstation Memory Hierarchy]
The picture on modern HPC (high-performance computing) machines is
much more complicated: as well as multiple levels of cache there
may be some main memory located further away than other parts of
main memory. However, the
ideas behind using a single cache are the same basic ones that underlie
using a memory hierarchy in general: once you have gone to the expense
of moving a datum to a faster part of the hierarchy, do all you can with
it before it has to move back to a slower part. This is one form of
data re-use.
Cache analysis
Computers have an intrinsic clock that sends out signals to all components,
keeping them in synchronization. Each time interval is called a clock
cycle. This clock is not the same as the one you can invoke from a
C++ or Fortran code. Instead it is a heartbeat whose period is the inverse of
the Ghertz rating of the CPU; e.g., 3.2 Ghertz corresponds to a clock
cycle time of 1/3.2 = 0.3125 nanoseconds.
-
When a memory reference is to datum already in cache, the access is called
a cache hit.
-
When a memory reference is not in cache, the access is called a
cache miss.
-
Cache hits can supply a datum in 1-2 clock cycles; cache misses typically
take 10-50 cycles.
-
To take advantage of any spatial data locality,
instead of bringing in a single word the hardware brings in a line of words,
usually 4-32 words. Suppose the cache line size is 4 words; referencing any of memory addresses
0-3 will cause all of them to be brought into cache. (A small sketch of the
effect of access stride on cache line traffic appears after this list.)
Slight digression:
memory addresses are indexed from 0, while mathematical entities usually
start from 1. The indexing scheme in a code depends on the programming language:
C/C++ and Java index from 0, Matlab starts with 1, while Fortran can index from
any integer, including negative integers. For mathematical operations and arrays,
these notes typically use 1-based indexing, which matches most algorithm statements;
the exceptions are discussions of computer hardware and some signal processing
algorithms like FFTs (Fast Fourier Transforms).
-
Where does a line go in cache? Some hardware restricts locations in the
cache to which a line can be written. Also,
when does it get moved out? Since the cache is much smaller than main
memory, eventually it will get full and further memory accesses will
require moving a line out of the cache to make room for the new
line being referenced.
-
Various schemes exist for where in cache the line goes (direct mapped,
set associative, fully associative). This is not critically important in
scientific computing, generally because a poor policy simply makes the
cache appear smaller than it really is.
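Here is the small sketch of stride effects mentioned in the cache line item above, assuming a line size of 4 doubles (the function and array are illustrative only):
/* Spatial locality sketch.  With a cache line of 4 doubles, the
   stride-1 sweep takes about one cache miss per 4 elements, while the
   stride-4 sweep can miss on every element, moving roughly 4x as many
   cache lines for the same amount of arithmetic. */
double sum_strided(const double *x, int m, int stride)
{
    double s = 0.0;
    for (int k = 0; k < m; k++)
        s += x[k * stride];          /* x must hold at least m*stride elements */
    return s;
}
/* compare: sum_strided(x, m, 1) versus sum_strided(x, m, 4) */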
Cache misses can occur in three ways:
-
This is the first time the CPU has asked for any data from that block of memory.
These cache misses are called compulsory, cold-start, or
first reference misses. They can also occur when your job gets swapped out;
the data in the cache when your job resumes typically belong to someone
else's job, and your program will take time to reload its own data via
fresh references.
-
The item was in the cache earlier but later it was replaced by other data
because the cache has insufficient capacity. These cache misses are capacity
misses.
-
The item was in the cache earlier but later it was replaced by other data
mapped into the same block or set. These misses are called conflict
or collision misses.
An example of a cache hit:
Suppose that the first three hexadecimal digits of an address form the
tag for a cache line, and that the fourth and last hex digit specifies
where in the line that item is found. A reference to address 0132
is then a hit if its tag (013) appears in the cache. A reference to address
6317 is a cache miss if its tag (631) is not in the cache. (This tag/offset
split is sketched in code after the steps below.) When a miss happens,
-
The cache makes a call to the memory system to provide the line containing the datum.
-
While waiting for the data to arrive, the cache moves a line out to make room.
If the ousted line has been modified since it was loaded, it is written back
to memory. Otherwise, it is simply overwritten by the newly arriving line.
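The tag/offset split used in that example can be written out in a few lines of C; the 4-hex-digit addresses and 16-word lines are assumptions of the example above, not of any real machine:
/* Split a 4-hex-digit word address into a tag and a within-line offset:
   the top three hex digits select the line (tag), the last digit selects
   the word within the line (16 words per line). */
#include <stdio.h>

int main(void)
{
    unsigned addrs[] = { 0x0132, 0x6317 };
    for (int i = 0; i < 2; i++) {
        unsigned tag    = addrs[i] >> 4;   /* first three hex digits */
        unsigned offset = addrs[i] & 0xF;  /* last hex digit          */
        printf("address %04X -> tag %03X, word %u in the line\n",
               addrs[i], tag, offset);
    }
    return 0;
}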
Cache line replacement policies: Which line gets moved out when
another one needs to move into the cache? Common policies are
-
Random:
just grab one and throw it back to memory
-
FIFO (First In First Out):
move out the one that was moved into the cache earliest.
-
LRU (Least Recently Used):
move out the one which has not been used in the longest time.
Which replacement policy is likely to be "best"? Since caches
form their own hierarchy with two or three levels of caches between the
processor and main memory, mixed strategies are common.
However, most machines tend to use LRU for their highest-level
(and largest) cache.
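As a rough sketch of how LRU behaves, here is a tiny simulation of a fully associative cache with four line slots; the reference trace is made up purely for illustration:
/* Tiny fully associative cache simulator with LRU replacement.
   slots[] holds line tags; stamp[] holds the time of last use.
   On a miss with a full cache, the least recently used slot is evicted. */
#include <stdio.h>

#define SLOTS 4

int main(void)
{
    long slots[SLOTS], stamp[SLOTS];
    long trace[] = { 1, 2, 3, 4, 1, 5, 2, 1 };   /* line tags referenced */
    int used = 0, hits = 0;
    int n = (int)(sizeof trace / sizeof trace[0]);

    for (int t = 0; t < n; t++) {
        int found = -1;
        for (int s = 0; s < used; s++)
            if (slots[s] == trace[t]) found = s;
        if (found >= 0) {               /* cache hit */
            hits++;
            stamp[found] = t;
        } else if (used < SLOTS) {      /* cold-start miss: a free slot exists */
            slots[used] = trace[t];
            stamp[used] = t;
            used++;
        } else {                        /* miss: evict the least recently used */
            int lru = 0;
            for (int s = 1; s < SLOTS; s++)
                if (stamp[s] < stamp[lru]) lru = s;
            slots[lru] = trace[t];
            stamp[lru] = t;
        }
    }
    printf("%d hits out of %d references\n", hits, n);
    return 0;
}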
A model for memory access times
The effective access time t_eff to memory depends on the cache hit ratio
c_h, which is the ratio of the number of references to data in cache to
the total number of memory references. The basic model for effective access time is:
    t_eff = c_h * t_cache + (1 - c_h) * t_memory
Typically, t_memory = 10 * t_cache. Even though this is a
linear model, notice that small changes
in c_h can have a big effect on overall performance. When
going from a cache hit ratio of 0.99 to 0.98 to 0.89:
    t_eff = 0.99 t_cache + 0.01 t_memory = 1.09 t_cache
    t_eff = 0.98 t_cache + 0.02 t_memory = 1.18 t_cache
    t_eff = 0.89 t_cache + 0.11 t_memory = 1.99 t_cache
A drop of 1% in the cache hit ratio increases the effective access time by about 8%.
A drop of 10% nearly doubles the effective access time. This can
make a 3.2 Ghertz processor look like it is running at 1.6 Ghertz.
In reality, the ratio of t_memory to t_cache is more likely to
be 50, not 10. In that case even a cache hit ratio of 99% increases the effective
access time by about 50%. A cache hit ratio of 67% (about what a naive implementation of a
matrix-matrix multiply operation would get) leads to
    t_eff = 17.2 t_cache,
making that 3.2 Ghertz processor seem like one running at 186 Mhertz. A 200 Mhertz
processor is the fastest one available ... in 1995.
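These numbers are easy to check; here is a minimal C snippet that evaluates the model for the hit ratios and t_memory/t_cache ratios used above:
/* Effective access time model, t_eff = c_h * t_cache + (1 - c_h) * t_memory,
   reported in units of t_cache. */
#include <stdio.h>

int main(void)
{
    double hit_ratio[] = { 0.99, 0.98, 0.89, 0.67 };
    double mem_over_cache[] = { 10.0, 50.0 };

    for (int r = 0; r < 2; r++)
        for (int i = 0; i < 4; i++) {
            double ch   = hit_ratio[i];
            double teff = ch * 1.0 + (1.0 - ch) * mem_over_cache[r];
            printf("t_memory = %2.0f t_cache, c_h = %.2f: t_eff = %5.2f t_cache\n",
                   mem_over_cache[r], ch, teff);
        }
    return 0;
}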
More about cache policies
Cache associativity: Where in the cache does a line go when it is moved to
the cache?
-
Direct mapped:
every line has only one slot in cache it can go to. In this
case, the replacement policy is irrelevant.
-
Fully associative:
every line can go into any slot.
-
Set associative:
a "four-way" set associative cache, e.g., is divided into sets of four
lines each; a memory line maps to one set and can go into any of those four slots.
Update policy: When does a datum get moved from cache to memory?
-
Write-back:
when its line is replaced
-
Write-through:
every time it is changed
Which of those two is likely to be faster? Why have both of them?
Other issues with caches:
-
Multilevel caches.
Several systems, e.g., Intel's Core i7, have
three or four levels of cache. Which one (primary, secondary, or tertiary)
determines performance?
That depends on their relative sizes and replacement policies. My claim is
that typically it is the first one that is outside the integrated chip
that the CPU is on - but I'm willing to be proven wrong. Proving me wrong
means via code performance tests, not vendor's documentation.
-
Separate caches for instructions and data.
Most systems separate
these two streams.
Since the program instructions are read from memory
just like any other data, the same tricks can be applied there. Interestingly,
most scientific and engineering applications have little instruction
traffic compared to the regular data traffic.
-
Interaction with the I/O system
- What if a virtual memory page is replaced in
memory, but it has an updated datum in the cache? Then the relevant values
(the ones on that page) in cache must
be "flushed", written back before the replacement occurs.
-
Virtual memory
works similarly to caches. However, the ratio of access times
is ~500, not 10. A page fault is the equivalent of a cache
miss in the virtual memory system.
Effects of Cache: Performance of Three Kernel Operations
Cache effects can be seen by timing vector operations for
varying lengths of vectors. Because time will increase as the length
increases, plotting the rate (in Mflops/second) versus vector size
makes those effects more easily visible. The following plots are
for the operations
-
saxpy: y = y + alpha*x where x and y are vectors of length n and alpha
is a scalar.
-
dotpxy: alpha = x^T y, where x and y are vectors of length n.
-
dotpxx: alpha = x^T x, where x is a vector of length n.
The plot above allows extracting the effective cache size.
What is the cache size of the machine that generated the data?
Why do two of the
operations have high performance for longer vector lengths than
the other one?
With modern compilers,
changing optimization settings can give vastly different results,
but the general rule still holds: minimize the number of memory references.
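Here is a rough sketch of how such a plot can be generated for the saxpy kernel (the vector lengths, repetition counts, and use of the standard clock() timer are illustrative; careful measurements also need attention to compiler optimization settings, as noted above):
/* Time the saxpy kernel y = y + a*x over a range of vector lengths and
   report the rate in Mflops/sec, the quantity plotted above.  Rates drop
   once the vectors no longer fit in cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    double a = 2.5, csum = 0.0;
    for (int n = 1000; n <= 4000000; n *= 2) {
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (int k = 0; k < n; k++) { x[k] = 1.0; y[k] = 2.0; }

        int reps = 200000000 / n + 1;        /* keep total work roughly constant */
        double t0 = (double)clock() / CLOCKS_PER_SEC;
        for (int r = 0; r < reps; r++)
            for (int k = 0; k < n; k++)
                y[k] = y[k] + a * x[k];
        double t1 = (double)clock() / CLOCKS_PER_SEC;

        printf("n = %8d   %8.1f Mflops/sec\n", n,
               2.0 * n * reps / (t1 - t0) / 1.0e6);
        csum += y[n / 2];                    /* keep the compiler from discarding the loop */
        free(x); free(y);
    }
    printf("checksum = %g\n", csum);
    return 0;
}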
Second example of cache effects: matrix accesses for matrix-matrix
multiply. The operation is C = C + A*B, where A, B, and C are n x n matrices stored
in 2-d arrays.
The standard way of writing the algorithm is
for i = 1:n
    for j = 1:n
        temp = C(i,j)
        for k = 1:n
            temp = temp + A(i,k)*B(k,j)
        end
        C(i,j) = temp
    end
end
Trace through the memory access pattern in executing this algorithm, assuming
matrices are stored column-by-column (as they are in Matlab and Fortran).
C++ programmers: don't get cocky - you have the same number of cache misses
in C++, they just appear in a different order.
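For the C/C++ case, here is the same triple loop with row-major storage (a sketch using C99 variable-length array parameters, not code from the course); the comments mark which operand is the strided one in this layout:
/* C = C + A*B with row-major storage.  In the inner k loop,
   A[i][k] walks along a row (stride 1, good cache line reuse) while
   B[k][j] walks down a column (stride n, a new cache line per step). */
void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double temp = C[i][j];
            for (int k = 0; k < n; k++)
                temp += A[i][k] * B[k][j];
            C[i][j] = temp;
        }
}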
Getting good data locality is a matter of analyzing the loads and stores
needed by your program. That will be further illustrated with some simple linear
algebra operations, but the principles are applicable well beyond linear algebra.
Historically the relative performances of the vector operations have not changed.
Here's a graph of results from an SGI Power Challenge circa 1994:
At the time, the Power Challenge was considered a "supercomputer", and indeed
the performance rates are breathtaking: about 10% of what a workstation provided
in 2011, eleven computer generations later (assuming 18 months = one generation).
Next: pipelining
- Started: 17 Aug 2011
- Last Modified: Mon 04 Nov 2019, 06:55 AM