Memory Hierarchy and Data Locality
Generally, the limiting factor in numerical computations is not the
processor rate, but the ability to get data to and from the memory system.
This in turn is because large memory systems are needed for modern applications,
but larger memory systems require longer to supply a datum than small ones
do. Example:
-
A machine runs at 2 GHz and has a theoretical peak computation
rate of 4 Gflops (the extra factor of 2 comes from the common case of a
floating point unit carrying out an add and a multiply in one clock cycle).
-
The vector update operation, otherwise known as saxpy (or daxpy), is
usually implemented via:
    for k = 1:n
        y(k) = y(k) + a*x(k)
    end
Each loop iteration has to read two operands, x(k) and y(k), and write one
operand, y(k).
With three memory accesses per two operations, the amount of data needed
to be supplied to maintain the full theoretical computation rate is
6 Gwords/sec = 48 Gbytes per second.
In 2007, a fast memory system has a "bus width" of 8 bytes (conveniently, the
size of a double precision word), and a fast bus speed would be 800 to 1667 MHz.
Suppose it runs at 1000 MHz.
That means the memory system can supply one double every 1/1000 microsecond,
or a word every nanosecond.
[These particular numbers will continue to grow, but the issue likely
will still remain.]
So the operation above needs 6 Gwords/sec, but the memory only supplies data at a rate
of 1 Gword/sec, 17% of the needed rate.
This mismatch is described as the "memory wall" or processor-memory bottleneck.
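The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope calculation: the function name `required_words_per_sec` and the variable names are made up for illustration, and the numbers are the example figures from the text, not measurements.

```python
# Back-of-the-envelope check of the saxpy bandwidth example.
# All values are the illustrative figures from the text, not measurements.

BYTES_PER_WORD = 8  # one double precision word


def required_words_per_sec(peak_flops, accesses_per_iter=3, flops_per_iter=2):
    """Words/sec the memory must supply to keep the processor at peak rate."""
    return peak_flops * accesses_per_iter / flops_per_iter


peak_flops = 4e9          # 4 Gflops theoretical peak
bus_words_per_sec = 1e9   # 1000 MHz bus delivering one 8-byte word per cycle

needed = required_words_per_sec(peak_flops)
fraction = bus_words_per_sec / needed

print(f"needed:   {needed / 1e9:.0f} Gwords/sec "
      f"= {needed * BYTES_PER_WORD / 1e9:.0f} Gbytes/sec")
print(f"supplied: {bus_words_per_sec / 1e9:.0f} Gword/sec "
      f"({fraction:.0%} of the needed rate)")
```

Changing `accesses_per_iter` and `flops_per_iter` gives the same estimate for other loops.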
Some specialized hardware handles this memory bottleneck in several ways.
Multiple interleaved memory banks split the memory
into banks, and successive words in memory are spread across the banks. If,
e.g., there are 4 memory banks, then while memory bank 1 is supplying word 1,
memory bank i can start accessing word i, for i = 2, 3, 4. Word 5 is located
in bank 1, and so forth. If multiple memory buses are used, this could make
memory look 4x faster.
Problem: what if we need to access memory with a stride that is a multiple
of 4, as occurs in accessing a matrix by columns in C (which stores matrices
row by row) when the number of columns is a multiple of 4? Then all the
operands come from one bank, and
you are back down to 17% of the needed rate.
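As a rough illustration of interleaving and the stride problem, the following Python sketch maps word numbers to banks. It assumes the simple round-robin layout described above (word 1 in bank 1, word 5 back in bank 1); the helper names `bank_of` and `banks_touched` are hypothetical.

```python
def bank_of(word, num_banks=4):
    """Bank holding a given word (words and banks are numbered from 1)."""
    return (word - 1) % num_banks + 1


def banks_touched(first_word, stride, count, num_banks=4):
    """Set of banks hit by `count` accesses starting at `first_word`."""
    return {bank_of(first_word + i * stride, num_banks) for i in range(count)}


print(sorted(banks_touched(1, 1, 8)))  # stride 1 -> [1, 2, 3, 4]
print(sorted(banks_touched(1, 4, 8)))  # stride 4 -> [1]
```

With unit stride the accesses rotate through all four banks; with a stride equal to the number of banks, every access lands in the same bank.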
Slight digression:
Just about all systems you buy for laptop or desktop do have multiple memory
banks, but all of them use a single bus, nowadays driven at 133 MHz
rather than 800 MHz. So the situation is not improved much, since it is
the bus that is the bottleneck.
Vector machines such as the Cray, NEC and Fujitsu supercomputers typically
have many memory banks, on the order of hundreds. Avoiding bank conflicts,
where consecutive memory references fall in the same memory bank,
is crucial for performance on those machines. However (unfortunately
according to many application scientists) vector machines are not readily
available in the U.S., since the only U.S. manufacturer (Cray) has just recently
re-entered the vector processor market with the Cray X1.
[For amusement, you may also want to look at the
ancient history Cray 1 system.]
Using memory banks is now well-embedded, and performance hits from bank
conflicts are minimal on non-vector machines. Sidenote: all modern processors
are pipelined, for memory access, instruction execution, and operations.
"Vector machine" here refers to deeply pipelined machines, sometimes with
no cache.
Caches:
A more generally useful technique is to use a memory hierarchy,
which is done on every modern general purpose processor (including
some vector machines).
[Figure: Typical Workstation Memory Hierarchy]
The picture on modern HPC (high-performance computing) machines is
much more complicated: as well as multiple levels of cache there
may be some main memory located further away than other parts of
main memory. However, the
ideas behind using a single cache are the same basic ones that underlie
using the memory hierarchy in general: once you have gone to the expense
of moving a datum to a faster part of the hierarchy, do all you can with
it before it has to move back to a slower part. This is one form of
data re-use.
Cache Modeling
Computers have an intrinsic clock that sends out signals to all components,
keeping them in synchronization. Each time interval is called a clock
cycle.
-
When a memory reference is to a datum already in cache, the access is called
a cache hit.
-
When a memory reference is to a datum not in cache, the access is called a
cache miss.
-
Cache hits can supply a datum in 1-2 clock cycles; cache misses typically
take 10-50 cycles.
-
To exploit spatial data locality,
instead of bringing in a single word, the cache brings in a line of words,
usually 4-32 words. Suppose the cache line size is 4 words; referencing any of memory addresses
0-3 will cause all of them to be brought into cache. [Notice that for
memory addresses, things are indexed from 0. For mathematical entities, they
start from 1. The indexing in a code depends on the programming language:
C/C++ or Java index from 0, while Fortran can index from any integer,
including negative numbers. Typically I'll use 1-based indexing for
consistency, except for computer hardware.]
-
Where does a line go in cache? Some hardware restricts locations in the
cache to which a line can be written. Also,
when does it get moved out? Since the cache is much smaller than main
memory, eventually it will get full and further memory accesses will
require moving a line out of the cache to make room for the new
line being referenced.
-
Various schemes exist for where in cache the line goes (direct mapped,
set associative, fully associative). This is not critically important in
scientific computing, generally because a poor policy simply makes the
cache appear smaller than it really is.
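The effect of cache lines on spatial locality can be sketched in Python. The example assumes a 4-word line as above, and an idealized cache that starts empty and never evicts anything, so only first-touch line fetches are counted. The names `line_of` and `count_line_fetches` are invented for this illustration.

```python
LINE_SIZE = 4  # words per cache line, as in the example above


def line_of(address, line_size=LINE_SIZE):
    """Index of the cache line containing a 0-based word address."""
    return address // line_size


def count_line_fetches(addresses, line_size=LINE_SIZE):
    """Line fetches for an address sequence, assuming an idealized cache
    that starts empty and is large enough never to evict a line."""
    loaded = set()
    fetches = 0
    for addr in addresses:
        line = line_of(addr, line_size)
        if line not in loaded:
            loaded.add(line)
            fetches += 1
    return fetches


unit_stride = range(0, 16)     # 16 consecutive words
stride_four = range(0, 64, 4)  # 16 words, each in a different line

print(count_line_fetches(unit_stride))  # 4
print(count_line_fetches(stride_four))  # 16
```

Both loops touch 16 words, but the unit-stride loop uses every word of each line it brings in, while the stride-4 loop uses only one word per line.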
Cache misses can occur in three ways:
-
This is the first time the CPU has asked for any data from that cache block.
These cache misses are called compulsory, cold-start, or first
reference misses. They can also occur when your job gets swapped out;
then the data in the cache on your return typically belongs to someone
else's job, and your program will take time to reload its own data as it
references them again.
-
The item was in the cache earlier but later it was replaced by other data
because the cache has insufficient capacity. These cache misses are capacity
misses.
-
The item was in the cache earlier but later it was replaced by other data
mapped into the same block or set. These misses are called conflict
or collision misses.
An example of a cache hit:
![[cache.gif]](cache.gif)
Suppose that the first three hexadecimal digits of an address form the
tag for a cache line, and that the fourth and last hex digit specifies
where in the line that item is found. A reference to address 0132
above is a hit since its tag (013) appears. A reference to address
6317 in the above picture is a cache miss (tag not in cache). When
this happens,
-
The cache makes a call to the memory system to provide the datum.
-
While waiting, the cache moves a line out to make room.
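The tag lookup in this example can be sketched in Python, assuming 16-bit addresses written as four hex digits. The function names are hypothetical, and the cache contents below are made up except for tag 013, which is the only tag the text confirms is present.

```python
def split_address(address):
    """Split a four-hex-digit address into (tag, offset): the first three
    hex digits are the tag, the last one is the position within the line."""
    return address >> 4, address & 0xF


def is_hit(address, tags_in_cache):
    """True if the line holding `address` is currently in the cache."""
    tag, _offset = split_address(address)
    return tag in tags_in_cache


# Hypothetical cache contents: only tag 0x013 comes from the text,
# the other two tags are invented for the example.
tags_in_cache = {0x013, 0x2A4, 0x7F0}

print(is_hit(0x0132, tags_in_cache))  # True: tag 013 is present
print(is_hit(0x6317, tags_in_cache))  # False: tag 631 is not present
```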
Cache line replacement policies: Which line gets moved out when
another one needs to move into the cache? Common policies are
-
Random - just grab one and throw it back to memory.
-
First In First Out - move out the one that has been in the cache the longest.
-
Least Recently Used - move out the one that has not been used for the longest
time.
Which replacement policy is likely to be "best"? Since caches
form their own hierarchy with two or three levels of caches between the
processor and main memory, it is common to have mixed strategies
within them. However, most machines tend to use LRU for their highest-level
(and largest) cache.
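To experiment with the question, here is a minimal Python sketch that counts misses under FIFO and LRU. It assumes a small, fully associative cache and a made-up reference trace; the name `misses_by_policy` is hypothetical, and the random policy is left out to keep the result reproducible. One trace proves nothing about which policy is better in general.

```python
from collections import OrderedDict


def misses_by_policy(trace, num_lines, policy):
    """Count misses in a fully associative cache with `num_lines` slots.
    `trace` is a sequence of line tags; `policy` is "fifo" or "lru"."""
    if policy not in ("fifo", "lru"):
        raise ValueError(f"unknown policy: {policy}")
    cache = OrderedDict()  # keys kept in eviction order, next victim first
    misses = 0
    for tag in trace:
        if tag in cache:
            if policy == "lru":
                cache.move_to_end(tag)  # a hit makes the line most recent
            continue
        misses += 1
        if len(cache) == num_lines:
            cache.popitem(last=False)  # evict the line at the front
        cache[tag] = True
    return misses


trace = ["A", "B", "C", "A", "D", "A", "B", "C"]  # invented example trace
print(misses_by_policy(trace, 3, "fifo"))  # 7
print(misses_by_policy(trace, 3, "lru"))   # 6
```

On this trace LRU keeps the frequently reused line A in cache while FIFO evicts it. Other traces can favor FIFO.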
Effective access time to memory depends on the cache hit ratio
ch, which is the ratio of the number of references to data in cache to
the total number of memory references:
teff = ch * tcache + (1-ch) * tmain
Typically, tmain = 10*tcache. Even though this is a
linear model, notice that small changes
in ch can have a big effect on overall performance. Consider
going from 0.99 to 0.98 to 0.89:
teff = 0.99 tcache + 0.01 tmain = 1.09 tcache
teff = 0.98 tcache + 0.02 tmain = 1.18 tcache
teff = 0.89 tcache + 0.11 tmain = 1.99 tcache
A drop of 1% in cache hits increases the effective access time by about 8%
(from 1.09 to 1.18 tcache). A drop of 10% nearly doubles the effective access
time (from 1.09 to 1.99 tcache). This can
make a 320 MHz processor look like it is running at roughly 160 MHz.
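The linear model is easy to evaluate for other hit ratios. This Python sketch measures time in units of the cache access time and uses the factor of 10 between main memory and cache from the text; the function name `effective_access_time` is invented here.

```python
def effective_access_time(hit_ratio, t_cache=1.0, t_main=10.0):
    """Effective access time from the linear model, in units of t_cache.
    The default t_main = 10 * t_cache is the typical ratio quoted above."""
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_main


for hit_ratio in (0.99, 0.98, 0.89):
    t = effective_access_time(hit_ratio)
    print(f"hit ratio {hit_ratio:.2f}: teff = {t:.2f} tcache")
```

Passing a larger `t_main` shows how quickly the penalty grows when the gap between cache and main memory widens.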
Cache associativity: Where does a line go when it is moved to
the cache?
-
Direct mapped: every line has only one slot in cache it can go to. In this
case, the replacement policy is moot.
-
Fully associative: every line can go into any slot.
-
Set associative: the cache is divided into N sets, each holding a fixed number
of lines (four in a "four-way" cache). Each line maps to exactly one set, but
can go into any of the slots within that set.
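The three organizations can be compared with one small Python simulator, since direct mapped is the one-way case and fully associative is the single-set case. This is a sketch under two assumptions not stated in the text: a line is assigned to a set by its line number modulo the number of sets, and LRU replacement is used within a set. The name `misses_by_associativity` is hypothetical.

```python
from collections import OrderedDict


def misses_by_associativity(line_numbers, num_slots, ways):
    """Misses for a cache of `num_slots` lines organized in sets of `ways`
    slots each. ways == 1 is direct mapped; ways == num_slots is fully
    associative. Assumes LRU replacement inside each set."""
    if num_slots % ways != 0:
        raise ValueError("num_slots must be a multiple of ways")
    num_sets = num_slots // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in line_numbers:
        s = sets[line % num_sets]  # assumed mapping: line number mod num_sets
        if line in s:
            s.move_to_end(line)    # mark as most recently used
            continue
        misses += 1
        if len(s) == ways:
            s.popitem(last=False)  # evict least recently used line in the set
        s[line] = True
    return misses


# Lines 0 and 8 alternate. In an 8-slot direct-mapped cache both map to slot 0.
trace = [0, 8] * 4
print(misses_by_associativity(trace, num_slots=8, ways=1))  # 8
print(misses_by_associativity(trace, num_slots=8, ways=2))  # 2
print(misses_by_associativity(trace, num_slots=8, ways=8))  # 2
```

In the direct-mapped case every reference evicts the line the next reference needs, even though seven of the eight slots are never used.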
Update policy: When does a datum move from cache to memory?
-
When its line is replaced (write-back)
-
Every time it is changed (write-through)
Which of those two is likely to be faster? Why have the other one?
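As a starting point for thinking about the first question, this Python sketch counts how many writes reach main memory under each policy for a single cached line. It is a deliberately simplified model with a hypothetical function name: it ignores the fact that a write-back moves a whole line while a write-through moves one word, and it does not address the second question.

```python
def memory_writes(ops, policy):
    """Count writes that reach main memory for one cached line.
    `ops` is a sequence of "write" (store to the cached line) and
    "evict" (the line is replaced). Simplified: every write to memory
    counts as 1, whether it moves one word or a whole line."""
    if policy not in ("write-through", "write-back"):
        raise ValueError(f"unknown policy: {policy}")
    writes = 0
    dirty = False
    for op in ops:
        if op == "write":
            if policy == "write-through":
                writes += 1   # every store goes straight to memory
            else:
                dirty = True  # remember that the line was modified
        elif op == "evict":
            if policy == "write-back" and dirty:
                writes += 1   # modified line is written once, on eviction
            dirty = False
        else:
            raise ValueError(f"unknown operation: {op}")
    return writes


ops = ["write"] * 100 + ["evict"]  # e.g. an accumulator updated 100 times
print(memory_writes(ops, "write-through"))  # 100
print(memory_writes(ops, "write-back"))     # 1
```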
Other issues with caches:
-
Multilevel caches. Several systems, the IA-64 machines in particular, have two or
three levels of cache. Which one (primary or secondary) determines performance
depends on their relative sizes and replacement policies. My claim is
that typically it is the first one that is outside the integrated chip
that the CPU is on - but I'm willing to be proven wrong. Proving me wrong
means via code performance tests, not vendor's documentation.
-
Separate caches for instructions and data. Most systems do separate out
these two streams - since the program instructions are read from memory
just like any other data, the same tricks can be applied there. Interestingly,
most scientific and engineering applications do not have much instruction
data traffic compared to the regular data traffic.
-
Interaction with I/O system: what if we replace a virtual memory page in
memory, but it has an updated value in the cache? Then the relevant values
(the ones on that page) in cache must
be "flushed", written back before the replacement occurs.
-
Virtual memory works similarly to caches. However, the ratio of access times
is ~500, not 10. Page faults are the equivalent of cache
misses in the virtual memory system.
Effects of Cache: Performance of Three Kernel Operations
Consider some examples of cache effects. First, Mflops versus vector size
for the operations
-
saxpy: y = y + alpha*x where x and y are vectors of length n and alpha
is a scalar.
-
dotpxy: alpha = x^T y, where x and y are vectors of length n.
-
dotpxx: alpha = x^T x, where x is a vector of length n.
What is the size of the cache in the test plotted above? Why does one of the
operations seem to have high performance for longer vector lengths than
the other two?
The above plot allows us to extract the effective cache size.
With modern compilers, changing optimization settings can give vastly
different results, but the general rule still holds: minimize the number of
memory references.
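For reference, the three kernels can be written out as plain Python loops, with the loads and stores per iteration noted in the docstrings. This is only a sketch for counting memory references: interpreted Python will not reproduce the measured Mflops rates, and the counts assume the scalars `alpha` and the running sum stay in registers.

```python
def saxpy(alpha, x, y):
    """y = y + alpha*x, in place. Per iteration: 2 loads, 1 store, 2 flops."""
    for k in range(len(x)):
        y[k] = y[k] + alpha * x[k]
    return y


def dotpxy(x, y):
    """alpha = x^T y. Per iteration: 2 loads, no stores, 2 flops."""
    alpha = 0.0
    for k in range(len(x)):
        alpha += x[k] * y[k]
    return alpha


def dotpxx(x):
    """alpha = x^T x. Per iteration: 1 load, no stores, 2 flops."""
    alpha = 0.0
    for k in range(len(x)):
        alpha += x[k] * x[k]
    return alpha


print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))    # [12.0, 24.0]
print(dotpxy([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
print(dotpxx([3.0, 4.0]))                       # 25.0
```

All three do two flops per iteration, so any difference in performance comes from the memory traffic.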
Second Example of Cache Effects: matrix accesses for matrix-matrix
multiply. The operation is C = C + A*B, where A, B, and C are n x n matrices.
The standard way of writing the algorithm is
    for i = 1:n
        for j = 1:n
            temp = C(i,j)
            for k = 1:n
                temp = temp + A(i,k)*B(k,j)
            end
            C(i,j) = temp
        end
    end
Trace through the memory access pattern in executing this algorithm, assuming
matrices are stored column-by-column (as they are in Matlab and Fortran).
C programmers: don't get cocky - you have the same number of cache misses
in C ... they just appear in a different order!
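One way to start the trace is to list the memory offsets touched by the innermost loop under each storage order. The Python sketch below does this for a small n; the helper names are hypothetical, offsets are counted in words from the start of each matrix, and i, j, k are 1-based as in the algorithm above.

```python
def col_major(i, j, n):
    """0-based word offset of entry (i, j) of an n x n matrix stored
    column-by-column, as in Matlab and Fortran."""
    return (i - 1) + (j - 1) * n


def row_major(i, j, n):
    """Same entry for row-by-row storage, as C lays out a 2-D array."""
    return (i - 1) * n + (j - 1)


def inner_loop_offsets(offset, n, i=1, j=1):
    """Offsets of A(i,k) and B(k,j) touched as k runs from 1 to n."""
    a = [offset(i, k, n) for k in range(1, n + 1)]
    b = [offset(k, j, n) for k in range(1, n + 1)]
    return a, b


n = 4
print(inner_loop_offsets(col_major, n))  # ([0, 4, 8, 12], [0, 1, 2, 3])
print(inner_loop_offsets(row_major, n))  # ([0, 1, 2, 3], [0, 4, 8, 12])
```

With column storage A is accessed with stride n and B with stride 1; with row storage the roles swap, so one of the two operands is accessed with stride n in either language.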
Getting good data locality is a matter of analyzing the loads and stores
needed by your program. That will be illustrated with some simple linear
algebra operations later.
- Started: 17 Aug 1995
- Updated: Fri Jan 30 10:26:42 EST 2004
- Updated: Sun 09 Sep 2007
- Updated: 13 Oct 2008, modernizing some of the flop rates.
- Modified: Wed 16 Sep 2009, 08:00 PM to size images.
- Last Modified: Wed 16 Sep 2009, 08:00 PM