Memory Hierarchy and Data Locality
Generally, the limiting factor in numerical computations is not the
processor rate, but the ability to get data to and from computational units.
This happens at multiple levels: from the CPU to caches, from caches to main
memory, from main memory to hard drives, between computers connected
by a network, all the way to data collection systems requiring manual data
entry. The level emphasized in P573 is between a cache and main memory, but
the principles are readily extended to other mismatches in data access time.
Memory sizes
Modern applications require large amounts of memory, but larger memory
systems take longer to supply a datum than small ones
do. As an example, suppose
-
A machine runs at 3.4 Ghertz and has a theoretical peak computation
rate of 6.8 Gflops/sec. [The extra factor of 2 comes from the common case of a
floating point unit having a "fused multiply-add" unit, capable
of carrying out one add and a multiply in one clock cycle.]
-
The "vector update operation", otherwise known as saxpy (or daxpy), is
usually implemented via:
for k = 1:n
    y(k) = y(k) + a*x(k)
end for
This has to read two operands x(k) and y(k) per loop iteration, and write one
operand y(k). The scalar a will be loaded once before the loop starts
and generally held in a register for the lifetime of the loop.
With three memory accesses for every two floating point operations, the amount of data
needed to maintain the full theoretical computation rate is
(3/2) * 6.8 = 10.2 Gwords/sec = 81.6 Gbytes per second.
A memory system might have a bus width of 8 bytes (conveniently, the
size of a double precision word), and a typical bus speed would be 1667 Mhertz.
That means the memory system can supply one double precision word every 1/1667 microsecond,
or 1.667 words every nanosecond.
The daxpy operation above requires 10.2 Gwords/sec, but the memory system supplies data
at a rate of only 1.667 Gwords/sec, about 16% of the needed rate.
These particular numbers will change with time,
but the underlying, fundamental mismatch likely will still remain.
This mismatch is called the "memory wall" or the processor-memory bottleneck.
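For concreteness, here is the same loop written as a plain C function (a minimal sketch; the function name and interface are illustrative, not a library routine):
/* daxpy: y <- y + a*x, the kernel analyzed above.
   Each iteration does 2 flops (one multiply and one add, possibly fused)
   and makes 3 memory references: load x[k], load y[k], store y[k].
   At 6.8 Gflops/sec that requires about 10.2 Gwords/sec of memory traffic. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int k = 0; k < n; k++)
        y[k] = y[k] + a * x[k];
}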
Memory systems
Specialized hardware handles this data access bottleneck
in several ways.
Multiple interleaved memory banks divide the memory
into banks, and successive words in memory are spread across the banks. If,
e.g., there are 4 memory banks, then while memory bank 1 is supplying word 1,
memory bank i can start accessing word i, for i = 2, 3, 4. Word 5 is located
back in bank 1, and so forth. If multiple memory buses are used, this can make
memory look 4x faster.
This is the concept underlying "quad channel memory" in PCs, and the reason
why memory DIMMs usually have to be installed in pairs with the same timings.
In a 4-bank system,
what if only every fourth word in a memory stream is required, as occurs in accessing a
matrix by columns in C (row-major storage), where the number
of columns is a multiple of 4? Then all the operands come from one bank, and
performance is back down to the single-bank rate computed above. [Actually, even lower speeds
are common, because a single bank typically takes longer to provide a
second datum after the first is provided.]
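A rough sketch of the bank mapping, assuming word-interleaved addressing so that word w lives in bank w mod 4 (a simplification of real hardware; the array size is made up):
/* Word-interleaved banks: word w lives in bank (w % NBANKS).
   Walking down a column of an n x n row-major array touches words
   j, j+n, j+2n, ...; if n is a multiple of NBANKS, every reference
   lands in the same bank. */
#include <stdio.h>

#define NBANKS 4

int main(void)
{
    int n = 8;                       /* column stride: a multiple of NBANKS */
    for (int i = 0; i < 6; i++) {
        long word = (long)i * n;     /* word address of A[i][0] */
        printf("A[%d][0] -> word %ld -> bank %ld\n", i, word, word % NBANKS);
    }
    return 0;
}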
Vector machines such as Cray, NEC, and Fujitsu supercomputers typically
have many memory banks,
on the order of hundreds. Avoiding bank conflicts,
which occur when consecutive memory references fall in the same memory bank,
is crucial for performance on those machines. However (unfortunately,
according to many application scientists), vector machines are not as readily
available in the U.S. as they once were. The Wikipedia article on the
Cray X1
vector processor is a good source for more information about relatively modern Cray vector machines.
(A simpler system to consider when learning about vector processing is the
ancient Cray 1 system.)
Using memory banks is now well-embedded, and performance hits from bank
conflicts are relatively minimal on non-vector machines. Sidenote: all modern processors
are pipelined, for memory access, instruction execution, and operations.
"Vector machine" here refers to deeply pipelined machines, sometimes with
no cache.
Caches:
A more generally useful technique is to use a memory hierarchy,
which is implemented on every modern general purpose processor, including
some vector machines. The ideas are illustrated by the following diagram.
[Figure: A Simplified Workstation Memory Hierarchy]
The picture on modern HPC (high-performance computing) machines is
much more complicated: as well as multiple levels of cache there
may be some main memory located further away than other parts of
main memory. However, the
ideas behind using a single cache are the same basic ones that underlie
using a memory hierarchy in general: once you have gone to the expense
of moving a datum to a faster part of the hierarchy, do all you can with
it before it has to move back to a slower part. This is one form of
data re-use.
Cache analysis
Computers have an intrinsic clock that sends out signals to all components,
keeping them in synchronization. Each time interval is called a clock
cycle. This clock is not the same as the one you can invoke from a
C++ or Fortran code. Instead it is a heartbeat whose period is the inverse of
the Ghertz rating of the CPU; e.g., 3.2 Ghertz corresponds to a clock
cycle time of 1/3.2 = 0.3125 nanoseconds.
-
When a memory reference is to datum already in cache, the access is called
a cache hit.
-
When a memory reference is not in cache, the access is called a
cache miss.
-
Cache hits can supply a datum in 1-2 clock cycles; cache misses typically
take 10-50 cycles.
-
To take advantage of any spatial data locality,
instead of bringing in a single word the hardware brings in a line of words,
usually 4-32 words. Suppose the cache line size is 4 words; referencing any of memory addresses
0-3 will cause all of them to be brought into cache. (A small sketch of the
effect of access stride on cache line traffic appears after this list.)
Slight digression:
memory addresses are indexed from 0, while mathematical entities usually
start from 1. The indexing scheme in a code depends on the programming language:
C/C++ and Java index from 0, Matlab starts with 1, while Fortran can index from
any integer, including negative integers. For mathematical operations and arrays,
these notes typically use 1-based indexing, which matches most algorithm statements;
the exceptions are discussions of computer hardware and some signal processing
algorithms like FFTs (Fast Fourier Transforms).
-
Where does a line go in cache? Some hardware restricts locations in the
cache to which a line can be written. Also,
when does it get moved out? Since the cache is much smaller than main
memory, eventually it will get full and further memory accesses will
require moving a line out of the cache to make room for the new
line being referenced.
-
Various schemes exist for where in cache the line goes (direct mapped,
set associative, fully associative). This is not critically important in
scientific computing, generally because a poor policy simply makes the
cache appear smaller than it really is.
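Here is the small sketch of stride effects mentioned in the cache line item above, assuming a line size of 4 doubles (the function and array are illustrative only):
/* Spatial locality sketch.  With a cache line of 4 doubles, the
   stride-1 sweep takes about one cache miss per 4 elements, while the
   stride-4 sweep can miss on every element, moving roughly 4x as many
   cache lines for the same amount of arithmetic. */
double sum_strided(const double *x, int m, int stride)
{
    double s = 0.0;
    for (int k = 0; k < m; k++)
        s += x[k * stride];          /* x must hold at least m*stride elements */
    return s;
}
/* compare: sum_strided(x, m, 1) versus sum_strided(x, m, 4) */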
Cache misses can occur in three ways:
-
This is the first time the CPU has asked for any data from that block of memory.
These cache misses are called compulsory, cold-start, or
first reference misses. They can also occur when your job gets swapped out;
the data in the cache when your job resumes typically belong to someone
else's job, and your program will take time to reload its own data via
fresh references.
-
The item was in the cache earlier but later it was replaced by other data
because the cache has insufficient capacity. These cache misses are capacity
misses.
-
The item was in the cache earlier but later it was replaced by other data
mapped into the same block or set. These misses are called conflict
or collision misses.
An example of a cache hit:
Suppose that the first three hexadecimal digits of an address form the
tag for a cache line, and that the fourth and last hex digit specifies
where in the line that item is found. A reference to address 0132
is then a hit if its tag (013) appears in the cache. A reference to address
6317 is a cache miss if its tag (631) is not in the cache. (This tag/offset
split is sketched in code after the steps below.) When a miss happens,
-
The cache makes a call to the memory system to provide the line containing the datum.
-
While waiting for the data to arrive, the cache moves a line out to make room.
If the ousted line has been modified since it was loaded, it is written back
to memory. Otherwise, it is simply overwritten by the newly arriving line.
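The tag/offset split used in that example can be written out in a few lines of C; the 4-hex-digit addresses and 16-word lines are assumptions of the example above, not of any real machine:
/* Split a 4-hex-digit word address into a tag and a within-line offset:
   the top three hex digits select the line (tag), the last digit selects
   the word within the line (16 words per line). */
#include <stdio.h>

int main(void)
{
    unsigned addrs[] = { 0x0132, 0x6317 };
    for (int i = 0; i < 2; i++) {
        unsigned tag    = addrs[i] >> 4;   /* first three hex digits */
        unsigned offset = addrs[i] & 0xF;  /* last hex digit          */
        printf("address %04X -> tag %03X, word %u in the line\n",
               addrs[i], tag, offset);
    }
    return 0;
}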
Cache line replacement policies: Which line gets moved out when
another one needs to move into the cache? Common policies are
-
Random:
just grab one and throw it back to memory
-
FIFO (First In First Out):
move out the one that was moved into the cache earliest.
-
LRU (Least Recently Used):
move out the one which has not been used in the longest time.
Which replacement policy is likely to be "best"? Since caches
form their own hierarchy with two or three levels of caches between the
processor and main memory, mixed strategies are common.
However, most machines tend to use LRU for their highest-level
(and largest) cache.
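As a rough sketch of how LRU behaves, here is a tiny simulation of a fully associative cache with four line slots; the reference trace is made up purely for illustration:
/* Tiny fully associative cache simulator with LRU replacement.
   slots[] holds line tags; stamp[] holds the time of last use.
   On a miss with a full cache, the least recently used slot is evicted. */
#include <stdio.h>

#define SLOTS 4

int main(void)
{
    long slots[SLOTS], stamp[SLOTS];
    long trace[] = { 1, 2, 3, 4, 1, 5, 2, 1 };   /* line tags referenced */
    int used = 0, hits = 0;
    int n = (int)(sizeof trace / sizeof trace[0]);

    for (int t = 0; t < n; t++) {
        int found = -1;
        for (int s = 0; s < used; s++)
            if (slots[s] == trace[t]) found = s;
        if (found >= 0) {               /* cache hit */
            hits++;
            stamp[found] = t;
        } else if (used < SLOTS) {      /* cold-start miss: a free slot exists */
            slots[used] = trace[t];
            stamp[used] = t;
            used++;
        } else {                        /* miss: evict the least recently used */
            int lru = 0;
            for (int s = 1; s < SLOTS; s++)
                if (stamp[s] < stamp[lru]) lru = s;
            slots[lru] = trace[t];
            stamp[lru] = t;
        }
    }
    printf("%d hits out of %d references\n", hits, n);
    return 0;
}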
A model for memory access times
The effective access time t_eff to memory depends on the cache hit ratio
c_h, which is the ratio of the number of references to data in cache to
the total number of memory references. The basic model for effective access time is:
    t_eff = c_h * t_cache + (1 - c_h) * t_memory
Typically, t_memory = 10 * t_cache. Even though this is a
linear model, notice that small changes
in c_h can have a big effect on overall performance. When
going from a cache hit ratio of 0.99 to 0.98 to 0.89:
    t_eff = 0.99 t_cache + 0.01 t_memory = 1.09 t_cache
    t_eff = 0.98 t_cache + 0.02 t_memory = 1.18 t_cache
    t_eff = 0.89 t_cache + 0.11 t_memory = 1.99 t_cache
A drop of 1% in the cache hit ratio increases the effective access time by about 8%.
A drop of 10% nearly doubles the effective access time. This can
make a 3.2 Ghertz processor look like it is running at 1.6 Ghertz.
In reality, the ratio of t_memory to t_cache is more likely to
be 50, not 10. In that case even a cache hit ratio of 99% increases the effective
access time by about 50%. A cache hit ratio of 67% (about what a naive implementation of a
matrix-matrix multiply operation would get) leads to
    t_eff = 17.2 t_cache,
making that 3.2 Ghertz processor seem like one running at 186 Mhertz. A 200 Mhertz
processor is the fastest one available ... in 1995.
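These numbers are easy to check; here is a minimal C snippet that evaluates the model for the hit ratios and t_memory/t_cache ratios used above:
/* Effective access time model, t_eff = c_h * t_cache + (1 - c_h) * t_memory,
   reported in units of t_cache. */
#include <stdio.h>

int main(void)
{
    double hit_ratio[] = { 0.99, 0.98, 0.89, 0.67 };
    double mem_over_cache[] = { 10.0, 50.0 };

    for (int r = 0; r < 2; r++)
        for (int i = 0; i < 4; i++) {
            double ch   = hit_ratio[i];
            double teff = ch * 1.0 + (1.0 - ch) * mem_over_cache[r];
            printf("t_memory = %2.0f t_cache, c_h = %.2f: t_eff = %5.2f t_cache\n",
                   mem_over_cache[r], ch, teff);
        }
    return 0;
}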
More about cache policies
Cache associativity: Where in the cache does a line go when it is moved to
the cache?
-
Direct mapped:
every line has only one slot in cache it can go to. In this
case, the replacement policy is irrelevant.
-
Fully associative:
every line can go into any slot.
-
Set associative:
a "four-way" set associative cache, e.g., is divided into sets of four
lines each; a memory line maps to one set and can go into any of those four slots.
Update policy: When does a datum get moved from cache to memory?
-
Write-back:
when its line is replaced
-
Write-through:
every time it is changed
Which of those two is likely to be faster? Why have both of them?
Other issues with caches:
-
Multilevel caches.
Several systems, e.g., Intel's Core i7, have
three or four levels of cache. Which one (primary, secondary, or tertiary)
determines performance?
That depends on their relative sizes and replacement policies. My claim is
that typically it is the first one that is outside the integrated chip
that the CPU is on - but I'm willing to be proven wrong. Proving me wrong
means via code performance tests, not vendor's documentation.
-
Separate caches for instructions and data.
Most systems separate
these two streams.
Since the program instructions are read from memory
just like any other data, the same tricks can be applied there. Interestingly,
most scientific and engineering applications have little instruction
traffic compared to the regular data traffic.
-
Interaction with the I/O system
- What if a virtual memory page is replaced in
memory, but it has an updated datum in the cache? Then the relevant values
(the ones on that page) in cache must
be "flushed", written back before the replacement occurs.
-
Virtual memory
works similarly to caches. However, the ratio of access times
is ~500, not 10. A page fault is the equivalent of a cache
miss in the virtual memory system.
Effects of Cache: Performance of Three Kernel Operations
Cache effects can be seen by timing vector operations for
varying lengths of vectors. Because time will increase as the length
increases, plotting the rate (in Mflops/second) versus vector size
makes those effects more easily visible. The following plots are
for the operations
-
saxpy: y = y + alpha*x where x and y are vectors of length n and alpha
is a scalar.
-
dotpxy: alpha = x^T y, where x and y are vectors of length n.
-
dotpxx: alpha = x^T x, where x is a vector of length n.
The plot above allows extracting the effective cache size.
What is the cache size of the machine that generated the data?
Why do two of the
operations have high performance for longer vector lengths than
the other one?
With modern compilers,
changing optimization settings can give vastly different results,
but the general rule still holds: minimize the number of memory references.
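Here is a rough sketch of how such a plot can be generated for the saxpy kernel (the vector lengths, repetition counts, and use of the standard clock() timer are illustrative; careful measurements also need attention to compiler optimization settings, as noted above):
/* Time the saxpy kernel y = y + a*x over a range of vector lengths and
   report the rate in Mflops/sec, the quantity plotted above.  Rates drop
   once the vectors no longer fit in cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    double a = 2.5, csum = 0.0;
    for (int n = 1000; n <= 4000000; n *= 2) {
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (int k = 0; k < n; k++) { x[k] = 1.0; y[k] = 2.0; }

        int reps = 200000000 / n + 1;        /* keep total work roughly constant */
        double t0 = (double)clock() / CLOCKS_PER_SEC;
        for (int r = 0; r < reps; r++)
            for (int k = 0; k < n; k++)
                y[k] = y[k] + a * x[k];
        double t1 = (double)clock() / CLOCKS_PER_SEC;

        printf("n = %8d   %8.1f Mflops/sec\n", n,
               2.0 * n * reps / (t1 - t0) / 1.0e6);
        csum += y[n / 2];                    /* keep the compiler from discarding the loop */
        free(x); free(y);
    }
    printf("checksum = %g\n", csum);
    return 0;
}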
Second example of cache effects: matrix accesses for matrix-matrix
multiply. The operation is C = C + A*B, where A, B, and C are n x n matrices stored
in 2-d arrays.
The standard way of writing the algorithm is
for i = 1:n
    for j = 1:n
        temp = C(i,j)
        for k = 1:n
            temp = temp + A(i,k)*B(k,j)
        end
        C(i,j) = temp
    end
end
Trace through the memory access pattern in executing this algorithm, assuming
matrices are stored column-by-column (as they are in Matlab and Fortran).
C++ programmers: don't get cocky - you have the same number of cache misses
in C++, they just appear in a different order.
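For the C/C++ case, here is the same triple loop with row-major storage (a sketch using C99 variable-length array parameters, not code from the course); the comments mark which operand is the strided one in this layout:
/* C = C + A*B with row-major storage.  In the inner k loop,
   A[i][k] walks along a row (stride 1, good cache line reuse) while
   B[k][j] walks down a column (stride n, a new cache line per step). */
void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double temp = C[i][j];
            for (int k = 0; k < n; k++)
                temp += A[i][k] * B[k][j];
            C[i][j] = temp;
        }
}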
Getting good data locality is a matter of analyzing the loads and stores
needed by your program. That will be further illustrated with some simple linear
algebra operations, but the principles are applicable well beyond linear algebra.
Historically the relative performances of the vector operations have not changed.
Here's a graph of results from an SGI Power Challenge circa 1994:
At the time, the Power Challenge was considered a "supercomputer", and indeed
the performance rates are breathtaking: about 10% of what a workstation provided
in 2011, eleven computer generations later (assuming 18 months = one generation).
Next: pipelining
- Started: 17 Aug 2011
- Last Modified: Mon 04 Nov 2019, 06:55 AM