Experimentally Determining the Size of a Cache Line


As with the overall cache size, a simple experiment can determine the line size. Choose a value n large enough that an array x of length n does not fit into cache; e.g., n = 10^9 suffices until caches reach 8 Gbytes in size. Then compute the dot product of two vectors of that length and get the Gflop/sec computational rate. Do this for strides (AKA step sizes) of s = 1, 2, 4, 8, .... [Keep in mind when computing the Gflop/sec rate that the number of flops is 2*(n/s), not 2*n.]
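As a rough sketch of the measurement, a strided dot product timer in C might look like the following. The array size, timer, and function names here are illustrative assumptions, not the code used for the results below:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile double sink;   /* keeps the sum live so the loop is not optimized away */

    /* Time a dot product over stride s and return the rate in Gflop/sec.
       Only n/s elements are touched, so the flop count is 2*(n/s), not 2*n. */
    static double strided_dot_rate(const double *x, const double *y, long n, long s)
    {
        struct timespec t0, t1;
        double sum = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i += s)
            sum += x[i] * y[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        sink = sum;
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return 2.0 * (double)(n / s) / secs / 1e9;
    }

    int main(void)
    {
        long n = 1L << 27;   /* assumed size: 2^27 doubles = 1 Gbyte per array, well past cache */
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.5; }

        for (long s = 1; s <= 64; s *= 2)
            printf("stride %3ld: %6.3f Gflop/sec\n", s, strided_dot_rate(x, y, n, s));

        free(x); free(y);
        return 0;
    }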

Assuming that the cache line size is some power of two, when the stride is increased from 1 to 2 only half of the vector data brought into the cache gets used. A cache miss triggers a full line of contiguous data to be loaded into the cache, and a stride of 2 means that only every other word in the line is used.

Each time the stride is doubled, the amount of data brought into the cache that actually gets used in the computation is halved; the rest is brought in because the memory system always loads a full line. For some stride s, however, doubling the stride to 2*s won't cause a performance drop. That occurs when the stride s brings in a line of data of length s but then uses only one word per line. Doubling to 2*s also brings in a line of s words, and again only one word gets used. The next line of s words is not referenced at all, so it does not add to the memory traffic and the performance no longer drops.

So, e.g., if the line size is 8 words, accessing x(33) with a stride s = 8 will bring in x(33:40) but use only the first element x(33). When the stride is doubled to s = 16, referencing x(33) brings in the same line x(33:40). However, the next reference is x(49), which is again a cache miss and brings in x(49:56). A stride of s = 16 skips over and never touches the line of data in x(41:48), but a stride of s = 8 does reference it. Both strides incur a cache miss on every multiply-add, and it is (mostly) irrelevant where the line sits in memory; once s reaches the line size, the hit ratio stops changing no matter how large s grows.

In class, an example was shown for a cache line size of 4 words. With s = 1, the cache hit ratio was 75%; with s = 2 it was 50%; and for s = 4, 8, 16 the ratio remains 25%. So as the stride s increases, performance stops dropping once s reaches the cache line size; the first doubling that causes no further drop is at s = 2*(cache line size).

Nominally this effect could be determined using just one large vector length n. But to make sure it is not a fluke, time things for several vector lengths. The number of lengths needed is not as large as for determining the cache size, where delineating the breakpoint in performance was important.

Following are some results from an Intel Quad Core Q9550. As with the cache size experiments, four compiler optimization levels are used. Computational rates are in Mflop/sec. The optimization level changes the absolute performance, but the approximate halving of performance as the stride increases holds for all of them ... except -O0.


Compiler Optimization: -O3 on an Intel Quad Core Q9550

[cacheline-longly3.png]

The performance is roughly halved going from s = 1 → 2 → 4 → 8, but doubling the stride again to s = 16 does not significantly decrease the speed. On this machine the cache line is probably 8 doubles = 64 bytes wide. Now see what happens for the same computations, but with decreasing levels of compiler optimization:


Compiler Optimization: -O2 on an Intel Quad Core Q9550

[cacheline-longly2.png]


Compiler Optimization: -O1 on an Intel Quad Core Q9550

[cacheline-longly1.png]


Compiler Optimization: -O0 on an Intel Quad Core Q9550

[cacheline-longly0.png]

So why doesn't the performance drop with increasing stride hold for the -O0 (no optimization) case? Compare the absolute rates across the different compiler optimization levels. The -O3 computation is 4-5 times faster than the -O0 one. With no optimization the computation is so slow that the arithmetic, not the memory access, becomes the bottleneck, and the cache effects are largely hidden.


So Who Cares?

Knowing the cache line size provides some information to populate the cache model

    t_eff = c_h * t_cache + (1 - c_h) * t_memory

which has three quantities: the cache hit ratio c_h, the cache access time t_cache, and the memory access time t_memory. With those, the effective memory access time t_eff, and in turn the execution time, can be predicted for a code.
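Evaluating the model is one line of C; the sample numbers below are illustrative assumptions, not measurements:

    /* Effective access time from the two-level cache model:
       t_eff = c_h * t_cache + (1 - c_h) * t_memory */
    double t_eff(double c_h, double t_cache, double t_memory)
    {
        return c_h * t_cache + (1.0 - c_h) * t_memory;
    }

    /* e.g., t_eff(0.75, 1.0, 100.0) = 25.75 cycles for a 75% hit ratio,
       assuming a 1-cycle cache and a 100-cycle memory. */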

Knowing the cache line size is not as critical as knowing the cache size itself, and most importantly how to use that knowledge in practice. However, strided vector accesses are common in scientific computing, and not just for accessing 2D arrays by rows and columns. The mathematical formulation of the fast Fourier transform (FFT) doubles the stride through its arrays on each step, just as was done above. More generally, hierarchical methods have a log-tree structure that also increases the stride from one level to the next.

Accesses with many different strides can be reduced to just two by copying the widely separated data contiguously into a temporary array and then writing it back out when finished computing with it. The experiments above give a good indication of the strides for which this should be done, and of the performance benefit or penalty to expect, which in turn determines whether it is worth the coding time and effort.
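A minimal sketch of that pack/compute/unpack idea in C, with a hypothetical function name and a stand-in for the real computation:

    /* Gather stride-s elements of x into a contiguous buffer, compute on
       them at stride 1, then scatter the results back.  Worthwhile only
       when the stride-1 work is enough to amortize the two copy passes. */
    void packed_update(double *x, long n, long s, double *buf)
    {
        long m = n / s;                      /* number of strided elements */

        for (long i = 0; i < m; i++)         /* pack: one strided pass     */
            buf[i] = x[i * s];

        for (long i = 0; i < m; i++)         /* compute at stride 1        */
            buf[i] = 2.0 * buf[i] + 1.0;     /* stand-in for the real work */

        for (long i = 0; i < m; i++)         /* unpack: one strided pass   */
            x[i * s] = buf[i];
    }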

