Intro

RAM data burst (prefetch):
burst_length
SDRAM – 1
DDR – 2
DDR2 – 4
DDR3 – 8

since data_burst = burst_length*bus_width, and typical bus_width is 64bit (word in case od SDRAM):
SDRAM – 8B
DDR – 16B
DDR2 – 32B
DDR3 – 64B
DDR4 – 64B (due to typical cache line size. 128B might waste time and not be so effective)

If cache line is 64 bytes long, cpu will need to read SDRAM 8 times, 8B each.  OR use RAM data burst to load whole cache line in single memory access.

Burst ordering

-sequential
-interleaved (przeplatane)
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/nanya-nt5ds-datasheet.pdf#page=13

Replacement policies

https://en.wikipedia.org/wiki/Cache_replacement_policies

Structure

Digital logic abstract

From programmers point of view, knowing exact digital schematic is not needed (essentially these are just registers, muxes and few comparators), so I will use abstract structures to explain implementation of data cache.

Cache line

Cache line is memory cell – it has data space of defined size, and address that points to it.
Typical single modern cache line consist of 64 bytes, but for ease of explaining lets assume we have 4 byte cache. It can be viewed like this:

Of course, byte order depend on HW architecture.

Since this structure is supposed to mirror real memory (ie. flash) that is using 32bit addressing (0x0000_0000 – 0xFFFF_FFFF), this 32 bits need to point to correct cache line/byte. So we need to not only to find address of whole cache line, but also know which byte is needed in case of byte access. To address 4 elements, we need 2 bits, because 2^2 = 4. These bits are commonly called cache INDEX, while the remaining bits of address are called cache TAG:

Example: so for cpu with 64 byte cache lines, INDEX would be 6 bits long (2^6 = 64) (address[5:0]) and TAG 32 – 6 = 26 (address[31:6]).

This structure is cache address and it maps directly to memory address.

Also each cache line consist of status bits, which depends on architecture. For example ARM typically use two bits:

valid set to 0 on reset, otherwise always valid – simply says that cache lines has never been used yet
dirty set when corresponding data in memory is different than in cache

these 2 bits, would extend the view of single cache line to:

This explain how to interpret single cache line. But single cache is of not much use (unless it is instruction cache) and these are grouped together in many different ways.

Fully associative cache

In previous chapter I used 4 bytes cache line as example. If such cache would be used with 32bit flash memory 4 bytes aligned memory, 1 cache line would mirror 1 flash cell. Such construct is also called Fully Associative. Since reading is boring, let’s move to examples.

Example 1: Read memory with cache

My test bench will be imaginary CPU:
– Fully associative 16 byte cache, 4 cache lines – 4 bytes of data each
– 4 x 32bit registers, 8 bit program counter (PC)
– 32 byte flash piece with 32bit addressing and 32 bit access
– code & ram is separated
4 instructions:
– LDR dst, src ; load data register
– STR dst, src; store data register
– ADD dst, src1, src2; dst = src1 + src2
– B #dst ; branch

Check this beauty below:

 

Lets assume memory access timings in cycles:
– single instruction 1 cycle (use of immediate  value)
– register 1 cycle
– cache 8 cycles
– flash/memory 64 cycles

Without cache, it is easy to calculate total cycles per iteration:
LDR (memory: 64) + LDR (memory: 64) + ADD(registers: 1) + STR(memory: 64) + B (immediate: 1)
194 cycles for each iteration.

Now let’s see what happens with cache available – connect the clock, power up the target and let’s roll:

Step 1: first loop

Action Cost
PC point to 0x00 – LDR. LDR try to access memory at address 0x1234_0004, but before accessing flash, cache is being searched. Since the target was just boot up, cache is empty and nothing was found, so memory access occurs wasting 64 cycles. But when flash read occurs, empty cache line is loaded, mirroring 0x1234_0004 current value. R0 is loaded with new data.
Cache line valid bit is set to 1.
64 cycles

Step 2: first loop

Action Cost
PC increments and now point to code address 0x01 – LDR. LDR ask for 4 bytes at address 0x1234_0008. Again CPU search cache before reading memory, but finds nothing – resulting in accessing flash directly :(. Data at address 0x1234_0008 is loaded to empty cache line and R1. 64 cycles

Step 3: first loop

Action Cost
PC increments and now point to code address 0x02. R0 and R1 are added and result is stored in R2. 1cycle

Step 4: first loop

Action Cost
PC increments and now point to code address 0x03. STR instruction write value in R2 to address 0x1234_008. What interesting happens, is that now, that value in flash is different than in cache mirror, CPU set Dirty bit to 1 – meaning that value in cache is no longer valid mirror! 64 cycles

Step 5: first loop

Action Cost
PC increments and now point to code address 0x04. B instruction changes PC back to 0, starting the loop again, but since its not reset, cache & memory stay intact. 1cycle

Step 6: second loop

As it can be seen above, first loop took exactly the same amount of time like no cache version! But let’s see second loop.

Action Cost
PC again point to 0x00 – LDR. LDR try to access memory at address 0x1234_0004, but before accessing flash, cache is being searched. This time, cache line with matching address is found! Since Dirty bit is not set, it means that data inside is the same like in flash, so this data is loaded from cache to r1 with 8 cycles (cache) instead of 64 (memory). When CPU find cache line with valid data, this event is called Cache Hit. 8 cycles

Step 7: second loop

Action Cost
PC point to 0x01. LDR try to access memory at address 0x1234_0008, but before accessing flash, cache is being searched. This time, cache line with matching address is found! But, Dirty bit is set, meaning data in cache line is no longer valid! So CPU need to issue memory access (64 cycles) to read current value from memory and at the same time, update cache line & set Dirty bit to 0. 64 cycles

Step 8,9,10: second loop

All other steps follow steps 3,4,5 in number of cycles used. As it visible, when there is no cache at all it took 194 cycles every single iteration. With this simple cache 1 memory access time was reduced from 64 to 8, resulting in 128 cycles in (0 + n ; n > 0) iteration. Executing this loop 1000 times will save 56000 cycles and this is simplest of cache!

Let’s assume the clock would run at 1kHz, so 1 cycle would take 1 ms, so 56000 cycles  would be 56 seconds faster than no cache version.
Although in modern computing 1 cycle can take way less than nano second, but in contrast data operations are getting bigger and bigger and memory can be the real choke point.

Direct-mapped cache

No lets modify test bench from example 1: now flash memory allow for burst access of size 4 bytes and length 2. So with single read/write operation two 8 bytes flash cells can be accessed!

The number od data held in cache line can be extended to hold whole 8 bytes flash burst:

But the interesting part is LDR and STR still can read only 32 bits at a time. How to address bytes 4, 5, 6, 7 then?
Translation from flash address to cache address will now change:

Since there are 8 bytes to be addressed, 3 bits are needed as INDEX. Lets say there is

2-way associative

Real life use

intel, arm, ppc

Real life examples

Pre-fetch

mats:
https://lwn.net/Articles/252125/