Warning

4. Hierarchical memory model, cache

Exercise outline

1. Introductory example
2. Basic cache design and principle

What should I repeat before the tutorial

1. Repeat how the cache works
2. Repeat meaning of:
1. fully associative cache
2. N-way set associative cache
3. direct-mapped cache
3. Repeat what write-through and write-back means.
4. Finish the bubble sort from the last exercise.

Content of the class

Introductory example

A. How long (in ms) it will take to execute a program with 1000 instructions on a 1GHz processor? Assume that one instruction lasts one clock.

B. How long it will take to execute the program on the same processor when 20% of instructions are memory accesses. For simplicity assume that one memory access lasts 70ns.

C. What is the speedup of the program (with 1000 instructions) when we replace the previous processor with a 2GHz version. The memory access time remains the same.

D. How long it will take to execute the program with old 1GHz processor and we will add a cache. This cache has 80% hit rate and search for value in cache take 5ns.

Cache

In the following part we will work with 12 bit address, data are 8bits wide.

Directly mapped cache

1. Plot directly mapped cache with 4 records.
2. Explain how to find correct record. How many bits do we need for tag and for record address.
Example

Run the QtMips simulator and select simple processor without cache in the basic setup (No pipline no cache). We will use variant which is compatible with real MIPS CPU architecture, that is with enabled delay-slot execution. Delay-slot fill by suitable instruction or NOP is left to the assembler (we do not use directive set .noreorder om the source).

Next step is to compile following program (it is available in /opt/apo/selection-sort-en directory). The program implements simple sorting algorithm (Selection-Sort) run on 15 integer numbers. Load program into QtMips simulator and open windows with memory access statistics (Window → program cache and Window → Data cache) and take a note about number of memory reads and writes as well as number of additional cycles required to access memory.

As a next step, configure simulator to use four words of directly mapped data and program memory.

The cache size 4 words = number of sets 4 x number of words in block 1 x degree of associativity 1

Measure results for as many combinations of cache parameters as possible while maintaining cache size 4 words (product of number of sets, block size and degree of associativity). Observe changes of the cache content and answer following questions:

1. Find Hit and Miss count.
2. How many memory accesses the program executes?
3. Check how these figures changes when cache size is increased to 8 and 16 words.

// Simple sorting algorithm - selection sort

// Select the CPU core configuration with delay-slot

// Directives to make interresting windows visible
#pragma qtmips show registers
#pragma qtmips show memory

.set noreorder
.set noat

.globl    array

.text
.globl _start
.ent _start

_start:

la   $a0, array addi$s0, $0, 0 //Minimum value from the rest of the array will be placed here. (Offset in the array, increasing by 4 bytes). addi$s1, $0, 60 // Maximal index/offset value. Used for cycle termination = number of values in array * 4. add$s2, $0,$s0 //Working position (offset)
// $s3 - offset of the smallest value found so far in given run //$s4 - value of the smallest value found so far in  given run
// $s5 - temporary main_cycle: beq$s0, $s1, main_cycle_end nop add$at, $a0,$s0
lw   $s4, 0($at)   // lw  $s4, array($s0)
add  $s3,$s0, $0 add$s2, $s0,$0

inner_cycle:
beq  $s2,$s1, inner_cycle_end
nop
add  $at,$a0, $s2 lw$s5, 0($at) // lw$s5, array($s2) // expand bgt$s5, $s4, not_minimum slt$at, $s4,$s5
bne  $at,$zero, not_minimum
nop

addi $s3,$s2, 0
addi $s4,$s5, 0
not_minimum:
addi $s2,$s2, 4
j inner_cycle
nop
inner_cycle_end:
add  $at,$a0, $s0 lw$s5, 0($at) // lw$s5, array($s0) sw$s4, 0($at) // sw$s4, array($s0) add$at, $a0,$s3
sw   $s5, 0($at)  // sw $s5, array($s3)

addi $s0,$s0, 4
j main_cycle
nop
main_cycle_end:

//Final infinite loop
end_loop:
cache 9, 0(\$0)  // flush cache memory
break           // stop the simulator
j end_loop
nop

.end _start

.data
// .align    2 // not supported by QtMips

array:
.word    5, 3, 4, 1, 15, 8, 9, 2, 10, 6, 11, 1, 6, 9, 12

// Specify location to show in memory window
#pragma qtmips focus memory array

Hint: If you need to execute long program and you are interested only in the final results, you can use Run button and select Max speed to execute instructions automatically. Program is finalized by instructions to flush the cache and stop execution (break) on which continuous execution stops.

Fully associative cahce

1. Explain differences between fully associative cache and directly mapped cache.
2. Draw a fully associative cache for 4 records (12 bit address and 8 bit data).
3. Explain how to find a place for the given address.
Example

In QtMips set parameters of the cache as shown on the figure bellow. Size 4 word = number of sets 1 x block size 1 and degree of associativity 4.

Run the program again with new cache parameters and check cache performance.

1. Find Hit and Miss count.
2. How many memory accesses the program executes?
3. Change the cache to have 8 records. Check how this change affects the hit/miss count.

N-way set associative cache

1. Explain differences between this and previous caches. Are there any advantages over above versions of cache?
2. Draw a 2-way set associative cache. There are 4 records in cache (12 bit address and 8 bit data).
3. Explain how to find a place for the given address.
Example

In Mips set parameters of the cache as shown on the figure bellow. Size 4 words = number of sets 2 x block size 1 x degree of associativity 2.

Run the program again with new cache parameters and check cache performance.

1. Find Hit and Miss count.
2. How many memory accesses the program executes?
3. Change the cache to have 8 records. Check how this change affects the hit/miss count.