5. Hierarchical memory model, cache - part two

for lecturer: tutorial 5

Content of the class

Aim of today's class is to deepen knowledge about cache, to clarify and to show example of time and spatial locality principle. Try to optimize cache parameters.

Example A

Consider following example:

.globl start
.globl    array

.ent start

start:

la   s0, array     // load address of the first value in array to register s0
addi s1, zero, 3     // Number of cycle repetitions (number of iterations)
nop                 // Why nop here? Try to remove it. Will it have any influence on hit rate?

loop:
    beq s1,zero, end
    lw  s2, 0(s0)    // Read 0th value in array
    lw  s2, 4(s0)    // Read 1st value in array (4/4 = 1)

    lw  s2, 36(s0)   // Read 9th value (36/4=9)
    lw  s2, 40(s0)   // Read 10th value (40/4=10)

    addi s1, s1, -1
    j loop

end:
nop  

end_loop:           //Final infinite loop
    fence
    ebreak
    j end_loop

.data

.org 0x400

//.align    2

array:
.word    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

Now consider instruction and data cache with size of 8 words (1 word = 4B) each. The cache is word-aligned – it means that address of the first byte in word is divisible by 4. Replacement policy is LRU (lease recently used). Fill-in following table for instruction and data cache. Study in details execution of the above program and study accesses into memory and cache.

Data cache:

	Directly mapped			N-way set associative	Fully asociative
Size/Block size/Blocks in sets:	8/1/1	8/2/1	8/8/1	8/2/2	8/4/2	8/2/4	8/1/8
Hit Count:
Miss Count:
Hit Rate:

Instruction cache:

	Directly mapped			N-way set associative	Fully asociative
Size/Block size/Blocks in sets:	8/1/1	8/2/1	8/8/1	8/2/2	8/4/2	8/2/4	8/1/8
Hit Count:
Miss Count:
Hit Rate:

Questions

What are your conclusions? Which cache has the best hit rate?
Is the fully associative cache always the best?
Does the hit rate depend on executed program or is it parameter of the cache?
Is it possible to have data in cache that will never be accessed? (Data we are not working with, instructions which are never executed,, etc…)
Is there optimal value for Block size parameter? Is there any rule for the value of Block size?
Find all information stored in the cache.
Find type of instruction and data cache use your computer? Find parameters - size, block size, blocks in set.

Directly mapped cache have one block in set. Fully associative cache has as many ways as it has blocks.

Example B

In the second exercise, you had been assigned the task to implement a program that computes the sum of two vectors each of four elements. Modify the program to work with vectors of 16 words each. Locate the three vectors exactly one after another. Set the cache parameters to 4 sets with four-word blocks and associativity 2 (32 = 4 x 4 x 2). Select LRU replacement policy and write-back. Set burst access time to 2. Observe the behavior if the target vector matches one of the input ones. Repeat when there are three independent vectors. Continue evaluation with 4 words inserted between the second and third vector. Change the cache parameters to (32 = 2 x 4 x 4).

Describe observed behavior and results.

Note. Because write-back policy is used, do not forget to include final instruction
fence
to flush cache to main memory and clean cache.

Example C

Take the program from the last class (selection sort) and cache with size of 8 words. Is there any replacement policy which you expect to perform better that others? Consider both data and instruction cache. Verify your conclusions with simulation.

Command line version of the simulator can be used to sped-up evaluation of different cache parameters.

qtrvsim_cli --i-cache lru,4,1,1 --d-cache lru,4,2,4,wb --burst-time 2 --dump-cache-stats --dump-range array,60,sorted-data.dat selection-sort

Parameters –i-cache and –d-cache specify configuration of cache memories the same way as for graphical version. –burst-time specifies sequential reads/writes access time 2 cycles, –dump-cache-stats command the emulator to dump statistic at program exit (when execution reaches break instruction). –dump-range allows to dump some memory range as word per line (address can be specified as label). Complete list of parameters is listed when –help parameter is specified.

Questions

Try the best parameters from Example A. Do these the parameters achieve the best hit ratio here as well?
Do you know another replacement policies different from those in MIPS Simulator? Try to find any on the internet or in the lectures.
Which write policy you expect to have the lowest impact on memory bus. (Transmits the lowest size of data).

Virtual memory D

Additional questions:

How does virtual address translation work?
What is the page directory base register?
What information is stored in the page table entry?
If a 32-bit processor with 32-bit virtual and physical address memory management uses a paging system to map physical memory to virtual address spaces, and one page size is 4 KB, how many table levels will be used if it is required that each page table directory or leaf table fit into one page of memory for each level and how will the virtual address be split? The size of one page table entry is 4B.

Task E

Implement C/C + + program, which prints parameters of cache memory of your processor (processor on which is program executed) and size of virtual memory page. Use library function sysconf, and parameters _SC_PAGESIZE, _SC_LEVEL1_DCACHE_LINESIZE etc. COmplete list of parameters supported by GNU C library (GLIBC) can be obtained from library header file bits/confname.h. Parameters are documented in the section Constants for Sysconf of library manual.

Compare results which are obtained by by read of pseudo-filesystems exported by operating system kernel. Use commands

cat /proc/cpuinfo
cat /sys/devices/system/cpu/cpu0/cache/index0/size
find /sys/devices/system/cpu/cpu0/cache/ -type f -maxdepth 2 -print -exec cat '{}' ';'

Other options are commands lscpu, sudo dmidecode if you have administrator rights.

Task F

Re-implement program selection from previous seminaries to C/C++ language. Determine hit rate in the first level instruction and data cache. Continue with hit rate of the last level cache – when you run the program on the computer in KN:2 laboratory. How many accesses to data cache is used? How many instructions have been executed? Try to change optimization level specified in compilation command.

Advices:The cachegrind tool allows to evaluate how program access memory and caches.

 
valgrind --tool=cachegrind ./a.out

where a.out is compiled program executable. Program cg_annotate can be used for detailed analysis what happens during program execution. Example:

cg_annotate filename ~/main.cpp

where filename is the name of the file generated by cachegrind tool. Specify the path to your source file (main.cpp) as absolute. Attention, program has to be compiled with g (debug info) switch on.

Remark: If not enough time is left, you can use bubble sort algorithm from the third seminar.

Example of the program output follows:

Table of Contents

5. Hierarchical memory model, cache - part two

Content of the class

Example A

Questions

Example B

Example C

Questions

Virtual memory D

Task E

Task F