## B4M36ESW: Efficient software Lecture 7: Memory, caches, algorithms

#### Michal Sojka michal.sojka@cvut.cz



April 23, 2018

#### 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
  - True and false sharing
  - NUMA

## 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Types or RAM

- Static RAM (SRAM)
  - Fast but expensive
  - 6 transistors per bit

## Dynamic RAM (DRAM)

- Capacitor (Dis)Charging is not instantaneous
- Reading discharges capacitor (write after read)
- Compromise: capacity/size/power consumption







## DRAM in the computer





 CPU contains a memory controller (MC)

- MC talks to DRAM chips via "Memory Bus", using a protocol
- Details on next slides

## How DRAM chips work?

- Addressing individual cells is impractical (many wires)
- Chip is organized in rows and columns (and banks), address is multiplexed
- In the chip, row and column multiplexers (green and pink rectangles) select the lines according to address bits
- R/W operations happen in many chips in parallel to work with the whole data word (64 bits)
- Writing: New value is put on Data signal after row and column address were selected (see next slide)
  - It takes some time to charge the capacitors



#### Why is DRAM slow?

## SDRAM communication protocol

- Access protocol is synchronous
   there is a clock signal
- SDRAM (Synchronous DRAM)
- CLK provided by memory controller (FSB frequency – typ. 800–1600 MHz)
  - Double/Quad-pumped
- Max. speed: 64 bit × 8 × 200 MHz = 12.8 GB/s
  - Not reachable in reality
- DRAM technology requires t<sub>RCD</sub> and CL delays (they cannot be shortened)
- Data sent in bursts
  - Size of the burst corresponds to cache-line size
  - Sending just one word would be very inefficient due to t<sub>RCD</sub> and CL delays



## Timing parameters of standard DDR4 modules

| Standard<br>name                        | Memory<br>clock<br>(MHz) | I/O bus<br>clock<br>(MHz) | Data<br>rate<br>(MT/s) | Module<br>name | Peak<br>transfer<br>rate<br>(MB/s) | Timings,<br>CL-tRCD-<br>tRP      | CAS<br>latency<br>(ns) |
|-----------------------------------------|--------------------------|---------------------------|------------------------|----------------|------------------------------------|----------------------------------|------------------------|
| DDR4-1600J*<br>DDR4-1600K<br>DDR4-1600L | 200                      | 800                       | 1600                   | PC4-<br>12800  | 12800                              | 10-10-10<br>11-11-11<br>12-12-12 | 12.5<br>13.75<br>15    |
| DDR4-1866L*<br>DDR4-1866M<br>DDR4-1866N | 233.33                   | 933.33                    | 1866.67                | PC4-<br>14900  | 14933.33                           | 12-12-12<br>13-13-13<br>14-14-14 | 12.857<br>13.929<br>15 |
| DDR4-2133N*<br>DDR4-2133P<br>DDR4-2133R | 266.67                   | 1066.67                   | 2133.33                | PC4-<br>17000  | 17066.67                           | 14-14-14<br>15-15-15<br>16-16-16 | 13.125<br>14.063<br>15 |
| DDR4-2400P*<br>DDR4-2400R<br>DDR4-2400U | 300                      | 1200                      | 2400                   | PC4-<br>19200  | 19200                              | 15-15-15<br>16-16-16<br>18-18-18 | 12.5<br>13.33<br>15    |

Source: Wikipedia

#### 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Cache terminology

Spatial locality: accessed memory objects are close to each other

- Code: inner loops
- Data: structures (reading of one field is often followed by reads of other fields)
- Temporal locality: The same data will be used multiple times in a short period of time
  - Code: loops
  - Data: e.g. digital filter coefficients are accessed every sampling period
- Cache hit: memory request is serviced from the cache, without going to higher level memory
- Cache miss: opposite of cache hit
  - cold miss, capacity miss, conflict miss
  - true sharing miss, false sharing miss
- Cache line eviction: cache line is removed from the cache to make space for new data
- Cache replacement policy: LRU, pseudo LRU, random

#### 1 Why is DRAM slow?

#### 2 Caches

#### Architecture

- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## CPU caches - big picture

- All loads/stores go through cache
- CPU ↔ Cache: fast connection
- **Cache**  $\leftrightarrow$  Main memory: FSB bus
- It is advantage to have separate caches for instructions and data





## Cache associativity

#### Direct-mapped cache

- simple
- Fully associative cache
  - ideal
- Set associative cache
  - compromise

## **Direct-mapped cache**



- Each memory location has just one cache line associated with it.
- Memory locations at multiples of cache size always collide!
- Besides data, cache stores the tag

## Self-evicting of code



```
__attribute__((hot)) void outer_func();
__attribute__((hot)) void inner_func();
```

```
void outer_func() {
  for (int i = 0; i < 1000; i++)
     inner_func();
}
void inner_func() {
    // do something
}</pre>
```

- Two cache misses every iteration (instruction fetches)!
- Solution: Improve code layout by putting related (and hot) functions together.

## Cache write policies

Write-back "Common" case. Written data is cached for later reuse.

Write-through Written data bypass the cache and therefore never evicts other data from the cache. Useful when we know the data will not be needed soon.

```
#include <emmintrin.h>
void _mm_stream_si32(int *p, int a);
void _mm_stream_si128(int *p, __m128i a);
void _nm_stream_pd(double *p, __m128d a);
#include <commintrin.h>
void _nm_stream_pi(__m64 *p, __m128 a);
void _nm_stream_sc(float *p, __m128 a);
void _nm_stream_sd(double *p, __m128 a);
void _nm_stream_sd(float *p, __m128 a);
```

Write-Combining All writes to the cache line are combined together and written at once. This avoids one memory read, because when the cache line is fully overwritten, there is no point in reading the old value. Write combining is often used for frame buffer memory (e.g. filling screen with a color).

## Set associative caches



Examples: LRU, random, ...

## 1 Why is DRAM slow?



Architecture

#### Memory performance characteristics

- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Sequential access



## Random access



Intel Core i7-2600, (perf counters not in scale)

## Translation Lookaside Buffer (TLB)

- Caches translation of virtual to physical address
- On TLB miss, page walk has to be performed (2 to 5 levels)
- Intel i7-2600 has 512 L2 TLBs  $\Rightarrow$  512×4 kB = 2 MB
- Improvement: use so called huge pages (1 page = 2 MB, PS=1)
  - Linux: in some cases automatically or explicitly via hugetlbfs



Figure 4-8. Linear-Address Translation to a 4-KByte Page using IA-32e Paging

## Cache-related preemption delay

- When a thread is preempted by another thread, the preempting thread likely evicts some data from the cache.
- After preemption ends, the preempted thread continues executing and experiences a lot of cache misses!
- Certain (older) architectures has to flush TLBs when switching address spaces (processes).
  - Modern architectures allow tagging TLBs with address space identifier (ASID, PCID, ...)
- High-performance software tries to limit preemptions.
  - Beware limiting preemption increases response time!

#### 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Data structures and cache friendliness

- Arrays + sequential access nice
- Dynamically allocated linked lists depends on memory allocator (probably like random access – bad)
- Search trees random access
- For most data structures/algorithms, there exist cache-optimized variants.
- These are more tricky than textbook examples.

## Dynamic memory allocator (malloc(), new)

Memory allocators try to maintain spacial and temporal locality
 Spatial locality is hard to achieve when heap is fragmented
 after many new/delete operations
 Temporal locality – when memory is freed/deleted, subsequent allocation tries to use that memory because it is cache-hot.

#### 1 Why is DRAM slow?



- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Matrix multiplication

Naive implementation

#### Matrix multiplication: Naive



Totals: mem:24 cache hits:11 ≅45%

- One matrix element: double (8 B)
- Cache line size: 16 B
- Fully associative caches
- L2 cache: 128 B, L1 cache: 32 B

## Implementation with transposition

#### Matrix multiplication: B transposed



Totals: mem:24 cache hits:15 ≅62%



Performance (execution time): naive: 100%, transposed: 23,4%

## **Tiled implementation**

#### Matrix multiplication: Tiled, B transposed



```
for (k1 = 0; k1 < N; k += tile)
for (j1 = 0; j1 < N; j += tile)
for (i1 = 0; i1 < N; i += tile)
for (i = i1; i < i1 + tile; ++i)
for (j = j1; j < j1 + tile; ++j)
for (k = k1; k < k1 + tile; ++k)
C[i][j] += A[i][k] * B[k][j];</pre>
```

- Each "tile" fits into the cache
- Performance: 17.3% of naive implementation (9.5% with vectorized (SIMD) operations)

## Tiled implementation and L1 cache

#### Matrix multiplication: Tiled, B transposed



No temporal L1 cache hit in B
75% L1 hits (in total)

## Two-level tiled implementation

#### Matrix multiplication: 2-level tiled, B transposed



```
 \begin{array}{l} \text{for } (k2 = 0; \, k2 < N; \, k2 + \text{tile2}) \\ \text{for } (j2 = 0; \, j2 < N; \, j2 + \text{tile2}) \\ \text{for } (i2 = 0; \, i2 < N; \, i2 + \text{tile2}) \\ \text{for } (k1 = k2; \, k1 < k2 + \text{tile2}; \, k + \text{tile1}) \\ \text{for } (j1 = j2; \, j1 < j2 + \text{tile2}; \, j + \text{tile1}) \\ \text{for } (i1 = i2; \, i1 < i2 + \text{tile2}; \, i + \text{tile1}) \\ \text{for } (i = i1; \, i < i1 + \text{tile1}; + +i) \\ \text{for } (j = j1; \, j < j1 + \text{tile1}; + +j) \\ \text{for } (k = k1; \, k < k1 + \text{tile1}; + +k) \\ \text{C[i][j] + = A[i][k] * B[k][j]; } \end{array}
```

79% L2 hits

## **Recursive matrix multiplication**

- Generalization to arbitrary number of cache levels
- $N \times N$  multiplication = 8 multiply-add of  $(N/2) \times (N/2)$  multiplications

$$\begin{bmatrix} C_{11} & C_{12} \\ \hline C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{bmatrix} \times \begin{bmatrix} B_{11} & B_{12} \\ \hline B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{12} & A_{22} \\ \hline A_{21}B_{11} & A_{21}B_{12} \end{bmatrix} \times \begin{bmatrix} A_{12}B_{21} & A_{12}B_{22} \\ \hline A_{22}B_{21} & A_{22}B_{22} \end{bmatrix}$$

## 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- Caches & memory in multi-processor systems
   True and false sharing
   NUMA
  - NUMA

## Cache coherency

In symmetric multi-processor (SMP) systems, caches of the CPUs cannot work independently from each other.

- Maintaining of uniform view of memory for all processor is called "cache coherency"
- If some processor writes to a cache line, other processors have to clean the corresponding cache line from their caches.
  - Remember: inter-core (inter-socket) communication is "slow"
- Cache synchronization protocol: MESI(F)
  - A dirty cache line is not present in any other processor's cache.
  - Clean copies of the same cache line can reside in arbitrarily many caches.

#### 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## True sharing

- Program is slow because cache line with shared data travel from one core to another.
- Typically example of true sharing: each mutex is shared between CPUs.
- When that is a problem (too much contention):
  - make locking more fine-grained,
  - or change your data structure (e.g. per-CPU data),
  - and/or algorithms to be more cache friendly.

```
std::atomic_int32_t counter;
```

```
void thread_cpu0() {
  while (true)
     counter++;
```

}

```
void thread_cpu1() {
  while (true)
     counter++;
}
```

#### All CPUs executing atomic increment of global variable



## False sharing

Data accessed from different CPUs is not shared but happen to be stored in a single cache line.

// Per-CPU counters (FIXME: Do not hardcode cache line size)
std::atomic\_int32\_t counter\_cpu0 \_\_attribute\_\_((align(64)));
std::atomic\_int32\_t counter\_cpu1 \_\_attribute\_\_((align(64)));

```
void thread_cpu0() {
    while (true)
        counter_cpu0++;
    }

void thread_cpu1() {
    while (true)
    counter_cpu1++;
}
```

Linux: How to determine cache size at run time: sysconf(\_SC\_LEVEL1\_DCACHE\_LINESIZE);

## 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications

# Caches & memory in multi-processor systems True and false sharing NUMA

## Non-Uniform Memory Access (NUMA)



## Thread migrations between cores

#### OSes tend to do load balancing

- By default threads are automatically migrated from overloaded to underloaded cores
- Migrated threads loose their cache footprint (cache-related migration delay)
- Migrated threads loose their NUMA locality
- If you do your own load balancing in the application, pin the threads to CPUs (set their CPU affinity):

```
cpu_set_t cpuset;
pthread_t thread;
thread = pthread_self();
/* Set affinity mask to include only CPU 2 */
CPU_ZERO(&cpuset);
CPU_SET(2, &cpuset);
s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
```

## libnuma (Linux)

#include <numa.h>

```
int numa available(void);
int numa_max_possible_node(void);
int numa_num_possible_nodes();
int numa max node(void);
//...
int numa preferred(void);
void numa_set_preferred(int node);
void numa set interleave mask(struct bitmask *nodemask);
//...
void numa bind(struct bitmask *nodemask);
void numa set localalloc(void);
void numa set membind(struct bitmask *nodemask);
```

#### 1 Why is DRAM slow?

#### 2 Caches

- Architecture
- Memory performance characteristics
- Data structures and dynamic memory allocations
- Matrix multiplications
- 3 Caches & memory in multi-processor systems
   True and false sharing
   NUMA

## Size matters

- Even though we have terabytes of memory, size and layout of the data structures still matters.
- Only few kilobytes of memory is fast, the rest is slow!



 Ulrich Drepper, "What Every Programmer Should Know About Memory", 2007/11 [online], http://people.redhat.com/drepper/cpumemory.pdf