# **Effective Software**

Lecture 5: Data races, synchronization, atomic operations, non-blocking algorithms

David Šišlák <u>david.sislak@fel.cvut.cz</u>

# **Data Races – Multi-threaded Environments**

```
public int A = 0;
public int B = 0;
public int C = 0;
public int D = 0;
```

|   | Thread 1                                                                    | Thread 2                                                                    |  |
|---|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|--|
| V | <pre>public void method1() {     int r2 = A;     B = 1;     D = r2; }</pre> | <pre>public void method2() {     int r1 = B;     A = 2;     C = r1; }</pre> |  |

» what can be the results for C and D?

## **Data Races – Multi-threaded Environments**

```
public int A = 0;
public int B = 0;
public int C = 0;
public int D = 0;
```

|        | Thread 1                                                                    | Thread 2                                                                    |  |
|--------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|--|
| ↓<br>↓ | <pre>public void method1() {     int r2 = A;     B = 1;     D = r2; }</pre> | <pre>public void method2() {     int r1 = B;     A = 2;     C = r1; }</pre> |  |

- » what can be the results for C and D?
  - C=0, D=0
  - C=1, D=0
  - C=0, D=2
  - anything else?

# Data Races – Disassembled Method and Assembly Code

| instructions reordered in C2 c     | ompiler:     |      | 4B / 8B – Klass ref. |
|------------------------------------|--------------|------|----------------------|
|                                    | IS. recuri   |      | 8B - mark word       |
|                                    | 15. noturn   | ,    |                      |
|                                    | 12: putfield | #5 / | // Field D:I         |
|                                    | 11: iload_1  |      |                      |
|                                    | 10: aload_0  |      |                      |
|                                    | 7: putfield  | #3 / | // Field B:I         |
| }                                  | 6: iconst_1  |      |                      |
| D = r2;                            | 5: aload_0   |      |                      |
| $\mathbf{B}=1;$                    | 4: istore_1  |      |                      |
| <b>int</b> r2 = <b>A</b> ;         | 1: getfield  | #2 / | // Field A:I         |
| <pre>public void method1() {</pre> | 0: aload_0   |      |                      |
|                                    |              |      |                      |

#### **RSI** is this

\$0x1,0x10(%rsi)

;\*putfield B

; - datarace.DataRace::method1@7 (line 11)

... object data

0x0000000106399253: mov 0xc(%rsi),%r11d 0x0000000106399257: mov %r11d,0x18(%rsi) ;\*putfield D ; - datarace.DataRace::method1@12 (line 12)

- » the same reordering happens in method2 resulting into fourth output
  - C=1, D=2

0x00000010639924c: movl

### **Data Races – CPU Execution Pipelining**

» simplified non-parallel instruction pipelining in each core



# Data Races – CPU Memory Model

» CPU vs. core vs. thread

|      |              |      |      |      |      |      |      |      |      |                             | L1 Data Cache |              |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
|------|--------------|------|------|------|------|------|------|------|------|-----------------------------|---------------|--------------|----------------------|-----------|-----------|--------------|--------------|--|--|--|--|--|--|--|------|------|--|
| Core |              | Core |      | Core |      | Core |      | Core |      | Core                        |               | L            | Size                 | Line Size | Latency   | Associativty |              |  |  |  |  |  |  |  |      |      |  |
|      | core         |      |      |      | core |      | core |      |      |                             | conc          |              | IL                   | 32 KB     | 64 bytes  | 4 ns         | 8-way        |  |  |  |  |  |  |  |      |      |  |
|      |              |      |      |      |      |      |      |      |      |                             |               |              | L1 Instruction Cache |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
|      | L1           | L1   | L1   | L1   | L1   | L1   | L1   | L1   | L1   | L1 L1                       |               | L1           |                      | Size      | Line Size | Latency      | Associativty |  |  |  |  |  |  |  |      |      |  |
|      | inst         | dat  | inst | dat  | inst | dat  | inst | dat  | inst | dat                         | inst          | dat          |                      | 32 KB     | 64 bytes  | 4 ns         | 4-way        |  |  |  |  |  |  |  |      |      |  |
|      |              |      |      |      |      |      |      |      |      |                             |               |              |                      |           |           |              |              |  |  |  |  |  |  |  | L2 C | ache |  |
|      | L2           |      | L2   |      | L2   |      | L2   |      | L2   |                             | L2            |              |                      | Size      | Line Size | Latency      | Associativty |  |  |  |  |  |  |  |      |      |  |
|      |              |      |      |      |      |      |      |      |      |                             |               |              | 256 KB               | 64 bytes  | 10 ns     | 8-way        |              |  |  |  |  |  |  |  |      |      |  |
|      |              |      |      |      |      |      |      |      |      | L3 Cache                    |               |              |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
| L3   |              |      |      |      |      |      |      |      | Size | Line Size                   | Latency       | Associativty |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
| L    |              |      |      |      |      |      |      |      |      | 12 MB 64 bytes 50 ns 16-way |               |              |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
|      |              |      |      |      |      |      |      |      |      | Main Memory                 |               |              |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
|      | Main Memory  |      |      |      |      |      |      |      |      | Size                        | Line Size     | Latency      | Associativty         |           |           |              |              |  |  |  |  |  |  |  |      |      |  |
|      | Wall Wellory |      |      |      |      |      |      |      |      | 64 bytes                    | 75 ns         |              |                      |           |           |              |              |  |  |  |  |  |  |  |      |      |  |

- » all writes to main memory are done in write-back cache mode
  - standard writes requires data to be cached (expensive cache miss)
  - non-temporal writes (especialy useful for large block writes)
  - prefetch instructions available

# Data Races – CPU Execution Pipelining – Superscalar Execution

- » modern CPUs have multiple execution units in each core (8 in Intel Haswell)
  - units have various capabilities (4x integer ALU, 2x FPU mul, 2x mem read, ...)
  - multiple µops with various latency executed in parallel during each per cycle
- » independent instructions can be executed out-of-order or in parallel
  - not using the same register or address
- » memory reads are never reordered
  - parallel independent reads
- » later (independent) reads can be reordered and executed before writes
  - serialized writes only



# **Volatile Variable – Memory Barrier**

#### making A and B volatile:

```
public volatile int A = 0;
public volatile int B = 0;
public int C = 0;
public int D = 0;
public void method1() {
    int r2 = A;
    B = 1;
    D = r2;
}
```

#### results into assembly code:

| 0x000000010710e08c: | mo∨  | 0xc(%rsi),%r11d   |
|---------------------|------|-------------------|
| 0x000000010710e090: | movl | \$0x1,0x10(%rsi)  |
| 0x000000010710e097: | lock | addl \$0x0,(%rsp) |
| 0x000000010710e09c: | mo∨  | %r11d,0x18(%rsi)  |

| 8B - mark word       |  |  |  |  |  |
|----------------------|--|--|--|--|--|
| 4B / 8B – Klass ref. |  |  |  |  |  |
| object data          |  |  |  |  |  |

- » operations over volatile are not reordered in C2 compiler
- » no need for read barriers not reordered during execution in CPU
- » lock prefix forbids all reordering around and synchronize previous writes to be visible by all others CPUs
- » *lock addl \$0x0,(%rsp)* is fastest memory barrier no operation inside CPU

# Volatile Variable

- » never cached thread-locally all access directly to main memory
- » guarantees **atomic read and write** operations (defines memory barrier)
- » can be used for both primitives and objects (references)
- » don't block thread execution
- » BUT:
  - volatile writes are much slower due to cache flush (~100x)
  - volatile reads (if there are writes) are slower (~25x, #CPU/cores)
    - due to invalidated cache
  - still faster than synchronization/locks
- » not necessary for:
  - immutable objects
  - variable accessed by only one thread
  - where variable is within complex synchronized operation

```
public class VolatileCounter {
    private volatile int cnt=0;
    public int get() {
        return cnt;
    }
    public void increment() {
        cnt++;
    }
}
```

» will it work as expected in multi-threaded environment?

## Counter Example - Volatile

```
public class VolatileCounter {
    private volatile int cnt=0;
```

```
public int get() {
    return cnt;
}
public void increment() {
    cnt++;
}
}
```

#### increment assembly code:

0x00000010911544c: mov 0xc(%rsi),%edi **RSI is this** 0x000000010911544f: inc %edi 0x0000000109115451: mov %edi,0xc(%rsi) 0x0000000109115454: lock addl \$0x0,(%rsp)

- 8B mark word 4B / 8B – Klass ref. ... object data
- » will it work as expected in multi-threaded environment?
  NO
- » volatile
  - not suitable for read-update-write operations
  - useful for one-thread write (e.g. termination flag)
    - must be used if flag is set by different thread otherwise C2 compiler could create infinite loop without testing

# **Volatile Arrays**

```
public class VolatileIntArray {
    private volatile int[] array;

public VolatileIntArray(int capacity) {
    array = new int[capacity];
    }

public int get(int index) {
    return array[index];
    }

public void put(int index, int value) {
    array[index] = value;
    }
}
```

» Is put operation to array member volatile?

### **Volatile Arrays**

}

# public class VolatileIntArray { private volatile int[] array;

```
public VolatileIntArray(int capacity) {
    array = new int[capacity];
}
public int get(int index) {
    return array[index];
}
public void put(int index, int value) {
    array[index] = value;
}
```

8B - mark word

4B / 8B – Klass ref.

... object data

8B - mark word

4B / 8B - Klass ref.

4B – array length

sequence of values

» Is put operation to array member volatile?

**NO** – see assembly code, there is no cache synchronization with lock

```
# this:
           rsi:rsi
                     = 'datarace/VolatileIntArray'
                     = int
# parm0:
           rdx
# parm1:
           rcx
                     = int
0x000000011170bbcc: mov
                         0xc(%rsi),%esi
0x00000011170bbcf: shl
                          $0x3,%rsi
                                            ;*getfield array
                                            ; - datarace.VolatileIntArray::put@1 (line 15)
0x00000011170bbd3: movslq %edx,%rdi
0x00000011170bbd6: cmp
                          0xc(%rsi),%edx
                                            ; implicit exception: dispatches to 0x00000011170bbef
                          0x00000011170bbf9 — ArrayOutOfBoundsException
0x00000011170bbd9: jae
                         %ecx.0x10(%rsi,%rdi,4) ;*iastore
0x000000011170bbdf: mov
                                            ; - datarace.VolatileIntArray::put@6 (line 15)
```

#### private volatile int[] array;

```
public void put(int index, int value) {
    array[index] = value;
    array = array;
}
```

```
8B - mark word
4B / 8B – Klass ref.
... object data
```

- » just array reference is volatile
- » added unnecessary array reference update adds assembly code

0x000000010db21a67: mov %r8d,0xc(%rsi)
0x00000010db21a80: lock addl \$0x0,(%rsp) ;\*putfield array

- ; datarace.VolatileIntArray::put@12 (line 16)
- » lock prefix forbids all reordering around and synchronize previous writes to be visible by all others CPUs
- » not suitable for read-update-write operations

# **Counter Example – Synchronized and ReentrantLock**

```
public class SynchronizedCounter {
                                             public class ReentrantCounter {
    private int cnt=0;
                                                 private int cnt=0;
                                                 private ReentrantLock lock = new ReentrantLock();
    public int get() {
                                                 public int get() {
        return cnt;
    }
                                                     return cnt;
                                                 }
    public synchronized void increment() {
                                                 public void increment() {
        cnt++;
                                                     lock.lock();
    }
}
                                                     try {
                                                         cnt++;
                                                     } finally {
                                                         lock.unlock();
                                                 }
                                             }
```

- » no issue with read-update-write operations
- » synchronized
  - method vs. block
  - object instance vs. class instance (static methods)

# JVM - Synchronize Implementation



- » prototype mark word in Klass
- » lock records in stack (on pre-compiled locations for compiled code)
  - 8B displacement of original object mark word recursive lock has 0
  - 4B / 8B compressedOOP/OPP to locked object
- » thin locking using CAS instruction on lock/unlock to modify mark word
  - use spin-locking (10 cycles with volatile read + NOPs) before fat locking
- » fat locking monitor object on heap (created by inflating, deflating)
  - contended lock or call of wait/notify
  - monitor: original mark word, OS lock, conditions, set of threads; support parking

20<sup>th</sup> March 2017

# JVM - Synchronize Implementation



- » **biasing locking** fast locking/unlocking by single thread without any CAS
  - biasable enabled 4 seconds after JVM start (startup-up, learning)
  - different thread and valid epoch -> instance re-biasing OR thin/fat locking
  - global safe pointing needed biasable, re-biasing, bias revocation
  - bulk operations amortizing cost for safe pointing (all instance types)
     >20 re-biasing -> bulk re-biasing (increment epoch in prototype, scan locks)
     >40 re-biasing -> bulk revocation (change in prototype)
  - mark word **normalization** during GC preserve hashed, locked, un-biasable
  - identity hash or fat lock disable instance biasing locking

# JVM - Synchronize Implementation



### **Reentrant Locks**

- » extended operations in comparison to **synchronized**:
  - lock(), unlock()
  - lockInterruptibly() throws InterruptedException
  - boolean tryLock()
  - boolean tryLock(long timeout, TimeUnit unit) throws InterruptedException
- » fairness
  - **new** ReentrantLock(boolean fair), by default unfair
  - **synchronized** is unfair
  - unfair ReentrantLocks are slightly faster than synchronized
    - but another instance in HEAP
  - fair locks are slower (~100x)

### **Counter Example – AtomicInteger**

```
public class AtomicCounter {
    private AtomicInteger cnt = new AtomicInteger( initialValue: 0);
    public int get() {
J
         return cnt.get();
    public void increment() {
1
         cnt.incrementAndGet();
 }
AtomicInteger implementation:
private static final long valueOffset;
static {
    try {
       valueOffset = unsafe.objectFieldOffset
           (AtomicInteger.class.getDeclaredField( name: "value"));
    } catch (Exception ex) { throw new Error(ex); }
private volatile int value;
public final int getAndAddInt(Object var1, long var2, int var4) {
   int var5;
   do {
                                                                           non-blocking
       var5 = this.getIntVolatile(var1, var2);
                                                                           pattern
    return var5;
public final int getAndIncrement() {
                                                                                       20
    return unsafe.getAndAddInt( o: this, valueOffset, i: 1);
}
```

# **Counter Example – AtomicInteger – Assembly Code**

#### <u>C2 compiler assembly code for AtomicCounter::getAndIncrement:</u>



- » while cycle optimized and replaced with **single instruction**
- » lock prefix forbids all reordering around and synchronize previous writes to be visible by all others CPUs
- » **lock prefix** ensures that core has exclusive ownership of the appropriate cache line for the duration of the operation
  - cache coherency using MESIF (Haswell) with fall-back to mem bus lock
- » AtomicInteger-based counter is fastest of all for multi-threaded

#### **Atomic Operations**

- » 32-bit CPUs support 64-bit CAS operations
  - **cmpxchg** src\_operand, dst\_operand implicit lock prefix
- » 64-bit CPUs support 128-bit CAS operations
  - **cmpxchg16b** works with RDX:RAX and RCX:RBX register pairs
- » JAVA uses only 64-bit version in java.util.concurrent.atomic
  - AtomicBoolean
  - AtomicInteger
  - AtomicLong
  - AtomicReference
  - AtomicIntegerArray
  - AtomicLongArray
  - AtomicReferenceArray

# **Atomic Field Updaters**

- » suitable with large number of object of the given type it saves memory
  - don't require single instance to have an extra object embedded
- » refer variable "normally" without getter and setters

```
public class ObjectWithAtomic {
   private final AtomicInteger value =
        new AtomicInteger(0);
   // ...
   public void method1() {
       // ...
        if (value.compareAndSet(1, 2)) {
            // ...
        }
    }
3
public class ObjectWithAtomic {
    private static AtomicIntegerFieldUpdater<ObjectWithAtomic>
        valueUpdater = AtomicIntegerFieldUpdater.nevUpdater(ObjectWithAtomic.class, "value");
    private volatile int value = 0;
    // ...
    public void method1() {
        // ...
        if (valueUpdater.compareAndSet(this, 1, 2)) {
            // ...
        }
```

# **Atomic Field Updaters**

- » but beware of less efficient operations over atomic field updaters
- » AtomicIntegerFieldUpdater:

```
private void fullCheck(T obj) {
    if (!tclass.isInstance(obj))
        throw new ClassCastException();
    if (cclass != null)
        ensureProtectedAccess(obj);
}
public boolean compareAndSet(T obj, int expect, int update) {
    if (obj == null || obj.getClass() != tclass || cclass != null) fullCheck(obj);
    return unsafe.compareAndSwapInt(obj, offset, expect, update);
}
```

- » existing field updaters:
  - AtomicIntegerFieldUpdater
  - AtomicLongFieldUpdater
  - AtomicReferenceFieldUpdater
- » no array field updater exists

### **Atomic Complex Types**

- » AtomicMarkableReference
  - object reference along with a mark bit
- » AtomicStampedReference
  - object reference along with an integer "stamp"
- » <u>notes</u>:
  - useful for ABA problem
    - A -> B and B -> A, how can I know that A has been changed since the last observation?
  - doesn't use double-wide CAS (CAS2, CASX) -> much slower than simple atomic types due to **object allocation**

### **Atomic Complex Types – Larger Than 64-bits**

- » AtomicMarkableReference
  - object reference along with a mark bit
- » AtomicStampedReference
  - object reference along with an integer "stamp"

```
public class AtomicStampedReference<V> {
    private static class Pair<T> {
        final T reference;
        final int stamp;
        private Pair(T reference, int stamp) {
            this.reference = reference;
            this.stamp = stamp;
        static <T> Pair<T> of(T reference, int stamp) {
            return new Pair<T>(reference, stamp);
        3
    }
    private volatile Pair<V> pair;
                                     expectedReference,
   public boolean compareAndSet(V
                                     newReference,
                                 v
                                 int expectedStamp,
                                 int newStamp) {
       Pair<V> current = pair;
       return
            expectedReference == current.reference &&
            expectedStamp == current.stamp &&
            ((newReference == current.reference &&
              newStamp == current.stamp) ||
             casPair(current, Pair.of(newReference, newStamp)));
   }
```

# **Non-blocking Algorithms**

- » lock-free, block-less but not usually wait-free (note while loops)
  - based on CMPXCHG and LOCKed instructions
- » shared resources secured by locks:
  - high-priority thread can be blocked (e.g. interrupt handler)
  - parallelism reduced by coarse-grained locking (unfair locks)
  - fine-grained locking and fair locks increases overhead
  - can lead to **deadlocks**, **priority inversion** (low-priority thread holds a shared resource which is required by high-priority thread)
- » non-blocking algorithms properties:
  - outperform blocking algorithms because most of CMPXCHG succeeds on the first try
  - removes cost for synchronization, thread suspension, context switching
- » note: wait-free is mandatory mandatory for real-time systems

# Non-blocking stack (LIFO)

#### » Treiber's algorithm (1986)

```
static class Node<E> {
    final E item:
    Node<E> next;
    public Node(E item) { this.item = item; }
}
AtomicReference<Node<E>> head = new AtomicReference<Node<E>>();
public void push(E item) {
    Node<E> newHead = new Node<E>(item);
    Node<E> oldHead:
    do {
        oldHead = head.get();
       newHead.next = oldHead;
    } while (!head.compareAndSet(oldHead, newHead));
}
public E pop() {
    Node<E> oldHead;
    Node<E> newHead;
    do {
        oldHead = head.get();
        if (oldHead == null)
            return null:
        newHead = oldHead.next;
    } while (!head.compareAndSet(oldHead,newHead));
    return oldHead.item;
3
```







sequnce of removal-addition if address is reused cause ABA

# Thread-safe collections and maps

#### » blocking variants:

- static<T> Collection<T> Collections.synchronizedCollection(Collection<T> c)
- static<T> List<T> Collections.synchronizedList(List<T> list)
- static<K,V> Map<K,V> Collections.synchronizedMap(Map<K,V> m)
- static<T> Set<T> Collections.synchronizedSet(Set<T> s)
- also for SortedSet and SortedMap
- » non-blocking variants:
  - ConcurrentLinkedQueue (interface Collection, Queue):
    - E peek(), E poll(), add(E)
  - ConcurrentHashMap (interface Map):
    - putIfAbsent(K key, V value), remove(Object key, Object value)
    - replace(K key, V oldValue, V newValue)
  - ConcurrentSkipListMap (interface SortedMap), ConcurrentSkipListSet (interface SortedSet)

### ConcurrentHashMap

- » concurrent readability get, iterator
- » minimize update contention
  - initial concurrency level 16 (can be changed) # updating threads
    - initial insertion into empty bin uses CMPXCHG operation
    - later modifications are based on bin-based locks



### ConcurrentHashMap

- » table resizing (occupancy exceed load factor)
  - power of two expansions
    - same index or power of two index
  - reusing internal Node if next is not changed majority of cases
  - any thread can help resizing instead of block
  - Forward nodes to notify users about moved