6. Branch predictions, code optimization

for lecturer: tutorial 6

Class outline

Branch prediction - an introduction
Static branch prediction - an example
Fibonacci sequence - code optimization for MipsPipeS
Dynamic branch predictions - an example

What should I know before the class

To understand previous class.

Náplň cvičení

1. Branch predictions - an introduction

Motivation - why is it good to correctly predict if a branch condition will be fulfilled and what will be the target of a branch.

As shown on the figure above - in branches we can execute (and obtain correct result of) a condition in later phases of the pipeline - and therefore the delay slot lasts several clocks. So these clock will be wasted or we have to find instructions from before the branch to fill these clocks. Such instructions are not always easy to find. Fortunately there is one other approach possible - if we could predict the correct result of the condition and the target of the branch, we could start to execute instructions from the target of the branch and overcome the difficulty with choosing the instructions to fill the delay slot. Since the branch instructions are relatively frequent in the programs, the efficient prediction could greatly improve the instruction throughput of the computer system.

Branch prediction consists of two basic parts:

branch target speculation (predict target address of the branch),
branch condition speculation (if the condition of the branch is satisfied or not).

To predict the target of the branch, the BTB (Branch Target Buffer) is generally used. The BTB is an associative cache containing two items: BIA (Branch Instruction Address) and BTA (Branch Target Address). The BTB is accessed during the Instruction Fetch pipeline stage. If the BIA item matches the PC (Program Counter) and we predict that the condition is satisfied, the PC is set to the BTA from BTB.

We can predict if the branch condition is satisfied or not using three approaches - statically, dynamically or hybrid. The most simple way to predict branch condition is assumption that the branch condition will not be satisfied and the program will continue in its execution as is. A little bit better prediction is BTFNT (Backwards Taken / Forwards Not-Taken) prediction.

Consider following pipelined processor:

The first method for branch prediction we will show is to flush (throw away) instruction behind the branch and to stall the pipeline. The pipeline is staled until the branch instruction reaches the MEM stage of the pipeline (in this stage the new value of the PC is finally known). We can illustrate the flow of instruction through the pipeline by following table:

	———	———	———	———	———	———	———	———	———	———
branch instruction	IF	ID	EX	MEM	WB
next instruction		IF(flush)	stall	stall	IF	ID	EX	MEM	WB
next instruction + 1						IF	ID	EX	MEM	WB

Pipeline stall occurs in ID stage (when we have identified the branch instruction) and lasts 3 clocks. Pipeline stall of 3 clocks is a considerable loss of performance. If we assume that branches are taken with 30% probability and ideal CPI = 1, then we will reach speedup only (2,63x against non-pipelined processor). The ideal speed up should be 5x. This branch prediction causes a big lost of performance.

Note that as a consequence of branch instruction we have execute IF once more - IF stage restart. We can not continue with instruction fetched in the second clock. We have fetched an instruction behind the branch instruction and if branch condition is satisfied, we should not execute this instruction.

We can simulate pipeline stalls in MipsPipeS simulator by inserting 3 NOP instruction after each branch instruction. By moving the branch condition evaluation into ID stage, we can shorten the stalls from 3 stalls to 1. See MipsPipeXL simulator ().

Static Predictions:

Next way how to predict branches is a Predict Not Taken. In this scheme we will continue with execution as if the branch condition satisfied and the branch is not taken. But important detail is, that the inner state of the processor (registers, flags) is not affected until the result of the branch instruction is known.

Now we will assume that the result of the branch is known in the ID stage of pipeline - as in MipsPipeXL ().

Program execution if we should not take branch (condition is not satisfied):

	——	——	——	——	——	——	——
branch instruction	IF	ID	EX	MEM	WB
next instruction		IF	ID	EX	MEM	WB
next instruction + 1			IF	ID	EX	MEM	WB

Program execution if we should take branch (condition is satisfied):

	——	——	——	——	——	——	——	——
branch instruction	IF	ID	EX	MEM	WB
next instruction		IF	idle	idle	idle	idle
target instruction			IF	ID	EX	MEM	WB
target instruction +1				IF	ID	EX	MEM	WB

If branch instruction does not branch (determined in ID stage) we can continue without any delay. If the branch condition is satisfied (and we should take the branch), we will restart the ID stage and set the new address. This will hold the execution for one clock (and causes 1 stall). Note that the instruction is flushed from the pipeline only if the branch condition is satisfied and the branch is taken.

Delayed Branch - is the next method for solving the difficulties with the branch instructions. We can define so called branch delay slot just behind the branch instruction. Any instruction in this branch delay slot will be executed no matter if we take the branch or not. Therefore if we put in there an instruction from before the branch instruction, we can improve the instruction throughput.

Execution of the program if the branch condition is not satisfied (delay slot for one instruction):

	——	——	——	——	——	——	——	——
branch instruction	IF	ID	EX	MEM	WB
next instruction (delay slot)		IF	ID	EX	MEM	WB
next instruction + 1			IF	ID	EX	MEM	WB
next instruction + 2				IF	ID	EX	MEM	WB

Execution of the program if the branch condition is satisfied (delay slot for one instruction):

	——	——	——	——	——	——	——	——
branch instruction	IF	ID	EX	MEM	WB
next instruction (delay slot)		IF	ID	EX	MEM	WB
target instruction			IF	ID	EX	MEM	WB
target instruction + 1				IF	ID	EX	MEM	WB

The goal of the compiler is to ensure, that instruction in the branch delay slot are useful and should be executed. On the following pictures are three alternatives how to fill in the branch delay slot.

The picture on the left shows how the delay slot was filled by independent instruction from before the branch instruction. Pictures in the middle and on the right shows possible solutions if it is not possible to find any independent instruction. In both examples it is not possible to use the left solution because of data dependency in the R1 register. Note that it is necessary to maintain correctness of the program and that the original and compiled code are not equivalent!

2. Static branch prediction - an example

Lets take a look on following code fragment and suppose, that a value s2 = s1 + 796 and s1 contains a memory address (beginning of some array).

loop: 	
  lw    s0, 0(s1)     // s0 <- Mem[s1]
  addi  s0, s0, 0x1   // s0++
  sw    s0, 0(s1)     // Mem[s1] <- s0
  addi  s1, s1, 0x4   // s1 = s1 + 4
  sub   t0, s2, s1    // t0 = s2 - s1
  bnez  t0, loop      // if (t0 != 0) goto loop

Task: Determine a number of clocks needed to execute above code fragment in following situations (assume that all memory accesses are cache hits):

Processor have no forwarding, but values can be written to and read from register field at once (same clock). The branch instruction is implemented using pipeline flush. Target address is determined in EX stage (as implemented in MipsPipeS ).
Processor is similar to MipsPipeXL simulator , that means: results can be forwarded between different pipeline stages. Branch instructions are assumed to be not taken.
Processor is similar to MipsPipeXL simulator . In this case you can reorder the instructions to fill in the delay slot after the conditional branch insruction (single-cycle delayed branch).

3. Fibonacci sequence - code optimisation

In the last tutorial we have written a program to compute Fibonacci sequence for “classic” MIPS processor. Then we have modified it by inserting NOP instruction to ensure the correct execution order even in pipelined processor who do not solve data hazards (MipsPipeS). You should end up with a program similar to following:

#define t0 $8
#define t1 $9
#define t2 $10
 
#define s0 $16
#define s1 $17
#define s2 $18
 
.globl start
.set noat
.set noreorder
.ent start
 
start:
addi t0, $0, 0x5     //  t0 = 5;  //  Nastaveni hodnoty N
addi s0, $0, 0x0     //  F(0)
addi s1, $0, 0x1     //  F(1)
                     
addi t1, $0, 0x2     //  t1 = 2;
nop                  
                     
loop:                //  {
add t2, s0, s1       //     t2 = s0 + s1;
addi s0, s1, 0x0     //     s0 = s1;
nop                  //
addi s1, t2, 0x0     //     s1 = t2;
addi t1, t1, 0x1     //     t1++;
nop                  //
nop                  //  }
bne t0, t1, loop     //  test t1 < t0
nop
nop
nop
 
addi s1, s1, 15      // s1 += 15;
 
inf_loop:
beq $0,$0, inf_loop  
 
nop
.end start

To compute N-th Fibonacchi number we have executed 6*(N-2)+1 NOP instructions. In our case of N=5, it means 19 useless NOP instruction. In total CPU has executed 43 instructions to calculate the result and to write it into S1 register.

Task: Reorder instructions in program to get the same result in shorter time (with less instruction executed). As a hint you can use the above scheme illustrating how to fill the Delay Branch Slot. The other way how to minimize number of NOP instructions is to reorder the instructions. The instructions are executed in order different to order in which they are written in the original program. The result must remain the same. The reason to do this is to push depending instructions away from each other to break the data hazard.

Use the MipsPipeS simulator!

Note: The simplest way to find the number of instructions executed is to add “Hit count” and “Miss count” in instruction cache.

4. Dynamic branch prediction - an example

Consider following example - the CPU is executing a program and loads a branch instruction. The prediction of the test result is the same as in the previous execution of this branch instruction.We have for this reason a Branch History Table (BHT). BHT stores the last result of the branch instruction. Each executed branch instruction has record in BHT. This prediction can be described using Moore's finite state automaton:

You can see, that one bit is enough to express the state of the BHT. In the beginning the BHT is in the “Not Taken” state, where we predict that the branch is not taken. If the prediction was correct, we will stay in this state. If the prediction was wrong we will change the prediction to “Taken” and when this branch instruction is reached again, the prediction will be “Taken”.

A bit more advanced approach to branch prediction is takes into account longer history - as shown on figure below. In the beginning we start in “Strongly not taken” state“. When in “Strongly not taken” a “Weakly not taken” states we predict that branch will not be taken. In this case we need 2 bits to represents all 4 states.

Task: Lets have following code fragment. Count number of wrong predictions for following cases:

BHT (branch history table) with 1-bit counter
BHT (branch history table) with 2-bit counter

int i, j, c=0;
 
for (i=0; i<500; i++) {
  for (j=0; j<4; j++) {
     c++;
  }
}

Code fragment:

  // i = s0, j = s1, c = s2, t0 = outer loop limits, t1 = inner loop limit, t3 - auxiliary variable

  addi s1, $0, 0     // c = 0;
  addi t0, $0, 500   // t0 = 500;
  addi t1, $0, 4     // t1 = 4;
  
  addi s0, $0, 0     // i = 0;
L1:
  addi s1, $0, 0     // j = 0;
L2: 
  addi s2, s2, 1     // c++; 
  
  addi s1, s1, 1     // j++;
  slt  t3, s1, t1    // t3 = (s1 < t1) ? 1 : 0;
  bne  t3, $0, L2
  addi s0, s0, 1     // i++;
  slt  t3, s0, t0    // t3 = (s0 < t0) ? 1 : 0;
  bne  t3, $0, L1

Table of Contents