6. Pipeline and Hazards

for lecturer: tutorial 6

Class outline

Fibonacci sequence
Transcribing C code into assembler
Simulation and debugging on a processor without a pipeline in the QtRvSim simulator.
Simulation and debugging on a processor with the pipeline enabled in the QtRvSim simulator.

What should I know before the class

You should understand the lecture on pipelining and hazards.

Program to demonstrate pitfalls of pipeline execution

.globl _start

.option norelax

.text
_start:

main:

    addi  x2,  x0, 10
    add   x11, x0, x2   // A : x11<-x2
    add   x12, x0, x2   // B : x12<-x2
    add   x13, x0, x2   // C : x13<-x2

la_auipc_inst_addr:
    la x5, varx  // x5 = (byte*) &varx;
    // The la pseudo-instruction is assembled as the following two instructions:
    //auipc x5, %pcrel_hi(varx) // load the upper part of the address
    //addi  x5, x5, %pcrel_lo(la_auipc_inst_addr) // append the lower part of the address
    // They compute the address relative to the PC; the absolute-address alternative is:
    //lui   x5, %hi(varx) // load the upper part of the address
    //addi  x5, x5, %lo(varx) // append the lower part of the address
    // It can be replaced by a single addi if varx is located below 0x800
    //addi  x5,  x0, varx

    lw    x1, 0(x5)     // x1 = *((int*)x5);
    add   x15, x0, x1   // D : x15<-x1
    add   x16, x0, x1   // E : x16<-x1
    add   x17, x0, x1   // F : x17<-x1
loop:
    ebreak
    beq    x0, x0, loop
    nop

.data
.org 0x400
varx:
	.word  0x1234

Trace the program step by step:

first on a CPU with the pipeline disabled,
then enable the pipeline but leave the hazard unit switched off. Propose rules that make the program execute as expected.
Finally, execute the program on a CPU with the hazard unit enabled, both with and without forwarding.

Remark: The data and instruction caches are not important here; both can be disabled.

Observe and analyze not only the results stored in registers but also any stall states and control signals when the hazard unit is activated.

When are the results of instructions A, B, C, D, E and F computed and stored into registers, and when are the results correct, that is, consistent with program order?
How many cycles are required to execute the whole program?

The number of required cycles can be read in the bottom right corner of the CPU window.

Question to analyze: If QtRvSim requires more cycles to execute the program with the pipeline enabled than without it, does that mean a pipelined processor is generally slower?
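When thinking about this question, it helps to separate the cycle count from the clock period. The sketch below is only an illustration: the function names, the five-stage depth and the clock periods in the usage note are assumptions, not values taken from QtRvSim, which reports cycles only.

```c
#include <assert.h>

/* Cycles needed by a non-pipelined CPU: one (long) cycle per instruction. */
unsigned long single_cycle_cycles(unsigned long n)
{
    return n;
}

/* Cycles needed by a pipeline with the given number of stages:
 * the first instruction needs `stages` cycles to pass through,
 * every further instruction completes one cycle later,
 * and every stall adds one extra cycle. */
unsigned long pipelined_cycles(unsigned long n, unsigned stages,
                               unsigned long stalls)
{
    assert(stages >= 1);
    if (n == 0)
        return 0;
    return (unsigned long)stages + (n - 1) + stalls;
}

/* Execution time in picoseconds = number of cycles * clock period. */
unsigned long exec_time_ps(unsigned long cycles, unsigned long period_ps)
{
    return cycles * period_ps;
}
```

For example, with a hypothetical 1000 ps period for the non-pipelined CPU and 250 ps for a five-stage pipeline, 14 instructions take 14 cycles and 14000 ps without the pipeline, but 20 cycles (including 2 stalls) and 5000 ps with it. More cycles do not by themselves mean a longer execution time.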

Design enhancement: Try to modify the program to make better use of pipelined execution. Is it possible to reduce the number of stalls, or even to reach a state where the program produces the expected results with the hazard unit switched off?

What shall we do today?

Write code that calculates the N-th Fibonacci number (for N > 2). The Fibonacci sequence is defined as follows:

F(n) = F(n-1) + F(n-2) for n > 1, with F(0) = 0 and F(1) = 1.

Here are the first few numbers in the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …

In your program you may use the following instructions:

Possible solution in C:

t0 = 5;  //  Set value of N
s0 = 0;  //  F(0)
s1 = 1;  //  F(1)
 
for(t1 = 2; t1 <= t0; t1++)
{
        t2 = s0 + s1;
        s0 = s1;
        s1 = t2;
}
 
while(1)
        ;   // Endless loop
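
The listing above uses register names as variables so that it maps directly onto assembler. To check the expected values on a PC before transcribing, the same loop can be wrapped in a function. This is a minimal sketch; the function name fib and the 32-bit overflow guard are additions for illustration and are not part of the assignment.

```c
#include <assert.h>
#include <stdint.h>

/* Iterative Fibonacci using the same variable roles as the listing above:
 * s0 holds F(i-2), s1 holds F(i-1), t2 is the temporary sum. */
uint32_t fib(uint32_t n)
{
    /* F(47) = 2971215073 is the largest Fibonacci number that fits in 32 bits. */
    assert(n <= 47);

    uint32_t s0 = 0;   /* F(0) */
    uint32_t s1 = 1;   /* F(1) */

    if (n == 0)
        return s0;

    for (uint32_t t1 = 2; t1 <= n; t1++) {
        uint32_t t2 = s0 + s1;
        s0 = s1;
        s1 = t2;
    }
    return s1;         /* for n == 1 the loop body never runs and F(1) is returned */
}
```

Calling fib(5) returns 5, which is the value register s1 should hold at the end of the assembler program when N is set to 5.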

Template:

.globl start

.option norelax

start:
// Place your code here

nop
.end start

Debug your code in the QtRvSim simulator in pipelined mode with the hazard unit switched off.

Compile your code with this pseudoinstruction, execute it in the QtRvSim simulator without the pipeline, and observe the differences. Then modify your code for the pipelined processor with the hazard unit disabled so that it produces the same values as on the processor without a pipeline.

Try to formulate rules that a compiler could follow to produce a program free of data and control hazards, that is, a program that gives the same results on the pipelined processor as in the QtRvSim simulator without the pipeline.
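
One possible rule for data hazards can be sketched as a small checker that counts how many nop instructions would have to be inserted into a straight-line program. It is only a starting point under stated assumptions: the Instr type and the count_nops function are hypothetical, the required distance of 3 instructions between a writer and a reader is taken from the comments in the linear code example in this tutorial, and control hazards and loads are not modeled separately.

```c
#include <assert.h>

/* One instruction reduced to its register numbers; 0 means x0 or "unused". */
typedef struct {
    int rd;    /* destination register */
    int rs1;   /* first source register */
    int rs2;   /* second source register */
} Instr;

/* Count the nops needed so that every reader of a register is at least
 * min_distance instruction slots behind the last writer of that register. */
int count_nops(const Instr *prog, int n, int min_distance)
{
    int last_write[32];
    int pos = 0;     /* slot index in the padded program */
    int total = 0;   /* number of nops inserted so far */

    assert(n >= 0 && min_distance >= 1);

    /* Pretend every register was written long enough before the program start. */
    for (int r = 0; r < 32; r++)
        last_write[r] = -min_distance;

    for (int i = 0; i < n; i++) {
        int src[2] = { prog[i].rs1, prog[i].rs2 };
        int need = 0;

        for (int k = 0; k < 2; k++) {
            if (src[k] == 0)
                continue;                       /* x0 never causes a hazard */
            int gap = pos - last_write[src[k]];
            if (gap < min_distance && min_distance - gap > need)
                need = min_distance - gap;
        }

        pos += need;                            /* nops go in front of this instruction */
        total += need;
        if (prog[i].rd != 0)
            last_write[prog[i].rd] = pos;
        pos++;
    }
    return total;
}
```

For the sequence addi x2, x0, 10 followed by three add instructions that read x2, the function reports 2 nops with min_distance set to 3: only the first reader is too close, and padding it also moves the later readers far enough away. Reordering independent instructions into those slots instead of nops is the next refinement to consider.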

For those with spare time

Modify your code to write the result (F(N) + 15) to memory at address 0x02 (using the sw instruction) and then read the value back into a register (using the lw instruction). Execute your program in the MipsPipeS and MipsPipeXL simulators. Observe the execution closely, especially the sw and lw instructions.

Questions:

Find out how the add instruction is executed.
Find out how the addi instruction is executed.
Find out how the lw instruction is executed.
Find out how the sw instruction is executed.
How many clock cycles does it take to determine the branch target address, and how can you find that out? (instructions beq and bne)

Linear code example for demonstration of the instructions advancing through pipeline

Available at path /opt/apo/pipe-test in the lab

.globl _start
.text
.set noat
.set noreorder

_start:
  nop
  nop
  nop
  nop
  nop
  addi t0,x0,0
  addi t1,x0,0
  addi t2,x0,0
  addi t3,x0,0
  addi t4,x0,0
  addi t5,x0,0
  addi t6,x0,0
  addi s1,x0,0x11
  addi s2,x0,0x22
  addi s3,x0,0x33
  addi s4,x0,0x44
  addi s5,x0,0x55
  addi s6,x0,0x66
  addi s7,x0,0x77
  addi s8,x0,0x88
  addi s9,x0,0x99
  nop
  nop
  nop
_test:
  addi t0,s1,0     // t0 register will be set to the value 0x11 after four cycles
  // the next instruction should change s1 to 0x33, but in QtRvSim without
  // the hazard unit the value still takes three cycles to propagate, so
  // the following two instructions read the previous s1 value from the register file
  addi s1,s1,0x22
  add  t1,x0,s1    // t1 is set to the old value 0x11 when the pipeline is enabled and the hazard unit is off
  add  t2,x0,s1    // t2 is set to the old value 0x11 if neither forwarding nor stalls are enabled
  add  t3,x0,s1    // the write of the new value to s1 has just finished, t3 is set to 0x33
  beq  x0,x0,skip
  // control-flow hazard:
  // the following instructions take effect even though they should be skipped
  add  t5,x0,s1    
  add  t6,x0,s1    
  add  s2,x0,s1    
  add  s3,x0,s1    
skip:
  nop
  nop
  ebreak
