## Computer Architectures

# Real Numbers and Computer Memory Pavel Píša, Richard Šusta 

Michal Štepanovský, Miroslav Šnorek



Czech Technical University in Prague, Faculty of Electrical Engineering

English version partially supported by:
European Social Fund Prague \& EU: We invest in your future.


## Speed of Arithmetic Operations

| Operation | C language operator |
| :--- | :--- |
| Bitwise complement (negation) | `~x` |
| Multiply and divide by $2^{n}$ | `x << n`, `x >> n` |
| Increment, decrement | `++x`, `x++`, `--x`, `x--` |
| Negate ← complement + increment | `-x` |
| Addition | `x + y` |
| Subtraction ← negation + addition | `x - y` |
| Multiply on hardware multiplier | `x * y` |
| Multiply on sequential multiplier/SW | `x * y` |
| Division | `x / y` |

## Multiply/Divide by 2

## Logical Shift

## Arithmetic Shift

Multiply by 2


## Divide by 2 unsigned



C represents Carry Flag, it is present only on some processors: x86/ARM yes, MIPS no

## Divide by 2 signed



## Barrel Shifter



Barrel shifter can be used for fast variable shifts
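The shift variants named in the headings above can be sketched in C on 8-bit values. This is only an illustration: the helper names are hypothetical, and the arithmetic shift is modelled on unsigned values because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Multiply by 2: shift left; the bit leaving the MSB is lost here
   (a processor with a carry flag would capture it in C). */
uint8_t mul2(uint8_t x) { return (uint8_t)(x << 1); }

/* Divide by 2, unsigned: logical shift right, the MSB is filled with 0. */
uint8_t div2_unsigned(uint8_t x) { return (uint8_t)(x >> 1); }

/* Divide by 2, signed: arithmetic shift right, the sign bit is replicated.
   The argument and result are 8-bit two's complement bit patterns. */
uint8_t div2_signed(uint8_t x) { return (uint8_t)((x >> 1) | (x & 0x80u)); }
```

For example, `div2_signed(0xF8)` (the pattern of -8) gives `0xFC` (-4). An arithmetic shift rounds toward minus infinity, so -7 becomes -4, whereas integer division in C99 and later rounds toward zero.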

## Overflow of Unsigned Number Binary Representation

- The carry from MSB (the most significant bit) is observed in this case
- The arithmetic result is incorrect because it is out of range.

For 5 bit representation:


|  | $2^{4}$ | $2^{3}$ | $2^{2}$ | $2^{1}$ | $2^{0}$ |
| :--- | ---: | ---: | ---: | ---: | ---: |
| 28 | 1 | 1 | 1 | 0 | 0 |
| + 21 | 1 | 0 | 1 | 0 | 1 |
| ? 17 | 1 | 0 | 0 | 0 | 1 |

The incorrect result is smaller than each of the addends
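A minimal sketch of the 5-bit example in C, assuming the carry out of the MSB is obtained by doing the addition in a wider type. The function name `add5_unsigned` is illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define WIDTH 5
#define WIDTH_MASK ((1u << WIDTH) - 1u)   /* 0x1F */

/* Adds two 5-bit unsigned values; *carry receives the carry out of the MSB. */
uint8_t add5_unsigned(uint8_t a, uint8_t b, bool *carry)
{
    unsigned sum = (a & WIDTH_MASK) + (b & WIDTH_MASK);
    *carry = ((sum >> WIDTH) & 1u) != 0;   /* bit 5 does not fit into the result */
    return (uint8_t)(sum & WIDTH_MASK);    /* only the 5 low bits are kept */
}
```

With `a = 28` and `b = 21` the function returns 17 and sets the carry, because 49 does not fit into 5 bits (49 - 32 = 17).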

## Overflow of Signed Binary Representation

- The result is incorrect because its numeric value is outside the range that can be represented with the given number of digits
- It manifests as a result sign that differs from the sign of the addends when both addends have the same sign, and
- the carry into the MSB differs from the carry out of the MSB (their exclusive-or (xor) is 1).

For 5 bit representation:
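The carry-based rule can be tried out with a small C sketch. It works on 5-bit two's complement bit patterns stored in the low bits of a byte; the function name and the way both carries are computed are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adds two 5-bit two's complement bit patterns (values 0..31).
   Overflow is reported when the carry into the MSB (bit 4)
   differs from the carry out of the MSB. */
uint8_t add5_signed(uint8_t a, uint8_t b, bool *overflow)
{
    unsigned low = (a & 0x0Fu) + (b & 0x0Fu);   /* sum of the four low bits */
    unsigned carry_in = (low >> 4) & 1u;        /* carry into the MSB */
    unsigned sum = (a & 0x1Fu) + (b & 0x1Fu);
    unsigned carry_out = (sum >> 5) & 1u;       /* carry out of the MSB */
    *overflow = (carry_in ^ carry_out) != 0;
    return (uint8_t)(sum & 0x1Fu);
}
```

For 9 + 8 the result pattern is `10001` (which reads as -15) and overflow is reported. For 5 + (-3), that is patterns 5 and 29, both carries are 1, so there is no overflow and the result is 2.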



## Sign Extension

## Example:

`short int x = 15213; int ix = (int) x;`

`short int y = -15213; int iy = (int) y;`

|  | Decimal | Hex | Binary |
| :--- | ---: | :--- | :--- |
| x | 15213 | 3B 6D | 00111011 01101101 |
| ix | 15213 | 00 00 3B 6D | 00000000 00000000 00111011 01101101 |
| y | -15213 | C4 93 | 11000100 10010011 |
| iy | -15213 | FF FF C4 93 | 11111111 11111111 11000100 10010011 |
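What the conversion does at bit level can be sketched as follows. The helper `sign_extend16` is a hypothetical name; in real code the cast `(int) x` performs the same operation.

```c
#include <stdint.h>

/* Extends a 16-bit two's complement pattern to 32 bits by copying bit 15
   into all upper bits. */
uint32_t sign_extend16(uint16_t v)
{
    uint32_t wide = v;            /* upper 16 bits are zero so far */
    if (v & 0x8000u)
        wide |= 0xFFFF0000u;      /* negative value: fill upper bits with ones */
    return wide;
}
```

`sign_extend16(0x3B6D)` gives `0x00003B6D` and `sign_extend16(0xC493)` gives `0xFFFFC493`, matching the rows of the table.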

## Hardware Divider - Simple Sequential Algorithm

## Non-restoring division

| Step for $7 / 3$ | Comment |
| :--- | :--- |
| $7-4 \cdot 3=-5$ | result is negative |
| $-5+2 \cdot 3=1$ | non-restoring step, $=7-2 \cdot 3$ |
| $1-3=-2$ | result is negative |
| $-2+3=1$ | restoring step |

Restoring is required only for the last operation.
## 111 : 011

## Hardware divider logic (32b case)

dividend $=$ quotient $\times$ divisor + remainder



## Algorithm of the sequential division

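```
#include <stdint.h>
/* Sequential division of n-bit unsigned numbers (1 <= n <= 32).
   Condition: divisor is not 0! Returns the quotient (MQ), stores the remainder (AC). */
uint32_t seq_div(uint32_t dividend, uint32_t divisor, int n, uint32_t *remainder)
{
    uint64_t mask = (1ull << n) - 1;
    uint64_t MQ = dividend & mask, B = divisor, AC = 0;
    for (int i = 1; i <= n; i++) {
        /* SL: shift the AC:MQ pair by one bit to the left, the LSB is kept at zero */
        AC = (AC << 1) | (MQ >> (n - 1));
        MQ = (MQ << 1) & mask;
        if (AC >= B) {
            AC = AC - B;
            MQ |= 1;    /* the LSB of the MQ register is set to 1 */
        }
    }
    *remainder = (uint32_t)AC;
    return (uint32_t)MQ;
}
```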

$\rightarrow$ Value of MQ register represents quotient and AC remainder

## Example of $\mathrm{X} / \mathrm{Y}$ division

| Dividend $x=1010$ and divisor $y=0011$ |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| i | operation | AC | MQ | B | comment |
|  |  | 0000 | 1010 | 0011 | initial setup |
| 1 | SL | 0001 | 0100 |  |  |
|  | nothing | 0001 | 0100 |  | the if condition not true |
| 2 | SL | 0010 | 1000 |  |  |
|  |  | 0010 | 1000 |  | the if condition not true |
| 3 | SL | 0101 | 0000 |  | $r \geq y$ |
|  | $A C=A C-B ; M Q_{0}=1 ;$ | 0010 | 0001 |  |  |
| 4 | SL | 0100 | 0010 |  | $r \geq y$ |
|  | $A C=A C-B ; \quad M Q_{0}=1 ;$ | 0001 | 0011 |  | end of the cycle |

$x: y=1010: 0011=0011$, remainder $0001$ ($10: 3=3$, remainder $1$)

## *Real Numbers

and their representation in computer

## Higher Dynamic Range for Numbers (REAL/float)

- Scientific notation, semi-logarithmic, floating point
- The value is represented by:
  - EXPONENT (E) - represents the scale of the given value
  - MANTISSA (M) - represents the value within that scale
  - the sign(s) are usually stored separately as well
- Mantissa $\times$ base ${ }^{\text {Exponent }}$
- Normalized notation
  - The exponent and mantissa are adjusted so that the mantissa is held in some standard range, usually $\langle 1$, base)
  - When base $z=2$ is considered, the mantissa range is $\langle 1,2)$ or alternatively $\langle 0.5,1)$.
- Decimal representation: $7.26478 \times 10^{3}$
- Binary representation: $1.010011 \times 2^{1001}$


## Fractional Binary Numbers/Fixed Point

They can be used directly or as base for mantissa of float


Real number representation in fixed point (fractional numbers)
Bits following "binary point" specify fractions in power two series

## Fixed Point Examples

| Value | Representation |
| :--- | :--- |
| $5+3/4$ | $101.11_{2}$ |
| $2+7/8$ | $10.111_{2}$ |
| $63/64$ | $0.111111_{2}$ |

Operations:

- Divide by $2 \rightarrow$ shift right
- Multiply by $2 \rightarrow$ shift left

Numbers $0.111111 \ldots{ }_{2}$ are smaller than 1.0

$$
1 / 2+1 / 4+1 / 8+\ldots+1 / 2^{i}+\ldots \rightarrow 1.0
$$

Exact notation $\rightarrow 1.0-\varepsilon$
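A small fixed-point sketch in C, assuming 6 bits after the binary point (the choice of `FRAC_BITS` and the function names are illustrative, not taken from the slides). The stored integer is simply the value multiplied by $2^{6}$.

```c
#include <stdint.h>

#define FRAC_BITS 6   /* assumed number of bits after the binary point */

/* Value of a fixed-point number = raw / 2^FRAC_BITS. */
double fx_to_double(uint16_t raw) { return (double)raw / (1u << FRAC_BITS); }

/* Multiplying and dividing by 2 are plain shifts of the raw value. */
uint16_t fx_mul2(uint16_t raw) { return (uint16_t)(raw << 1); }
uint16_t fx_div2(uint16_t raw) { return (uint16_t)(raw >> 1); }
```

The pattern `101.110000` is stored as 368 and reads back as 5.75; shifting it right by one gives `10.111000`, that is 2.875 = 2 + 7/8. The pattern `0.111111` is stored as 63 and reads back as 63/64.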

## Binary and Decimal Real Numbers Examples

$23.47=2 \times 10^{1}+3 \times 10^{0}+4 \times 10^{-1}+7 \times 10^{-2}$
$\uparrow$ decimal point
$10.01_{\text {two }}=1 \times 2^{1}+0 \times 2^{0}+0 \times 2^{-1}+1 \times 2^{-2}$
$\uparrow$ binary point

$$
\begin{aligned}
& =1 \times 2+0 \times 1+0 \times 1 / 2+1 \times 1 / 4 \\
& =2+0.25=2.25
\end{aligned}
$$

## Scientific Notation and Binary Numbers

## Decimal number:

$-123000000000000 \rightarrow-1.23 \times 10^{14}$
$0.000000000000000123 \rightarrow+1.23 \times 10^{-16}$

Binary number:
$110110000000000 \rightarrow 1.1011 \times 2^{14}=27648_{10}$
$-0.00000000000000011101 \rightarrow-1.1101 \times 2^{-16}$
$=-2.765655517578125 \times 10^{-5}$

## Standardized Format for REAL Type Numbers

- Standard IEEE-754 defines the following REAL representations and precisions
- single-precision - declared in the C language as float
  - uses 32 bits $(1+8+23)$ to represent a number
- double-precision - C language double
  - uses 64 bits $(1+11+52)$ to represent a number
- the current standard (IEEE 754-2008) adds half-precision float (16 bits), mainly for graphics and neural networks, and quadruple-precision (128 bits) and octuple-precision (256 bits) for special scientific computations


## The Representation/Encoding of Floating Point Number

- Mantissa encoded as sign and absolute value (magnitude), equivalent to the direct representation
- Exponent encoded in biased representation (K=+127 for single precision, +1023 for double)
- The leading one is implicit and need not be stored thanks to the normalization $m \in\langle 1,2)$: 23 stored bits + 1 implicit bit for single precision

$$
\begin{array}{ll}
X=(-1)^{s} \cdot 2^{A(E)-127} \cdot m & \text { where } m \in\langle 1,2) \\
& m=1+2^{-23} M
\end{array}
$$

Sign of $M$


Radix point position for E and M
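The three fields can be extracted from a C `float` with a few shifts and masks. The sketch below assumes that `float` is the IEEE-754 single-precision format (true on mainstream platforms, but not guaranteed by the C standard); the type and function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* s: 1 bit */
    uint32_t exponent;  /* A(E): 8 bits, biased by 127 */
    uint32_t fraction;  /* M: 23 stored bits, the hidden leading 1 is not included */
} float_fields;

float_fields float_decode(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the bit pattern, no value conversion */
    float_fields f = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return f;
}
```

For `0.75f` this yields sign 0, exponent 126 and fraction `0x400000`, that is $(-1)^{0} \cdot 2^{126-127} \cdot 1.5$.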

## ANSI/IEEE Std 754-1985 - 32b and 64b Formats

ANSI/IEEE Std 754-1985 - single precision format - 32b


ANSI/IEEE Std 754-1985 - double precision format - 64b

$$
g \ldots 11 b \quad f \ldots 52 b
$$

IEEE 754-2008 - half precision format - 16b

$$
g \ldots 5 b \quad f \ldots 10 b
$$

## Examples of (De)Normalized Numbers in Base 10 and 2



## IEEE 754 - Conversion Examples

## Find IEEE-754 float representation of $-12.625_{10}$

- Step \#1: convert $-12.625_{10}=-1100.101_{2}=-101 / 8$
- Step \#2: normalize $-1100.101_{2}=-1.100101_{2} \times 2^{3}$
- Step \#3: fill in the fields

Fill the sign field, negative in this case -> S=1.
Exponent + 127 -> 130 -> 10000010.
The first mantissa bit 1 is the hidden one ->

### 1 10000010 10010100000000000000000

Alternative approach: separate the sign, find the floor of the binary logarithm of the absolute value, compute the equivalent power of two and divide the number by it (the result is normalized). Subtract one, then repeat: multiply by two; if the result is at least 1, subtract one and append 1 to the result, otherwise append 0.

## Example 0.75

$$
\begin{aligned}
& 0.75_{10}=0.11_{2}=1.1 \times 2^{-1}=3 / 4 \\
& 1.1=1 . F \rightarrow F=1 \\
& E-127=-1 \rightarrow E=127-1=126=01111110_{2} \\
& S=0
\end{aligned}
$$

$00111111010000000000000000000000=$ 0x3F400000

## Example $0.1_{10}$ - Conversion to Float

$$
\begin{aligned}
0.1_{10} & =0.000110011 \ldots_{2} \\
& =1.10011 \ldots_{2} \times 2^{-4}=1 . \mathrm{F} \times 2^{\mathrm{E}-127}
\end{aligned}
$$

$F=10011 \ldots \quad-4=E-127$
$E=127-4=123=01111011_{2}$

0 01111011 10011001100110011001100 1100... -> 0x3DCCCCCD, why is the last digit a D?
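The final D comes from rounding: the bits cut off after the 23rd fraction bit (`1100...`) amount to more than half of the last stored place, so round-to-nearest rounds the fraction up. A sketch that rebuilds the pattern with integer arithmetic only; the function names are illustrative and `float` is assumed to be IEEE-754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Builds the single-precision pattern of 0.1 from integers.
   0.1 = m * 2^-4 with m in <1,2), so the 24-bit significand is 0.1 * 2^27. */
uint32_t tenth_bits(int round_to_nearest)
{
    uint64_t scaled = 1ull << 27;             /* 2^27 */
    uint32_t sig = (uint32_t)(scaled / 10);   /* truncated significand 0xCCCCCC */
    uint32_t rem = (uint32_t)(scaled % 10);   /* discarded part, here 8/10 of the last place */
    if (round_to_nearest && rem >= 5)
        sig += 1;                             /* 0xCCCCCC -> 0xCCCCCD */
    return (123u << 23) | (sig & 0x7FFFFFu);  /* E = 127 - 4 = 123, hidden 1 dropped */
}

/* Bit pattern of a float as stored by the compiler. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

Truncation gives `0x3DCCCCCC`, rounding gives `0x3DCCCCCD`, and the latter is what `float_bits(0.1f)` returns.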


## Often Inexact Floating Point Number Representation

Decimal number with finite expansion $\rightarrow$ infinite binary expansion. Examples:

$$
\begin{aligned}
& 0.1_{10} \rightarrow 0.2 \rightarrow 0.4 \rightarrow 0.8 \rightarrow 1.6 \rightarrow 3.2 \rightarrow 6.4 \rightarrow 12.8 \rightarrow \ldots \\
& 0.1_{10}=0.00011001100110011 \ldots_{2}=0.0 \overline{0011}_{2} \text { (infinite bit stream) }
\end{aligned}
$$

More bits only enhance the precision of the $0.1_{10}$ representation
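The repeated multiplication by two can be written as a short C routine. To keep the arithmetic exact, the sketch takes the fraction as an integer ratio `num/den`; the function name and this interface are assumptions made for illustration.

```c
#include <stddef.h>

/* Writes the first `count` binary digits after the binary point of num/den
   (0 <= num < den) into out, which must hold count + 1 characters. */
void fraction_bits(unsigned num, unsigned den, size_t count, char *out)
{
    for (size_t i = 0; i < count; i++) {
        num *= 2;              /* move one binary place to the left */
        if (num >= den) {      /* the integer part became 1 */
            out[i] = '1';
            num -= den;        /* keep only the fractional part */
        } else {
            out[i] = '0';
        }
    }
    out[count] = '\0';
}
```

For 1/10 the first 16 digits are `0001100110011001`; the group `0011` keeps repeating, so no finite number of bits is exact. For 3/4 the digits are `1100`, a finite expansion.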

## Real Number Representation - Limitations

## Limitation

Only numbers of the form $x / 2^{k}$ can be represented exactly; all others are stored inexactly.

| Value | Representation |
| :--- | :--- |
| $1/3$ | $0.0101010101[01] \ldots_{2}$ |
| $1/5$ | $0.001100110011[0011] \ldots_{2}$ |
| $1/10$ | $0.0001100110011[0011] \ldots_{2}$ |

## Special Values - Not a Number ( NaN ) and Infinity

- If the result of a mathematical operation is not defined, such as $\log (-1)$, or is ambiguous, such as 0/0 or +Inf + (-Inf), then the value NaN (Not-a-Number) is stored: the exponent is set to all ones and the mantissa is nonzero.

| Value | Sign | Exponent | Mantissa | Represents |
| :--- | :--- | :--- | :--- | :--- |
| positive | 0 | 11111111 | mantissa != 0 | NaN |

- If the result only overflows the range, or infinity is on input (X + +Inf), and the result sign is unambiguous, then the result is infinity

| Value | Sign | Exponent | Mantissa | Represents |
| :--- | :--- | :--- | :--- | :--- |
| positive | 0 | 11111111 | 00000000000000000000000 | +Inf |
| negative | 1 | 11111111 | 00000000000000000000000 | -Inf |
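Both special values can be observed from C. The sketch assumes an IEEE-754 conforming `float` and uses `volatile` only to keep the compiler from evaluating the expressions at compile time; the function names are illustrative.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Overflow of the range gives infinity with an unambiguous sign. */
bool overflow_gives_inf(void)
{
    volatile float big = FLT_MAX;
    float r = big * 2.0f;
    return isinf(r) && r > 0.0f;
}

/* +Inf + (-Inf) is ambiguous, so the result is NaN.
   NaN compares unequal to everything, including itself. */
bool inf_minus_inf_is_nan(void)
{
    volatile float inf = INFINITY;
    float r = inf + (-inf);
    return isnan(r) && r != r;
}
```

Both functions return `true` on a platform with IEEE-754 arithmetic.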

## Implied (Hidden) Leading 1 bit

- The most significant bit of the mantissa is one for every normalized number, so it is not stored in the representation of normalized numbers
- If the exponent representation is zero, the encoded value is zero or a denormalized number; the most significant bit then has to be stored explicitly, and zero is assumed at the usual hidden-one position
- Denormalized numbers keep resolution in the range between the smallest normalized number and zero, but computation with a denormalized operand is more complex. Some coprocessors do not support denormalized numbers, and emulation is required to fulfil the strict IEEE-754 requirements. Intel coprocessors support denormalized numbers


## Underflow/Loss of Precision for IEEE-754 Representation

- The case where the stored value is not zero but is smaller than the smallest number that can be represented in normalized form
- Direct underflow to zero can be prevented by extending the representation range with denormalized numbers



## Representation of the Fundamental Values

## Zero

| Value | Sign | Exponent | Mantissa | Represents |
| :--- | :--- | :--- | :--- | :--- |
| Positive zero | 0 | 00000000 | 00000000000000000000000 | +0.0 |
| Negative zero | 1 | 00000000 | 00000000000000000000000 | -0.0 |

Infinity

| Value | Sign | Exponent | Mantissa | Represents |
| :--- | :--- | :--- | :--- | :--- |
| Positive infinity | 0 | 11111111 | 00000000000000000000000 | +Inf |
| Negative infinity | 1 | 11111111 | 00000000000000000000000 | -Inf |

Representation corner values (* marks either sign)

| Value | Sign | Exponent | Mantissa | Represents |
| :--- | :--- | :--- | :--- | :--- |
| Smallest normalized | * | 00000001 | 00000000000000000000000 | $\pm 2^{(1-127)} \approx \pm 1.1755 \times 10^{-38}$ |
| Biggest denormalized | * | 00000000 | 11111111111111111111111 | $\pm\left(1-2^{-23}\right) \cdot 2^{(1-127)}$ |
| Smallest denormalized | * | 00000000 | 00000000000000000000001 | $\pm 2^{-23} \cdot 2^{-126} \approx \pm 1.4013 \times 10^{-45}$ |
| Max. value | 0 | 11111110 | 11111111111111111111111 | $\left(2-2^{-23}\right) \cdot 2^{127} \approx+3.4028 \times 10^{+38}$ |
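The corner patterns can be cross-checked against the constants of `<float.h>`. The helper below is a hypothetical name and assumes that `float` is the IEEE-754 single-precision format and that denormalized numbers are not flushed to zero.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Builds a float from its 32-bit pattern (sign | exponent | mantissa). */
float float_from_bits(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Under these assumptions `float_from_bits(0x00800000)` equals `FLT_MIN` (smallest normalized), `float_from_bits(0x7F7FFFFF)` equals `FLT_MAX`, and `float_from_bits(0x007FFFFF)`, the biggest denormalized number, lies just below `FLT_MIN`.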

## The Table in Another Format



Figure: Floating-point Binary

## Some Features of ANSI/IEEE Standard Floating-point Formats

| Feature | Single/Float | Double/Long |
| :--- | :--- | :--- |
| Word width in bits | 32 | 64 |
| Significand in bits | $23+1$ hidden | $52+1$ hidden |
| Significand range | $\left[1,2-2^{-23}\right]$ | $\left[1,2-2^{-52}\right]$ |
| Exponent bits | 8 | 11 |
| Exponent bias | 127 | 1023 |
| Zero $( \pm 0)$ | $e+$ bias $=0, f=0$ | $e+$ bias $=0, f=0$ |
| Denormal | $e+$ bias $=0, f \neq 0$ <br> represents $\pm 0 . f \times 2^{-126}$ | $e+$ bias $=0, f \neq 0$ <br> represents $\pm 0 . f \times 2^{-1022}$ |
| Infinity $( \pm \infty)$ | $e+$ bias $=255, f=0$ | $e+$ bias $=2047, f=0$ |
| Not-a-number $($ NaN $)$ | $e+$ bias $=255, f \neq 0$ | $e+$ bias $=2047, f \neq 0$ |
| Ordinary number | $e+$ bias $\in[1,254]$ <br> $e \in[-126,127]$ <br> represents $1 . f \times 2^{e}$ | $e+$ bias $\in[1,2046]$ <br> $e \in[-1022,1023]$ <br> represents $1 . f \times 2^{e}$ |
| min | $2^{-126} \cong 1.2 \times 10^{-38}$ | $2^{-1022} \cong 2.2 \times 10^{-308}$ |
| $\max$ | $\cong 2^{128} \cong 3.4 \times 10^{38}$ | $\cong 2^{1024} \cong 1.8 \times 10^{308}$ |

## IEEE-754 Formats



Source: Herbert G. Mayer, PSU

## X86 Extended Precision Format (80-bits)



## The leading 1 bit is not hidden in the mantissa!

## Advanced readers note:

- Intel processors integrate the arithmetic coprocessor on a single chip with the processor (since the Intel 80486); it computes float and double expressions internally in "extended precision", and the results are rounded to float/double when stored.
- Streaming SIMD Extensions (SSE) instructions (vector operations), available from the Intel Pentium III on, provide at most double precision, so the result rounding/precision can depend on the compiler's instruction selection

## IEEE-754 Special Values Summary

| sign bit | Exponent <br> representation | Mantissa | Represented value/meaning |
| :--- | :--- | :--- | :--- |
| 0 | $0<e<255$ | any value | normalized positive number |
| 1 | $0<e<255$ | any value | normalized negative number |
| 0 | 0 | $>0$ | denormalized positive number |
| 1 | 0 | $>0$ | denormalized negative number |
| 0 | 0 | 0 | positive zero |
| 1 | 0 | 0 | negative zero |
| 0 | 255 | 0 | positive infinity |
| 1 | 255 | 0 | negative infinity |
| 0 | 255 | $\neq 0$ | NaN - does not represent a number |
| 1 | 255 | $\neq 0$ | NaN - does not represent a number |

## Comparison

- Comparing two IEEE-754 encoded numbers requires handling the signs separately, but then the representations can be compared by an unsigned ALU

$$
A \geq B \Leftrightarrow A-B \geq 0 \Leftrightarrow D(A)-D(B) \geq 0
$$

- This is an advantage of the selected encoding and the reason why the sign is not placed at the start of the mantissa
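A sketch of such a comparison in C, assuming IEEE-754 single precision and inputs that are not NaN. The names `bits_of` and `float_ge` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static uint32_t bits_of(float x)
{
    uint32_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* a >= b for floats that are not NaN: the signs are resolved first,
   then the magnitudes are compared as plain unsigned integers. */
bool float_ge(float a, float b)
{
    uint32_t da = bits_of(a), db = bits_of(b);
    bool neg_a = (da >> 31) != 0, neg_b = (db >> 31) != 0;
    uint32_t mag_a = da & 0x7FFFFFFFu, mag_b = db & 0x7FFFFFFFu;
    if ((mag_a | mag_b) == 0)
        return true;                 /* +0 and -0 are equal */
    if (neg_a != neg_b)
        return neg_b;                /* the non-negative operand is the bigger one */
    return neg_a ? mag_a <= mag_b : mag_a >= mag_b;
}
```

The unsigned comparison works because the biased exponent sits above the mantissa bits, so a larger exponent always yields a larger integer.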


## Addition of Floating Point Numbers

- The number with bigger exponent value is selected
- Mantissa of the number with smaller exponent is shifted right - the mantissas are then expressed at same scale
- The signs are analyzed and mantissas are added (same sign) or subtracted (smaller number from bigger)
- The resulting mantissa is shifted right (max by one) if addition overflows or shifted left after subtraction until all leading zeros are eliminated
- The resulting exponent is adjusted according to the shift
- Result is normalized after these steps
- Special-case processing is required if the inputs are not regular normalized numbers or the result does not fit into the normalized representation
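The steps above can be followed in a deliberately simplified C sketch. It is not a full IEEE-754 adder: it assumes two positive normalized single-precision inputs, truncates instead of rounding, and ignores overflow to infinity. All names are illustrative.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Adds two positive normalized floats (truncating, no special cases). */
float toy_fadd(float x, float y)
{
    uint32_t a = f2u(x), b = f2u(y);
    if (a < b) { uint32_t t = a; a = b; b = t; }   /* a has the bigger exponent */
    uint32_t ea = a >> 23, eb = b >> 23;           /* sign is 0, so this is the exponent */
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;     /* restore the hidden leading 1 */
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;
    uint32_t shift = ea - eb;
    mb = shift < 24 ? mb >> shift : 0;             /* express both at the same scale */
    uint32_t m = ma + mb;                          /* same signs: add the mantissas */
    if (m & 0x1000000u) {                          /* addition overflowed 24 bits */
        m >>= 1;                                   /* shift right by one ... */
        ea++;                                      /* ... and adjust the exponent */
    }
    return u2f((ea << 23) | (m & 0x7FFFFFu));      /* hide the leading 1 again */
}
```

For example, `toy_fadd(3.0f, 5.0f)` aligns `1.1 x 2^1` to `0.11 x 2^2`, adds it to `1.01 x 2^2`, gets `10.00 x 2^2` and renormalizes to `1.0 x 2^3`.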


## Hardware of the Floating Point Adder



## Multiplication of Floating Point Numbers

- Exponents are added and signs xor-ed
- Mantissas are multiplied
- Result can require normalization max 2 bits right for normalized numbers
- The result is rounded
- Hardware for multiplier is of the same or even lower complexity as the adder hardware - only adder part is replaced by unsigned multiplier


## Floating Point Arithmetic Operations Overview

Addition: $A \cdot z^{a}$, $B \cdot z^{b}$, $b<a$; unify the exponents by shifting the mantissa:

$$
B \cdot z^{b}=\left(B \cdot z^{b-a}\right) \cdot z^{b-(b-a)}=\left(B \cdot z^{b-a}\right) \cdot z^{a}
$$

$$
A \cdot z^{a}+B \cdot z^{b}=\left[A+\left(B \cdot z^{b-a}\right)\right] \cdot z^{a} \quad \text { sum + normalization }
$$

Subtraction: unification of exponents, subtraction and normalization

Multiplication: $A \cdot z^{a} \cdot B \cdot z^{b}=A \cdot B \cdot z^{a+b}$; normalize $A \cdot B$ if required:

$$
A \cdot B \cdot z^{a+b}=(A \cdot B \cdot z) \cdot z^{a+b-1} \quad \text { by left shift }
$$

Division: $A \cdot z^{a} /\left(B \cdot z^{b}\right)=A / B \cdot z^{a-b}$; normalize $A / B$ if required:

$$
A / B \cdot z^{a-b}=\left(A / B \cdot z^{-1}\right) \cdot z^{a-b+1} \quad \text { by right shift }
$$


## *Memory and Data and Their Storage in Computer Memory

## John von Neumann Computer Block Diagram


- 5 functional units - control unit, arithmetic logic unit, memory, input (devices), output (devices)
- A computer architecture should be independent of the problems being solved. It has to provide a mechanism to load a program into memory. The program controls what the computer does with data and which problem it solves.
- Programs and results/data are stored in the same memory. That memory consists of cells of the same size, and these cells are sequentially numbered (addresses).
- The instruction to be executed next is stored in the cell immediately after the cell holding the preceding instruction (exceptions: branching etc.).
- The instruction set consists of arithmetic, logic, data movement, jump/branch and special/control instructions.

## Memory Address Space

It is an array of addressable units (locations) where each unit can hold a data value. Both the number (range) of addresses and the size of the addressable units (words) are limited.


## Program Layout in Memory at Process Startup


$0 \times 00000000$

- The executable file is mapped ("loaded") to process address space - sections .data and .text (note: LMA != VMA for some special cases)
- The uninitialized data area (.bss - block started by symbol) is reserved and zeroed for C programs
- Stack pointer is set and control is passed to the function _start
- Dynamic memory is usually allocated above _end symbol pointing after .bss
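The regions can be probed from a C program. The sketch below only collects one address per region; the names are illustrative, the actual addresses depend on the platform and loader, and converting a function pointer to an integer is implementation-defined (it works on common hosted systems).

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_var = 42;   /* .data: initial value is loaded from the executable */
int zeroed_var;             /* .bss: only reserved in the file, zeroed at startup */

/* Records one address from each region: .text, .data, .bss, heap, stack.
   Returns 0 on success, -1 when the heap allocation fails. */
int layout_probe(uintptr_t out[5])
{
    int local = 0;                          /* automatic variable on the stack */
    void *heap = malloc(16);                /* dynamic memory */
    if (heap == NULL)
        return -1;
    out[0] = (uintptr_t)&layout_probe;      /* code, section .text */
    out[1] = (uintptr_t)&initialized_var;   /* section .data */
    out[2] = (uintptr_t)&zeroed_var;        /* section .bss */
    out[3] = (uintptr_t)heap;
    out[4] = (uintptr_t)&local;
    free(heap);
    return 0;
}
```

Printing the five values, for example with `printf("%#lx\n", (unsigned long)out[i])`, typically shows the sections and the heap at low addresses and the stack far above them, but the exact picture is system-specific.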


## Key Technological Gaps Prediction



Note: The increase in complexity of algorithms over time has been formalized in literature as the so-called Shannon's Law of Algorithmic Complexity.

## Memory and CPU Speed - Moore's Law



## PC Computer Motherboard



## Computer Architecture (Desktop x86 PC)



## From UMA to NUMA Development (Even in PC Segment)



MC - Memory controller - contains circuitry responsible for SDRAM read and writes. It also takes care of refreshing each memory cell every 64 ms .

## Intel Core 2 Generation



Northbridge became Graphics and Memory Controller Hub (GMCH)

## Intel i3/5/7 Generation



[^0]
## Memory Subsystem - Terms and Definitions

- Memory address - fixed-length sequence of bits or index
- Data value - the visible content of a memory location. A memory location can hold additional control/bookkeeping information:
  - validity flag, parity and ECC bits etc.
- Basic memory parameters:
  - Access time - delay or latency between a request and the access being completed or the requested data returned
  - Memory latency - time between request and data being available (does not include time required for refresh and deactivation)
  - Throughput/bandwidth - main performance indicator. Rate of transferred data units per time.
  - Maximal, average and other latency parameters


## Memory Types and Maintenance

- Types: RWM (RAM), ROM, FLASH
- Implementation: SRAM, DRAM
- Data retention time and conditions (volatile/nonvolatile)
- Dynamic memories (DRAM, SDRAM) require specific care
- Memory refresh - state of each memory cell has to be internally read, amplified and fed back to the cell once every refresh period (usually about 60 ms ), even in idle state. Each refresh cycle processes one row of cells.
- Precharge - necessary phase of access cycle to restore cell state after its partial discharge by read
- Both contribute to maximal and average access time.


## Typical Memory Parameters

- Memory types: RWM (RAM), ROM, FLASH,
- RAM realization: SRAM (static), DRAM (dynamic).
- RAM = Random Access Memory

| type | transistors per cell | 1 bit area | data availability | latency |
| :---: | :---: | :---: | :---: | :---: |
| SRAM | approx. 6 | $<0.1 \mu \mathrm{~m}^{2}$ | always | $<1 \mathrm{~ns}-5 \mathrm{~ns}$ |
| DRAM | 1 | $<0.001 \mu \mathrm{~m}^{2}$ | requires refresh | today $20 \mathrm{~ns}-35 \mathrm{~ns}$ |

## Detail of static and Dynamic Memory Bit Cell



6 transistor static memory cell (single bit)
Single transistor cell of dynamic memory


## Flip-flop Circuits

## RS



D latch, level-controlled flip-flop


D flip-flop, edge-controlled flip-flop


## Usual SRAM Chip and SRAM Cell

## Usual SRAM chip



## SRAM memory cell

Bigger memory size?


## Usual Static Memory Chip Cell

Principle:


Area of one memory cell(bit):


SRAM memory cell
6-transistors CMOS, 4 trans. Version exists


## Usual SRAM Chip

## Typical synchronous SRAM chip


https://www.ece.cmu.edu/~ece548/localcpy/sramop.pdf

## Memory Cell Connection to Matrix

## bitline


bitline $=0$
row-address 1

bitline $=1$
row-address 1

bitline $=Z$

## Selector Switch - One from N Decoder

One Hot Decoder cz: Dekodér 1 ze 4


## Switch Analogy of Multiplexer

Multiplexer 2 to 1 or 1 of 2, cz: 2 kanálový (2-vstupový) multiplexor


Multiplexer 4 to 1 or 1 of 4, cz: 4 kanálový (4-vstupový) multiplexor


## Memory Matrix



Register is necessary for synchronous memory implementation (SDRAM)

## Memory Matrix - Operation



Address is setup at input and it is confirmed by rising edge.


## Memory Matrix - Operation



Decoder activates 1 of $N$ rows and the selected cells are connected to all columns bitlines

## Memory Matrix - Operation



Multiplexer selects column - Data $2=0$
When register is connected before multiplexer then whole row can be read at once and consecutive data words can be streamed out by multiplexer only switching columns
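The read path through the matrix (row decoder, bitlines, column multiplexer) can be modelled in a few lines of C. The 4 x 4 size, the data layout (one byte per row, bit `c` holding column `c`) and the function names are assumptions made for this sketch.

```c
#include <stdint.h>

#define ROWS 4   /* assumed 4 x 4 matrix of 1-bit cells */
#define COLS 4

/* One-hot row decoder: exactly one of the ROWS select lines is active. */
uint8_t row_decoder(unsigned row_addr) { return (uint8_t)(1u << (row_addr % ROWS)); }

/* Reads one bit: the decoder selects a row, the multiplexer picks a column. */
unsigned matrix_read(const uint8_t cells[ROWS], unsigned row_addr, unsigned col_addr)
{
    uint8_t select = row_decoder(row_addr);
    uint8_t row_latch = 0;                   /* register holding the whole selected row */
    for (unsigned r = 0; r < ROWS; r++)
        if (select & (1u << r))
            row_latch = cells[r];            /* the selected cells drive the bitlines */
    return (row_latch >> (col_addr % COLS)) & 1u;   /* column multiplexer */
}
```

Once `row_latch` is filled, further bits of the same row could be delivered by changing only `col_addr`, which is the streaming effect described above.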

## Internal Architecture of the DRAM Memory Chip



This $4 \mathrm{M} \times 1$ DRAM is internally realized as a $2048 \times 2048$ array of 1 b memory cells

## Detail of Dynamic Memory Cell

Single transistor dynamic memory cell

- The nMOS transistor works as an analog switch which connects the selected cell to the "bitline".
- The "wordline" controls which capacitor is connected to the "bitline"

## Dynamic Memory Capacitor Parameters

| Today's DRAM parameters | Capacitance |
| :--- | :---: |
| Cell capacitor | from 10 fF to 50 fF |
| Bit line | about 2 fF |

[Source: l'INSA de Toulouse]

fF (femtofarad) is an SI unit equal to $10^{-15}$ farads.

$$
10^{-6} \mathrm{~F}=1 \mu \mathrm{F}=10^{3} \mathrm{nF}=10^{6} \mathrm{pF}=10^{9} \mathrm{fF}
$$

$\sim 9 \mathrm{fF}$ is the capacitance between two plates of $1 \mathrm{~mm}^{2}$ area placed about 1 mm apart.
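As a rough check of that figure, assuming an ideal parallel-plate capacitor with vacuum (or air) between the plates and the standard value of the vacuum permittivity $\varepsilon_{0}$ (neither assumption is stated on the slide):

$$
C=\varepsilon_{0} \frac{A}{d}=8.854 \times 10^{-12} \,\mathrm{F/m} \cdot \frac{10^{-6} \,\mathrm{m}^{2}}{10^{-3} \,\mathrm{m}} \approx 8.9 \times 10^{-15} \,\mathrm{F} \approx 9 \,\mathrm{fF}
$$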

## Detail of Dynamic Memory Cell

Single transistor dynamic memory cell

- The read operation is complex and slow, takes from 20 to 35 ns, and speedup is almost impossible
- The read is destructive: the capacitor is discharged and the original value has to be restored (refreshed) after each read.
- The femtofarad capacitor spontaneously discharges in a short time, so it has to be refreshed, in the optimum case every 60 ms for each cell, but the maintenance frequency is multiplied by the row count. The required refresh rate depends on temperature

## DRAM Memories - Price Seems to Be Settled for Now

## Price for 1 megabit



Source: Wells Fargo Securities, LLC and Semiconductor Industry Association

## History of DRAM chips development



## Old School DRAM - Asynchronous Access

- The address is transferred in two phases - reduces number of chip module pins and is natural for internal DRAM organization
- This method is preserved even in today's chips

RAS - Row Address Strobe,
 CAS - Column Address Strobe

## Phases of DRAM Memory Read



## EDO-RAM - About 1995

- The output register holds data during the overlap of the next read's CAS phase with the previous access's data transfer; this overlap ("pipelining") increases throughput



## SDRAM - late 1990s - synchronous DRAM

- The SDRAM chip is equipped with a counter that can be used to define the length of a continuous block (burst) which is read together



## SDRAM - the Most Widely Used Main Memory Technology

- SDRAM - clock frequency up to $100 \mathrm{MHz}, 2.5 \mathrm{~V}$.
- DDR SDRAM - data transfer at both CLK edges, 2.5 V , I/O bus clock 100-200 MHz, 0.2-0.4 GT/s (gigatransfers per second)
- DDR2 SDRAM - lower power consumption 1.8 V , frequency up to $400 \mathrm{MHz}, 0.8 \mathrm{GT} / \mathrm{s}$
- DDR3 SDRAM - even lower power consumption at 1.5 V , frequency up to $800 \mathrm{MHz}, 1.6 \mathrm{GT} / \mathrm{s}$
- DDR4 SDRAM - $1.05-1.2 \mathrm{~V}$, I/O bus clock 1.2 GHz, 2.4 GT/s
- DDR5 SDRAM - expected 2019-2020, ~6 GT/s
- All these innovations are focused mainly on throughput, not on the random access latency which for large capacities is still 20 to 35 ns .


## Other Main Memory Types

- QDRx SDRAM (Quad Data Rate) - not twice as fast; it only allows simultaneous read and write thanks to separate clocks for RD and WR. DDR is more effective than QDR when only a single access type is used.
- GDDR SDRAM - today up to GDDR6, designed for graphics cards/GPUs
- based on DDR memories.
- data rate accelerated by wider output bus
- High Bandwidth Memory (HBM) is a high-performance RAM interface for 3D-stacked SDRAM from Samsung, AMD and SK Hynix.
- Another concept is RDRAM (RAMBUS DRAM), which uses a completely different interface. Due to patent litigation, it has not been used in personal computers since 2003.


## Notes for Today SDRAMs and Slides

- Use of the banked architecture enables throughput to be increased by hiding the latency of opening and closing rows. These operations can proceed in parallel on different banks (sequential and interleaved bank mapping). The change results in a minimal pin count increase, which is critical for price and density.
- Ulrich Drepper, Red Hat, Inc., What Every Programmer Should Know About Memory


# *Multi-byte Numbers and Their Storage in Computer Memory

## How to Store Multi-byte Number in Memory

## Hexadecimal number: 0x01234567

Big Endian - downto

| 0x100 | 0x101 | 0x102 | 0x103 |
| :--- | :--- | :--- | :--- |
| 01 | 23 | 45 | 67 |

Little Endian - to

| 0x100 | 0x101 | 0x102 | 0x103 |
| :--- | :--- | :--- | :--- |
| 67 | 45 | 23 | 01 |
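Both byte orders can be produced explicitly in C, independent of the machine the code runs on. The function names are illustrative; `host_is_little_endian` inspects the first byte of a stored word to find out which order the host itself uses.

```c
#include <stdint.h>

/* Stores a 32-bit value with the most significant byte first (big endian). */
void store_be32(uint8_t *mem, uint32_t v)
{
    mem[0] = (uint8_t)(v >> 24);
    mem[1] = (uint8_t)(v >> 16);
    mem[2] = (uint8_t)(v >> 8);
    mem[3] = (uint8_t)v;
}

/* Stores a 32-bit value with the least significant byte first (little endian). */
void store_le32(uint8_t *mem, uint32_t v)
{
    mem[0] = (uint8_t)v;
    mem[1] = (uint8_t)(v >> 8);
    mem[2] = (uint8_t)(v >> 16);
    mem[3] = (uint8_t)(v >> 24);
}

/* Returns 1 when the machine running this code is little endian. */
int host_is_little_endian(void)
{
    uint32_t probe = 0x01234567u;
    return *(const uint8_t *)&probe == 0x67;
}
```

Storing `0x01234567` with `store_be32` yields the bytes 01 23 45 67 at increasing addresses, with `store_le32` the bytes 67 45 23 01.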



The terms Little-Endian and Big-Endian come from the book Gulliver's Travels (Jonathan Swift, 1726), where they refer to the two opposing factions in Lilliput. One faction ate eggs from the narrow end to the broader one, while the Big-Endians proceeded the other way around. And the war did not wait long ...

Do you remember how war ended?

## Memory Alignment (cz:zarovnání paměti?)

## .align n directive

- next space allocated for data or text starts at $2^{n}$ divisible address

Example .align 2

- two least significant bits (LSB) are equal to 00

Memory is usually addressed as a byte array (in C, more precisely, as an array of chars).
The word of a 32-bit processor is then formed of 4 bytes.

Memory


## Align in Data Segment Filled by Assembler

`.data`

`.align 2 // or .align 4 on x86, use .p2align and .balign`

`var1: .byte 3, 5, 'A', 'P', 'O'`

`.align 2 // or .align 4 on x86, use .p2align and .balign`

`var2: .word 0x12345678 // or .long on x86`

`.align 3 // or .align 8 on x86, use .p2align and .balign`

`var3: .2byte 0x1000 // or .word on x86`

var1 starts at address 0x2000, var2 at 0x2008 and var3 at 0x2010.

| BIG ENDIAN | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0x2000 | 3 | 5 | 41 | 50 | 4F |  |  |  | 12 | 34 | 56 | 78 |  |  |  |  |
| 0x2010 | 10 | 00 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

| LITTLE ENDIAN | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0x2000 | 3 | 5 | 41 | 50 | 4F |  |  |  | 78 | 56 | 34 | 12 |  |  |  |  |
| 0x2010 | 00 | 10 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

## C Language: Pointer

## \& (address operator)

Returns the lowest address in the memory address space where the space/cells allocated to store the variable start.

Example

`int y = 5;`

`int *yPtr;`

`yPtr = &y; // yPtr is assigned the address of y`

yPtr "points to" y


## C Language: Pointer Operations

\& (address operator)
returns the address of its operand

\* (dereference operator)
returns the value stored at the address, interpreted according to the pointer type

\* and \& are inverse (but are not applicable in each case)

`*&myVar == myVar` and `&*yPtr == yPtr`

## C Language: Size of Element Pointed by C Pointer

int * ptri;
char * ptrc;
double * ptrd;

```
*ptrx      ≡ ptrx[0]
*(ptrx+1)  ≡ ptrx[1]
*(ptrx+n)  ≡ ptrx[n]
*(ptrx-n)  ≡ ptrx[-n]
```

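`nr1 = sizeof (double);`

`nr2 = sizeof (double*);`

`nr1 != nr2` in general: the size of the pointed-to element is independent of the size of the pointer itself (for example 8 vs. 4 bytes on typical 32-bit targets).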

```
ptrd+1
```


## C Language: Pointer with const Qualifier

`int x, y;`

`int * lpio = &y;`

`*lpio = 1; x = *lpio; lpio++;`

`const int * lpCio = &y;`

`x = *lpCio; lpCio++;` but `*lpCio = 1;` is not allowed

`int * const lpioC = &y;`

`*lpioC = 1; x = *lpioC;` but `lpioC++;` is not allowed

`const int * const lpCioC = &y;`

`x = *lpCioC;` but neither `*lpCioC = 1;` nor `lpCioC++;` is allowed
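The four declarations can be collected into one compilable function. The statements a compiler rejects are kept as comments; the variable `other` and the function name are additions made for this sketch.

```c
/* Shows which operations each placement of const permits. */
int const_pointer_demo(void)
{
    int x, y = 0;
    int other = 7;

    int *lpio = &y;                   /* pointer and target are both writable */
    *lpio = 1; x = *lpio; lpio++;

    const int *lpCio = &y;            /* target is read-only through lpCio */
    /* *lpCio = 1;    error: assignment of read-only location */
    x = *lpCio; lpCio = &other;       /* the pointer itself may still change */

    int *const lpioC = &y;            /* pointer is fixed, target is writable */
    *lpioC = 2; x = *lpioC;
    /* lpioC++;       error: increment of read-only variable */

    const int *const lpCioC = &y;     /* neither pointer nor target may change */
    /* *lpCioC = 3;   error */
    /* lpCioC++;      error */
    x = *lpCioC;

    (void)lpio; (void)lpCio;          /* silence unused-value warnings */
    return x;                         /* y was last set to 2 through lpioC */
}
```

A useful reading rule: `const` applies to what stands immediately to its left, or to the type on its right when it comes first.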

## C Language and Pointers

`int i;`

`int *p;`

`p = &i;`

The following statements all increment `i`:

`i = i + 1;` `*p = *p + 1;` `i++;` `(*p)++;` `p[0]++;`



```
p++;
p=(int*) ((char*)p + sizeof(int));
```
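The equivalence of the two forms, and the scaling of pointer arithmetic by the element size, can be checked with a short sketch. The function names are illustrative.

```c
#include <stddef.h>

/* Returns 1 when p + 1, the explicit byte-wise computation and
   array indexing all agree for a pointer to double. */
int pointer_forms_agree(void)
{
    double d[4] = { 0.5, 1.5, 2.5, 3.5 };
    double *ptrd = d;
    double *next = (double *)((char *)ptrd + sizeof(double));  /* same as ptrd + 1 */
    return next == ptrd + 1 && *(ptrd + 2) == ptrd[2] && *next == 1.5;
}

/* Distance in bytes between ptrd and ptrd + 1. */
size_t double_step_in_bytes(void)
{
    double d[2] = { 0.0, 0.0 };
    double *ptrd = d;
    return (size_t)((char *)(ptrd + 1) - (char *)ptrd);
}
```

Adding 1 to a pointer therefore moves the address by `sizeof` of the pointed-to type, not by one byte.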


## The Lecture and Real Programming Question

Quick Quiz 1.: Is the result of both code fragments the same?
Quick Quiz 2.: Which of the code fragments is processed faster and why?

A:

`int matrix[M][N];`

`int i, j, sum = 0;`

`for (i = 0; i < M; i++)`

`for (j = 0; j < N; j++)`

`sum += matrix[i][j];`

B:

`int matrix[M][N];`

`int i, j, sum = 0;`

`for (j = 0; j < N; j++)`

`for (i = 0; i < M; i++)`

`sum += matrix[i][j];`

Is there a rule how to iterate over matrix elements efficiently?
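As a starting point for the discussion, both fragments can be wrapped into functions and the position of an element in memory written out. The sizes `M` and `N` and all function names are assumptions made for this sketch; it does not measure speed.

```c
#include <stddef.h>

#define M 3   /* assumed small sizes for illustration */
#define N 4

/* Fragment A: the inner loop walks along one row. */
int sum_row_major(int matrix[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += matrix[i][j];
    return sum;
}

/* Fragment B: the inner loop walks down one column. */
int sum_column_major(int matrix[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += matrix[i][j];
    return sum;
}

/* Offset of matrix[i][j] from matrix[0][0], counted in elements:
   C stores two-dimensional arrays row by row. */
size_t element_offset(size_t i, size_t j) { return i * N + j; }
```

Both functions return the same sum. In fragment A consecutive inner-loop steps touch neighbouring addresses (offset grows by 1), in fragment B they are `N` elements apart, which usually makes A friendlier to caches and burst transfers from SDRAM.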


[^0]:    ${ }^{1}$ Theoretical maximum bandwidth
    ${ }^{2}$ All SATA ports capable of $3 \mathrm{~Gb} / \mathrm{s}$. 2 ports capable of $6 \mathrm{~Gb} / \mathrm{s}$.
    Intel ${ }^{\circ}$ X79 Express Chipset Block Diagram

