# **Advanced Computer Architectures**

## GPU (Graphics processing unit), GPGPU (General-purpose **computing** on GPU; General-purpose GPU) and GPU **Computing**



Czech Technical University in Prague, Faculty of Electrical Engineering Slides authors: Michal Štepanovský, update Pavel Píša

- **Tianhe-1A** Chinese Academy of Sciences' Institute of Process Engineering (CAS-IPE)
- Molecular simulations 110 milliards atomd (1.87 / 2.507 PetaFLOPS)
- Rmax: 2.56 Pflops, Rpeak: 4,7 PFLOPS
- **7 168** Nvidia Tesla M2050 (448 Thread processors, 512 GFLOPS FMA)
- **14 336** Xeon X5670 (6 cores / 12 threads)
- "If the Tianhe-1A were built only with CPUs, it would need more than 50,000 CPUs and consume more than 12MW. As it is, the Tianhe-1A consumes 4.04MW."

http://www.zdnet.co.uk/news/emerging-tech/2010/10/29/china-builds-worlds-fastest-supercomputer-40090697/ which leads to 633 GFlop/kWatt (K Computer - 830 GFlop/kWatt)

- It uses oven interconnection network: <u>Arch</u>, 160 Gbps
- There are already three supercomputers utilizong graphics cards in China, Tianhe-1 (AMD Radeony HD 4870 X2), Nebulae (nVidia Tesly C2050) and Tianhe-1A
- <u>http://i.top500.org/system/176929</u>
- <u>http://en.wikipedia.org/wiki/Tianhe-I</u>

# Tianhe-2

- 33.86 PFLOPS
- Power consumption 17 MW
- Kylin Linux



- 16,000 nodes, each build from 2 Intel Ivy Bridge Xeon processory a 3 Intel Xeon Phi koprocesory (61 jader) = 32000 CPU a 48000 coprocessors, 3 120 000 cores in total
- Fortran, C, C++, Java, OpenMP, and MPI 3.0 based on MPICH
- A broadcast operation via MPI was running at 6.36 GB/s and the latency measured with 1K of data within 12,000 nodes is about 9 us
- directive based intra-node programming model by OpenMC (in progress) – instead of Open-MP, CUDA, OpenACC, or OpenCL

# Sunway TaihuLight

- 93 PFLOPS (LINPACK benchmark), peak 125 PFLOPS
- Interconnection 14 GB/s, Bisection 70 GB/s
- Memory 1.31 PB, Storage 20 PB
- 40,960 SW26010 (Chinese) total 10,649,600 cores
- SW26010 256 processing cores + 4 management
- 64 KB of scratchpad memory for data (and 16 KB for instructions)
- Sunway RaiseOS 2.0.5 (Linux based)
- OpenACC (for open accelerators) programming standard
- Power Consumption 15 MW (LINPACK)



# Sunway TaihuLight

- Core group
  - Management Processing Element (MPE)
  - 64 Computing Processing Elements (CPEs)
- 4 core groups on SW26010 chip
- 8 floating point operations per cycle per CPE core (64 bit), 16 MPE



Source: Report on the Sunway TaihuLight System, Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory

#### More accurate results

Multiply-Add (MAD): Product А × в = (truncate extra digits) + Result С = Fused Multiply-Add (FMA) Product Α в = (retain all digits) × Result С =

• **API** (Application Programming Interface): OpenGL, DirectX – the GPU can be considered in such applications as coprocessor of main CPU

• Estonia Donates Project: Our <u>GPGPU</u> supercomputer is <u>GPU</u>-based massively parallel machine, employing more than thousand parallel streaming processors. Using GPU-s is very new technology, very price- and cost-effective compared to old CPU solutions. Performance (currently):

6240 streaming processors + 14 CPU cores 23,2 arithmetic TFLOPS (yes, 23 200 GFLOPS)

http://estoniadonates.wordpress.com/our-supercomputer

#### • Supermicro: <u>2026GT-TRF-FM475</u>

- 2x Quad/Dual-Core Intel® Xeon® processor 5600/5500 series
- Intel® 5520 chipset with QPI up to 6.4 GT/s + PLX8648
- Up to 96GB of Reg. ECC DDR3 DIMM SDRAM
- FM475: 4x NVIDIA Tesla M2075 Fermi GPU Cards
- FM409: 4x NVIDIA Tesla M2090 Fermi GPU Cards

http://www.supermicro.com/GPU/GPU.cfm#GPU\_SuperBlade

 "FASTRA: the world's most powerful desktop supercomputer" We have now developed a PC design that incorporates 13 GPUs, resulting in a massive 12TFLOPS of computing power. <a href="http://fastra2.ua.ac.be/">http://fastra2.ua.ac.be/</a>



#### CPU vs. GPU



Nvidia: "GPU computing is possible because today's GPU does much more than render graphics: It sizzles with a teraflop of floating point performance and crunches application tasks designed for anything from finance to medicine." Source: www.nvidia.com

Performance metrics – do you remember?

- Let's {  $R_i$  } are execution speeds of different programs i = 1, 2, ..., m measured in MIPS (MFLOPS), or IPS (FLOPS)
- <u>The arithmetic mean performance</u>:  $R_a = \sum_{i=1}^{m} \frac{R_i}{m} = \frac{1}{m} \sum_{i=1}^{m} R_i$

 $R_a$  is equally weighted (1/m) in all programs and is proportional to the sum of the IPC, but not the sum of execution times (inversely proportional). Arithmetic mean (average) not generally usable:

$$\begin{split} R_{a} &= \frac{1}{2} \left( R_{1} + R_{2} \right) = \frac{1}{2} \left( \frac{IC_{1}}{T_{1}} + \frac{IC_{2}}{T_{2}} \right) = \frac{1}{2} \left( \frac{IC_{1}}{IC_{1}.CPI_{1}T_{CLK}} + \frac{IC_{2}}{IC_{2}.CPI_{2}T_{CLK}} \right) = \\ &= \frac{1}{T_{CLK}} \left( \frac{IPC_{1} + IPC_{2}}{2} \right) = \frac{1}{T_{CLK}} \left( \frac{IC_{1}}{2C_{1}} + \frac{IC_{2}}{2C_{2}} \right) \quad but \quad IPC_{1,2} = \frac{IC_{1} + IC_{2}}{C_{1} + C_{2}} \end{split}$$
Only iff  $C_{1} = C_{2}$  (total number of cycles of both programs is equal) then

- R<sub>a</sub> is usable
- In praxis: The arithmetic mean of execution speed of two (or more) different programs is not related to overall execution speed! Not usable!
   B4M35PAP Advanced Computer Architectures

### Performance metrics – do you remember?

<u>The geometric mean:</u>  $R_g = \prod_{i=1}^m R_i^{\frac{1}{m}}$ 

It does not summarize real performance. It has no inverse relation to overall execution time of all programs. Usable only for comparison with normalized results to reference compute. m

The harmonic mean:  $R_{h} = \frac{1}{\sum_{i=1}^{m} \frac{1}{R_{i}}}$   $R_{h} = \frac{2}{\frac{1}{R_{1}} + \frac{1}{R_{2}}} = \dots = \frac{1}{T_{CLK}} \left(\frac{2}{CPI_{1} + CPI_{2}}\right) = \frac{1}{T_{CLK}} \frac{2IC_{1}IC_{2}}{C_{1}IC_{2} + C_{2}IC_{1}}$ 

Only iff  $IC_1 = IC_2$  (both programs are of the same size)then  $R_h$  is usable

• There exist even <u>weighted</u> versions of these performance metrics.

### **3D** graphics pipeline

- it is a way of processing graphic data to achieve an image (the input is representation of a 3D scene, output is 2D image)
- Next phases of geometric/graphics data processing:
  - transformations (scaling, rotations, translation,..) matrix multiplication
  - lighting (only vertexes) dot products of vectors
  - projection transformations (into camera 3D coordinates) matrix multiplication,
  - clipping, rasterization and a textures mapping (pixels from this stage)





What is important for us -> HW support required – GPU development

- What was original solution?
  - :narrowly specialized single-purpose HW according to the principle of 3D graphic pipeline:
  - vertex shader (3D model manipulation, vertexes lightening),
  - geometry shader (adds/removes vertexes,..)
  - pixel shader (more precise: fragment shader) (input is rasterization output; defines collor of "pixel" (fragment) - textures..)
  - ROP unit (create pixel from pixel fragments, optimizes image for view) ROP – Raster OPerator, (ATI: Element Render Back-End)



- Today concept (and future directions)?
  - HW function in each phase much more flexible, programmable (not only the computation operation "program", but even support of controlflow primitives)
  - 16, 24, 32, 64 floating point precision supported today
  - unified shaders (each can be used for all functions..) (ATI Xenos, GeForce 8800) – advantage?
    - low detail scene (vertex shader vs. <u>pixel shader</u>)
    - highly detailed scene (<u>vertex</u> shader vs. pixel shader)

• What it means for us / what are benefits?





B4M35PAP Advanced Computer Architectures

• The Shader Unification Principle - Architecture provides one large set of data paralleled floating point processors generic enough to replace the functions of individual shaders



#### GPU - GeForce 8800



#### GeForce 8800 – hardware limits

- 512 threads in one block
- 8 blocks on one SM (Streaming Multiprocessor)
- 768 threads on one SM > 768x16=12 288 threads in total!
- 128 threads simultaneously running
- 16 384 bytes shared cache for one SM
- two threads from different blocks cannot cooperate together



#### GPU - GeForce 7800 – for comparison



#### GPU - GeForce 7800



**B4M35PAP Advanced Computer Architectures** 

### CUDA (Compute Unified Device Architecture)



- Kernel part of application running on GPU
- Kernel is executed on Gridu
- Grid latice of threads blocks
- Thread block grup of threads starting at same address and communicationg through shared memory and synchronization barriers (<=512)
- One block to one processor (Streaming Multiprocessor - SM)
- One thread in same block to one execution unit (Streaming Processor core - SP core)

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=2

### CUDA (Compute Unified Device Architecture)



CUDA memory model:

- Registers and shared memory on chip
- Local memory frame buffer
- Constant Mem and Texture Mem – frame buffer, only for reading, kcached on chip, coherence?
- Global memory

red = fast (on chip) orange = slow (DRAM)

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=3

- CUDA C is based on C language with extensions but even some limitations
- Kernel specified by \_\_global\_
- each thread executing kernel is assigned by unique ID

```
// Definition of vector addition function:
            void VecAdd(int n, float* A, float* B, float* C)
{
    for(int i=0; i<n; i++)
        C[i] = A[i] + B[i];
}
int main()
{
    ...
    // Sum of vectores of length N:
    VecAdd(N, A, B, C);
```

- CUDA C is based on C language with extensions but even some limitations
- Kernel specified by \_\_global\_
- each thread executing kernel is assigned by unique ID

```
// Kernel definition
  _global___ void VecAdd(int n, float* A, float* B, float* C)
{
  int i = threadIdx.x;
  if(i<n) C[i] = A[i] + B[i];
}
int main()
{
  // Execute N threads onGPU - soucet vektoru delky N:
  VecAdd<<<1, N>>>(N, A, B, C); <<<number of blocks, number of threads>>>
```

- To support native support for vector, 2D and 3D matrices, variable *threadIdx* is implemented as 3-components vector
- For 2D block of dimensions (Dx, Dy), thread on position (x,y) has its ID (x + y Dx)
- For 3D block of dimensions (Dx, Dy, Dz), thread on position (x,y,z) has its id ID (x + y Dx + z Dx Dy)

```
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
```

```
int main()
{
    ...
    // Execute kernel as block of dimensions N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
```

Number of therads inside block is limited. All therads belonging to same block are executed on same processor (SM) and share limited memory resources.

Today single block of threads can contain limited number of threads where today limit is usually 1024 threads.

```
_global___ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < N \&\& j < N)
  C[i][j] = A[i][j] + B[i][j];
}
int main()
{
  dim3 threadsPerBlock(16, 16);
  dim3 numBlocks(N /threadsPerBlock.x, N/ threadsPerBlock.y);
  MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
```

#### CUDA



The fundamental condition is that blocks are required to be executable indepenentdly (executed in arbitrary order, simultaneously/parallel or sequential/serial)

Support for many other usable functions:

- cudaMalloc(), cudaMallocPitch(), cudaMalloc3D()
- cudaFree()
- cudaMemcpy(), cudaMemcpy2D() , cudaMemcpy3D()
- dimBlock(), dimGrid()
- atd.

**Declarations:** 

\_\_global\_\_ void KernelFunc(...); //kernel function, runs on device \_\_device\_\_ int GlobalVar; //variable in device memory \_\_shared\_\_ int SharedVar; //variable in per-block shared memory Speciální proměnné: dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

#### Incrementing large array of elements – CUDA

```
#include <stdio.h>
 global__ void inkrement(int* out, int* in) {
 int idx = blockDim.x * blockIdx.x + threadIdx.x;
 out[idx] = in[idx]+1;
}
int main (int argc, char** argv)
{
 int* num h;
                         // pointer to array
                          // pointer to array in global memory
 int* num d;
 int* num out d; // pointer to output array in global memory
 size t num_size = 128*512; // size of array (too high for single block of threads)
                                                   // number of threads in one block
 int num_threads_per_block = 128;
 int num_blocks = num_size/num_threads_per_block; // lattice size
 size t num size bytes = sizeof (int)*num size; // array size in bytes
 num_h = (int*)malloc (num_size_bytes);
 cudaMalloc ((void**) &num_d, num_size_bytes); // global memory allocation
 cudaMalloc ((void**) &num out d, num size bytes); // global memory allocation
 for (unsigned int i = 0; i < num_size; i++) {</pre>
   num_h[i] = i;
  }
 cudaMemcpy (num_d, num_h, num_size_bytes, cudaMemcpyHostToDevice);
 inkrement<<<num_blocks, num_threads_per_block>>> (num_out_d, num_d);
 cudaThreadSynchronize();
 cudaMemcpy (num_h, num_out_d, num_size_bytes, cudaMemcpyDeviceToHost);
 cudaFree(num d); cudaFree(num out d); free(num h);
 return 0;
}
```

### **CUDA Unified Memory**

- Introduced 2014 with CUDA 6 and the Kepler cudaError\_t cudaMallocManaged(void\*\* ptr, size\_t size);
- pre-Pascal GPUs (Tesla K80) allocates memory on the GPU
- Pascal, Volta, ... pages can migrate to any processor's memory, populated with pagetables on demand
- cudaMemPrefetchAsync(ptr, length, destDevice, stream)
- cudaMemAdvise(ptr, length, advice, device) cudaMemAdviseSetReadMostly, cudaMemAdviseSetPreferredLocation, cudaMemAdviseSetAccessedBy
- Pascal and later NVLINK supports native atomics in hardware PCI-E will have software-assisted atomics

#### Nvidia Pascal based TESLA P100



- 3584 CUDA cores
- 4.7 FP64 TFLOPS
   9.3 FP32 TFLOPS
- 160 GB/s NVLink



Source: Tesla Volta / DGX-1v by Ralph Hinsche

#### Nvidia Volta Based TESLA V100

- 5,120 CUDA cores
- 640 NEW Tensor cores
- 7.5 FP64 TFLOPS
   15 FP32 TFLOPS
- 120 Tensor TFLOPS
- 20MB SM RF 16MB Cache 16GB HBM2 @ 900 GB/s
- 300 GB/s NVLink



Source: Tesla Volta / DGX-1v by Ralph Hinsche

# OpenCL

- Is CUDA C only possibility to accelerate computations?
- OpenCL The open standard for parallel programming of heterogeneous systems

```
void VecAdd(int n, float* A, float* B, float* C)
{
    for(int i=0; i<n; i++)
        C[i] = A[i] + B[i];
}</pre>
```

```
OpenCL:
kernel void VecAdd(global const float* A, global const
   float* B, global const float* C)
{
   int i= get_global_id(0);
   C[i] = A[i] + B[i];
}
```

### Jacket

#### Matlab support

```
A = gdouble(B); % to push B to the GPU from the CPU
B = double(A); % to pull A from the GPU back to the CPU
```

```
X = gdouble( magic( 3 ) );
Y = gones( 3, 'double' );
A = X * Y
```

```
GPU_matrix = gdouble( CPU_matrix );
GPU_matrix = fft( GPU_matrix );
CPU_matrix = double( GPU_matrix );
```

#### Goose

Controlled by compiller directives

```
#pragma goose parallel for loopcounter(i, j)
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
    for (k = 0; k < 3; k++)
        C[k] = A[j][k] - B[i][k];</pre>
```

Also available other frameworks:

- PGI Accelerator
- CAPS HMPP
- Ct od Intelu
- Brook stream programming language (Stanford University)
- Podpora: Java, Python, C++, .NET, Mathematica

### Is today support ideal?

- Control flow instructions are executed in SIMD manner across all threads of one warp. Divergent branch results in creation of two separate threads groups which are executed sequentially. Explicit synchronization point (reconvergence point) can help to increase throughput.
- Memory "intensity" of memory accesses (mainly global)
- Data sharing intra-threads communication
- It is necessary to consider the time spent on the efforts made to achieve maximum throughput (optimization) vs. time obtained by optimizing itself ...

### Applications



KRÜGER J., BÜRGER K., WESTERMANN R.: Interactive screen-space accurate photon tracing on GPUs. In *Eurographics Symposium on Rendering (June*2006), pp. 319–329.

# Applications

- linear algebra
- basic and partial differential equations (head conduction, fluids flow, stress of mechanical structures, vibrations,..)
- signal processing,
- images processing,
- analysis of chemical compounds, drug search
- evolutionary and genetic algorithms
- optimizations
- neural networks
- •

#### Neural network on CPU



#### Neural network on GPU - CUDA

Time [ms] of evaluation of 100 networks with 'n' neurons for 50 time steps (average of 1000 runs) (Client – 'MacBook Pro', Server – 'PC', 100Mbit Ethernet)



Source of data: Zdeněk Buk