**Parallel Programming**

Jan Faigl

Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague

Lecture 10, BAB36PRGA - Programming in C

## Overview of the Lecture

- Part 1 - Introduction to Parallel Programming
  - Introduction
  - Parallel Processing
  - Semaphores
  - Shared Memory
  - Messages

## Part I - Introduction to Parallel Programming

## Motivation: Why Deal with Parallel Programming

- Increase computational power.
  - With a multi-processor system, we can solve a computational problem faster.
- Efficient usage of the computational power.
  - Even a running program may wait for data.
  - E.g., a usual program with user interaction typically waits for the user input.
- Simultaneous processing of many requests.
  - Handling requests from individual clients in a client/server architecture.

## Parallel Programming

- The idea of parallel programming dates back to the 1960s, with the first multi-program and pseudo-parallel systems.
- Parallelism can be hardware or software based.
  - Hardware based: true hardware parallelism of multiprocessor systems.
  - Software based: pseudo-parallelism.
- Pseudo-parallelism: a program with parallel constructions may run in a pseudo-parallel environment on single- or multi-processor systems.

## Process - Executed Program

- A process is an executed program running in a dedicated memory space.
- A process is an entity of the Operating System (OS) that is scheduled for independent execution.
- A process is usually in one of three basic states:
  - Executing: currently running on the processor (CPU);
  - Blocked: waiting for a peripheral;
  - Waiting: waiting for the processor.
- A process is identified in the OS by its identifier, e.g., the process identifier (PID).
- The scheduler of the OS manages running processes so that they are allocated to the available processors.

## Process States

[Figure: process state diagram. States: ready processes, active process, waiting (blocked) processes. Transition labels: process start due to an external cause; move to the head of the ready queue; the processor is taken away from the process; the process requested a service it must wait for; the process requested a service that can be handled immediately; the reason for waiting has passed; the process requested termination.]

## Multi-processor Systems

- Multi-processor systems allow true parallelism.
- It is necessary to synchronize processors and support data communication.
  - Resources for activity synchronization.
  - Resources for communication between processors (processes).

## Possible Architectures for Parallel Executions

- Control of individual instructions.
  - SIMD (Single-Instruction, Multiple-Data): the same instructions are simultaneously performed on different data.
    - "Processors" are identical and run synchronously.
    - E.g., "vectorization" such as MMX, SSE, 3DNow!, AVX, AVX2, etc.
  - MIMD (Multiple-Instruction, Multiple-Data): processors run independently and asynchronously.
- Memory access control.
  - Systems with shared memory: central shared memory.
    - E.g., multi-core CPUs.
  - Systems with distributed memory: each processor has its own memory.
    - E.g., computational grids.

## Pipeline-based Communication - Signal Generator and Visualization

- Let us have a signal generator `sgen`, which is a program that generates a sequence of values to its stdout.
- The visualization can be realized in another application, `tsignal_viewer`, that reads signal values from the stdin.
- We can connect these two applications by `./sgen | ./tsignal_viewer`.

```
$ ./sgen
0.000000
1.075095
1.031029
0.916540
0.748307
0.342897
0.149521
-0.016643
-0.147374
-0.239553
-0.293963
-0.314364
-0.306833
-0.279048
-0.239311
-0.195373
$ ./sgen | wc
  65772   65772  618529
$ ./sgen | ./tsignal_viewer
```

[Figure: visualization of the generated signal in `tsignal_viewer`.]

- HW9B is an extension of the concept using a user-defined communication protocol between a signal generator and a control application with visualization and multi-threading programming.

## Parallel Computing using GPU

## Massive parallelism using graphics cards

- Image rendering performed pixel-by-pixel can be easily parallelized.
- Graphics Processing Units (GPUs) have a similar (or even higher) degree of integration compared with the main processors (CPUs).
- They have a huge number of parallel processors.
  - E.g., GeForce GTX 1060 ~ 1280 cores.
- The computational power can also be used in other applications.
  - Processing a stream of data (SIMD instructions - processors).
  - GPGPU - General Purpose computation on GPU. http://www.gpgpu.org
  - OpenCL (Open Computing Language): GPGPU abstract interface.
  - CUDA - Parallel programming interface for NVIDIA graphics cards. http://www.nvidia.com/object/cuda_home.html

## Computational Power (2008)

- What is the reported processor computational power?
- Graphics (stream) processors (peak catalogue values):

| Processor | Computational power |
|---|---|
| (name not recoverable) | 96 GigaFLOPs |
| (name not recoverable) | 102 GigaFLOPs |
| GeForce 8800 GTX | 518 GigaFLOPs (including texture units) |
| Radeon HD 4670 | 480 GigaFLOPs |
| GeForce RTX 4060 (2023) | 15 110 GigaFLOPs |

- Main processors (test linpack 32-bit):

| Processor | Computational power |
|---|---|
| Phenom X4 9950 (@2.6 GHz) | 21 GigaFLOPs |
| Core 2 Duo E8600 (@3.3 GHz) | 22 GigaFLOPs |
| Core 2 Quad QX9650 (@3.3 GHz) | 35 GigaFLOPs |
| Core i7 970 (@3.2 GHz) | 42 GigaFLOPs |
| Core i9-13900 (@2.00-5.60 GHz) (2023) | 846 GigaFLOPs |

- Is the reported power really achievable? (float vs double)
- How about other indicators?
  - E.g., computational power / power consumption.
  - CSX700 has a typical power consumption of around 9 W.

## CUDA

- NVIDIA Compute Unified Device Architecture.
- Extension of C to access the parallel computational units of the GPU.
- The computation (kernel) is executed by the GPU.
- The kernel is performed in parallel using the available computational units.
- Host - main processor (process).
- Device - GPU.
- Data must be in the memory accessible by the GPU.
  - Host memory → Device memory
- The result (of the computation) is stored in the GPU memory.
  - Host memory ← Device memory

## CUDA - Computational Model

- The kernel (computation) is divided into blocks.
- Each block represents a parallel computation of a part of the result.
  - E.g., a part of the matrix multiplication.
- Each block consists of computational threads.
- Parallel computations are synchronized within the block.
- Blocks are organized into the grid.
- Scalability is realized by dividing the computation into blocks.
  - Blocks are not necessarily computed in parallel; it depends on the available number of computational units.

## CUDA - Grid, Blocks, Threads, and Memory Access

- Access time to the memory.
- Collisions for simultaneous access of several threads.

## CUDA - Example - Matrix Multiplication 1/8

- NVIDIA CUDA SDK - Version 2.0, matrixMul.
- Simple matrix multiplication.
  - $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$
  - Matrices have identical dimensions $n \times n$, where $n$ is a multiple of the block size.
- Implementations:
  - naive implementation in C (3× for loop),
  - naive implementation in C with matrix transpose,
  - CUDA implementation.
- Hardware
  - CPU - Intel Core 2 Duo @ 3 GHz, 4 GB RAM,
  - GPU - NVIDIA G84 (GeForce 8600 GT), 512 MB RAM.

## CUDA - Example - Matrix Multiplication 2/8

Naive implementation

```c
void simple_multiply(const int n,
      const float *A, const float *B, float *C)
{
   for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
         float prod = 0;
         for (int k = 0; k < n; ++k) {
            prod += A[i * n + k] * B[k * n + j];
         }
         C[i * n + j] = prod;
      }
   }
}
```

## CUDA - Example - Matrix Multiplication 3/8

Naive implementation with transpose

```c
#include <stdlib.h> // free()

// create_matrix(n) is a helper of the lecture code that is not shown on the
// slide; it is expected to allocate an n x n matrix of floats.
void simple_multiply_trans(const int n,
      const float *a, const float *b, float *c)
{
   float *bT = create_matrix(n);
   // transpose b into bT
   for (int i = 0; i < n; ++i) {
      bT[i*n + i] = b[i*n + i];
      for (int j = i + 1; j < n; ++j) {
         bT[i*n + j] = b[j*n + i];
         bT[j*n + i] = b[i*n + j];
      }
   }
   // both operands are now read row by row in the inner loop
   for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
         float tmp = 0;
         for (int k = 0; k < n; ++k) {
            tmp += a[i*n + k] * bT[j*n + k];
         }
         c[i*n + j] = tmp;
      }
   }
   free(bT);
}
```
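The two CPU variants of the matrix multiplication can be checked against each other and timed with a small harness. The following is only a minimal sketch under stated assumptions: the names `multiply_naive`, `multiply_trans`, `max_abs_diff`, and `compare_variants` are hypothetical (they are not the lecture code), the input values are arbitrary, and `clock()` measures CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Naive product c = a * b of two n x n row-major matrices. */
static void multiply_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j]; /* b is read with stride n */
            }
            c[i * n + j] = sum;
        }
    }
}

/* Same product, but b is transposed first so that the inner loop reads
 * both operands sequentially. */
static void multiply_trans(int n, const float *a, const float *b, float *c)
{
    float *bT = malloc((size_t)n * n * sizeof *bT);
    assert(bT != NULL);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            bT[j * n + i] = b[i * n + j];
        }
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * bT[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
    free(bT);
}

/* Largest absolute element-wise difference between two n x n matrices. */
static float max_abs_diff(int n, const float *x, const float *y)
{
    float max = 0.0f;
    for (int i = 0; i < n * n; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > max) {
            max = d;
        }
    }
    return max;
}

/* Hypothetical harness: runs both variants on the same deterministic input,
 * stores the CPU times in milliseconds, and returns the largest difference
 * between the two results. */
static float compare_variants(int n, double *ms_naive, double *ms_trans)
{
    const size_t count = (size_t)n * n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c1 = malloc(count * sizeof *c1);
    float *c2 = malloc(count * sizeof *c2);
    assert(a != NULL && b != NULL && c1 != NULL && c2 != NULL);
    for (size_t i = 0; i < count; ++i) { /* arbitrary test values */
        a[i] = (float)(i % 7) - 3.0f;
        b[i] = (float)(i % 5) * 0.5f;
    }
    const clock_t t0 = clock();
    multiply_naive(n, a, b, c1);
    const clock_t t1 = clock();
    multiply_trans(n, a, b, c2);
    const clock_t t2 = clock();
    *ms_naive = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
    *ms_trans = 1000.0 * (double)(t2 - t1) / CLOCKS_PER_SEC;
    const float diff = max_abs_diff(n, c1, c2);
    free(a);
    free(b);
    free(c1);
    free(c2);
    return diff;
}
```

A call such as `compare_variants(1104, &t_naive, &t_trans)` from a `main` function gives both times for one matrix size; the measured values depend on the machine, the compiler, and the optimization flags, so they are not expected to reproduce the numbers reported in the lecture.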
## CUDA - Example - Matrix Multiplication 4/8

CUDA - computation strategy

- Divide matrices into blocks.
- Each block computes a single sub-matrix $C_{sub}$.
- Each thread of the individual blocks computes a single element of $C_{sub}$.

[Figure: division of the matrices into blocks; label: BLOCK_SIZE.]

## CUDA - Example - Matrix Multiplication 5/8

CUDA - Implementation - main function

```cuda
void cuda_multiply(const int n,
      const float *hostA, const float *hostB, float *hostC)
{
   const int size = n * n * sizeof(float);
   float *devA, *devB, *devC;

   // allocate the matrices in the device (GPU) memory
   cudaMalloc((void**)&devA, size);
   cudaMalloc((void**)&devB, size);
   cudaMalloc((void**)&devC, size);

   // host memory -> device memory
   cudaMemcpy(devA, hostA, size, cudaMemcpyHostToDevice);
   cudaMemcpy(devB, hostB, size, cudaMemcpyHostToDevice);

   dim3 threads(BLOCK_SIZE, BLOCK_SIZE); // BLOCK_SIZE == 16
   dim3 grid(n / threads.x, n / threads.y);

   // Call kernel function matrixMul
   matrixMul<<<grid, threads>>>(n, devA, devB, devC);

   // host memory <- device memory
   cudaMemcpy(hostC, devC, size, cudaMemcpyDeviceToHost);

   // the rest of the function is not part of the extracted slide;
   // releasing the device memory is the expected cleanup
   cudaFree(devA);
   cudaFree(devB);
   cudaFree(devC);
}
```

## CUDA - Example - Matrix Multiplication 6/8

CUDA implementation - kernel function

```cuda
__global__ void matrixMul(int n, float* A, float* B, float* C) {
   int bx = blockIdx.x; int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;
   int aBegin = n * BLOCK_SIZE * by; // beginning of sub-matrix in the block
   int aEnd = aBegin + n - 1; // end of sub-matrix in the block
   float Csub = 0;
   for (
         int a = aBegin, b = BLOCK_SIZE * bx;
         a <= aEnd;
         a += BLOCK_SIZE, b += BLOCK_SIZE * n
       ) {
      __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; // shared memory within
      __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE]; // the block
      As[ty][tx] = A[a + n * ty + tx]; // each thread reads a single element
      Bs[ty][tx] = B[b + n * ty + tx]; // of the matrix to the memory
      __syncthreads(); // synchronization, sub-matrix in the shared memory

      for (int k = 0; k < BLOCK_SIZE; ++k) { // each thread computes
         Csub += As[ty][k] * Bs[k][tx]; // the element in the sub-matrix
      }
      __syncthreads();
   }
   int c = n * BLOCK_SIZE * by + BLOCK_SIZE * bx;
   C[c + n * ty + tx] = Csub; // write the results to memory
}
```

## CUDA - Example - Matrix Multiplication 7/8

Computational time (in milliseconds)

| N | Naive | Transp. | CUDA |
|---|---|---|---|
| 208 | 11 | 11 | 82 |
| 304 | 35 | 33 | |
| 1104 | 6360 | 1628 | 235 |
| 1264 | 9763 | 2485 | 308 |

- Matlab 7.6.0 (R2008a): `n=1104; A=rand(n,n); B=rand(n,n); tic; C=A*B; toc`

## CUDA - Example - Matrix Multiplication 8/8

- CUDA source codes.
  - Example - Dedicated source file `cuda_func.cu`

1. Declaration of the external function.

```cpp
extern "C" { // declaration of the external function (cuda kernel)
   void cuda_multiply(const int n, const float *A, const float *B, float *C);
}
```

2. Compile the CUDA code to the C++ code.

```
nvcc --cuda cuda_func.cu -o cuda_func.cu.cc
```

3. Compilation of the `cuda_func.cu.cc` file using a standard compiler.
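Returning to the kernel function `matrixMul` of slide 6/8, its index arithmetic may be easier to follow when the grid, the blocks, and the threads are emulated sequentially on the CPU. The sketch below is plain C, not CUDA, and it is an illustration under stated assumptions rather than the lecture code: the names `emulate_block` and `emulate_grid` are hypothetical, `BLOCK_SIZE` is set to 4 only to keep test matrices small (the lecture uses 16), and the loops over `ty` and `tx` stand in for the threads of one block.

```c
#include <assert.h>

#define BLOCK_SIZE 4 /* assumption for this sketch; the lecture uses 16 */

/* Emulates all threads of the block (bx, by): the block computes one
 * BLOCK_SIZE x BLOCK_SIZE sub-matrix of C = A * B (n x n, row-major). */
static void emulate_block(int n, int bx, int by,
                          const float *A, const float *B, float *C)
{
    float As[BLOCK_SIZE][BLOCK_SIZE];             /* stands in for __shared__ As */
    float Bs[BLOCK_SIZE][BLOCK_SIZE];             /* stands in for __shared__ Bs */
    float Csub[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}}; /* one accumulator per thread */

    const int aBegin = n * BLOCK_SIZE * by; /* first element of the row band of A */
    const int aEnd = aBegin + n - 1;        /* last element of its first row */

    /* a walks over the tiles of A to the right, b over the tiles of B downwards */
    for (int a = aBegin, b = BLOCK_SIZE * bx; a <= aEnd;
         a += BLOCK_SIZE, b += BLOCK_SIZE * n) {
        /* phase 1: each emulated thread (ty, tx) loads one element per tile */
        for (int ty = 0; ty < BLOCK_SIZE; ++ty) {
            for (int tx = 0; tx < BLOCK_SIZE; ++tx) {
                As[ty][tx] = A[a + n * ty + tx];
                Bs[ty][tx] = B[b + n * ty + tx];
            }
        }
        /* finishing the loops above plays the role of __syncthreads() */

        /* phase 2: each emulated thread accumulates its partial dot product */
        for (int ty = 0; ty < BLOCK_SIZE; ++ty) {
            for (int tx = 0; tx < BLOCK_SIZE; ++tx) {
                for (int k = 0; k < BLOCK_SIZE; ++k) {
                    Csub[ty][tx] += As[ty][k] * Bs[k][tx];
                }
            }
        }
    }

    /* each emulated thread writes its single element of the sub-matrix */
    const int c = n * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    for (int ty = 0; ty < BLOCK_SIZE; ++ty) {
        for (int tx = 0; tx < BLOCK_SIZE; ++tx) {
            C[c + n * ty + tx] = Csub[ty][tx];
        }
    }
}

/* Emulates the whole grid of (n / BLOCK_SIZE) x (n / BLOCK_SIZE) blocks. */
static void emulate_grid(int n, const float *A, const float *B, float *C)
{
    assert(n % BLOCK_SIZE == 0); /* n must be a multiple of the block size */
    for (int by = 0; by < n / BLOCK_SIZE; ++by) {
        for (int bx = 0; bx < n / BLOCK_SIZE; ++bx) {
            emulate_block(n, bx, by, A, B, C);
        }
    }
}
```

On the GPU the blocks and the threads within a block run concurrently, which is why the kernel needs the two `__syncthreads()` barriers; in this sequential emulation the ordering of the loops provides the same guarantee, so the sketch shows the data flow only and says nothing about performance.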