1 Getting to know the device and platform
GPUs are designed to tackle tasks that can be expressed as data-parallel computations, such as:
- genomics
- data analytics
- rendering pixels and vertices in graphics
- video encoding and decoding
- arithmetic operations, e.g. matrix multiplications for neural networks
When getting to know the device, it can be helpful to have a quick look at PTX, SASS, warps, cooperative groups, Tensor Cores and memory hierarchy.
1.1 Abstractions
CUDA, or “Compute Unified Device Architecture” as it was introduced in 2006, is a parallel computing platform and programming model that uses the parallel compute engine in NVIDIA GPUs to solve computational tasks.
There are three principal abstractions:
- hierarchy of thread groups
- shared memories
- barrier synchronization
Basically, CUDA allows developers to partition problems into sub-problems that can be solved by blocks of threads running in parallel. All threads run the same kernel code, and each thread’s ID is used to compute memory addresses and to make control decisions.
Threads are arranged as a grid of thread blocks.
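To make the hierarchy of thread groups, shared memory and barrier synchronization concrete, here is a minimal sketch (the kernel name blockReverse and the sizes are arbitrary choices for illustration, not taken from any particular source): each block copies its tile of the input into shared memory, waits at the barrier, then writes the tile back reversed.

```cuda
#include <cstdio>

#define TILE 256

__global__ void blockReverse(const int *in, int *out, int n) {
    __shared__ int tile[TILE];                     // shared memory, visible to the whole block
    int local  = threadIdx.x;                      // this thread's ID within its block
    int global = blockIdx.x * blockDim.x + local;  // this thread's ID within the grid

    if (global < n) tile[local] = in[global];
    __syncthreads();                               // barrier: wait until the tile is fully loaded

    if (global < n) out[global] = tile[blockDim.x - 1 - local];  // reversed within the block
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in,  n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    // A grid of thread blocks: n / TILE blocks of TILE threads each.
    blockReverse<<<n / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %d (expected %d)\n", out[0], TILE - 1);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The <<<grid, block>>> launch configuration is exactly the “grid of thread blocks” described above.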
For a visualization of the specifications and use cases of some of today’s GPUs, please see this website.
1.2 PTX
PTX (Parallel Thread eXecution) is a low-level virtual machine and instruction set architecture (ISA). In other words, it exposes the GPU as a data-parallel computing device.
1.2.1 Programming model
PTX’s programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A CTA, or cooperative thread array, is an array of threads that execute a kernel concurrently or in parallel.
1.3 SASS
SASS is the low-level assembly language that compiles to binary microcode, which executes natively on NVIDIA GPUs.
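An easy way to see both PTX and SASS for yourself is to compile a throwaway kernel and dump the compiler’s output with the standard toolkit tools, nvcc and cuobjdump (the file names and the sm_80 target below are placeholders):

```cuda
// inspect.cu -- a trivial kernel used only to look at the compiler's output.
//
//   nvcc -arch=sm_80 -ptx   inspect.cu -o inspect.ptx    # emit PTX, the virtual ISA
//   nvcc -arch=sm_80 -cubin inspect.cu -o inspect.cubin  # compile to a device binary
//   cuobjdump -sass inspect.cubin                        # disassemble the native SASS

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) x[i] *= s;
}
```

The PTX is what the driver can JIT-compile for newer architectures; the SASS is what the SMs actually execute.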
1.4 High level architecture
GPUs have a highly parallel processor architecture, comprising processing elements and a memory hierarchy. Streaming multiprocessors do the work on the data, and that data and code are accessed from high-bandwidth memory (HBM3 in the diagram) via the L2 cache.
The A100 GPU, for example, has 108 SMs, a 40 MB L2 cache, and up to 2,039 GB/s of bandwidth from 80 GB of HBM2e memory.
NVLink Network Interconnect enables GPU-to-GPU communication among up to 256 GPUs across multiple compute nodes.
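If you want to see these numbers for the GPU you are actually running on, the CUDA runtime can report them; below is a minimal sketch using cudaGetDeviceProperties (the output formatting is just an illustration).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  SMs:           %d\n",  prop.multiProcessorCount);
        printf("  L2 cache:      %d MB\n", prop.l2CacheSize >> 20);
        printf("  Global memory: %zu GB\n", prop.totalGlobalMem >> 30);
        printf("  Warp size:     %d\n",  prop.warpSize);
    }
    return 0;
}
```

On an A100, for example, this should report the 108 SMs and 40 MB L2 cache quoted above.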
1.5 Streaming multiprocessors
Each Streaming Multiprocessor has a set of execution units, a register file and some shared memory.
We also notice the warp scheduler. A warp is the basic unit of execution: a group of threads, typically 32, that is executed together by an SM. The warp scheduler decides which warps issue their next instructions each cycle.
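Because the threads of a warp execute together, they can exchange data directly through registers with warp-level primitives. Here is a minimal sketch of a per-warp sum using __shfl_down_sync (kernel only, launch omitted; the name warpSum is made up, and blockDim.x is assumed to be a multiple of 32):

```cuda
// Each warp of 32 threads sums its 32 input values without using shared memory.
__global__ void warpSum(const float *in, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[idx];

    // Tree reduction across the 32 lanes of the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Lane 0 of each warp now holds that warp's total.
    if ((threadIdx.x & 31) == 0)
        out[idx / 32] = v;
}
```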
Tensor Cores are specialized units focused on speeding up deep learning workloads. They excel at mixed-precision matrix multiply-and-accumulate calculations.
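Tensor Cores are programmed from CUDA C++ through the warp-level wmma API in mma.h. The sketch below shows one warp multiplying a single 16×16 FP16 tile pair and accumulating into FP32; the kernel name and the single-tile setup are illustrative only, and it needs a Tensor-Core-capable GPU (sm_70 or newer).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile:
// A and B are FP16, the accumulator is FP32 -- the classic mixed-precision setup.
// Launch with a single warp, e.g. tileMma<<<1, 32>>>(dA, dB, dD);
__global__ void tileMma(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the Tensor Core multiply-accumulate
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}
```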
1.6 TeraFlops
It’s worth familiarizing ourselves with TFLOPS, which stands for Trillion Floating Point Operations Per Second. This is commonly used to measure the performance of GPUs.
1 TFLOPS = 1 trillion floating point operations per second (what’s a floating point number? Just a number with a decimal point, e.g. 1.2 or 12.3456)
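For a back-of-the-envelope feel for that scale, the sketch below (plain host code; the 4096 matrix size and the round 100 TFLOPS rate are arbitrary numbers, not a benchmark) estimates how long a dense matrix multiplication would take at a given rate:

```cuda
#include <cstdio>

int main() {
    double n     = 4096.0;            // matrix dimension (arbitrary example)
    double flops = 2.0 * n * n * n;   // a dense N x N matmul costs roughly 2*N^3 FLOPs
    double rate  = 100e12;            // 100 TFLOPS, a round illustrative figure
    printf("~%.2f ms\n", 1e3 * flops / rate);   // prints roughly 1.37 ms
    return 0;
}
```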
The H100 with the SXM5 board form factor can perform 133.8 TFLOPS on FP16 inputs. FP16 just means half precision: 1 bit for the sign (+, -), 5 bits for the exponent, and 10 bits for the mantissa (the fractional part). This is a very popular format for AI training and inference, since it trades a minimal drop in accuracy for better speed and memory efficiency.
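To make that bit layout concrete, here is a minimal sketch (plain host code, no GPU needed) that decodes a raw FP16 bit pattern by hand; it handles normal numbers only, ignoring subnormals, infinities and NaNs.

```cuda
#include <cstdint>
#include <cstdio>
#include <cmath>

// Decode a 16-bit half-precision pattern: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
double decode_fp16(uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;   // 5 bits, biased by 15
    int mantissa =  bits        & 0x3FF;  // 10 bits of fraction
    double value = (1.0 + mantissa / 1024.0) * std::pow(2.0, exponent - 15);
    return sign ? -value : value;
}

int main() {
    // 0x3E00 = 0 01111 1000000000 -> (+1) * 2^0 * (1 + 0.5) = 1.5
    printf("0x3E00 decodes to %g\n", decode_fp16(0x3E00));
    return 0;
}
```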