CUDA Basics#
CUDA enables massively parallel computation on NVIDIA GPUs. Programs consist of
host code (CPU) that launches kernel functions executing on the device (GPU).
Threads are organized into blocks, and blocks form a grid. Each thread identifies
itself using blockIdx and threadIdx to map to data elements.
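The examples below walk through these pieces using 1D indexing; the same hierarchy extends to 2D grids for image- or matrix-shaped data. A minimal illustrative sketch (the kernel name, d_out, width, and height are assumptions, not part of the examples that follow):
__global__ void index_2d(float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // row index
    if (x < width && y < height) {
        out[y * width + x] = (float)(y * width + x);  // row-major global index
    }
}
// Launch a 2D grid of 16x16 blocks covering the whole matrix
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
index_2d<<<grid, block>>>(d_out, width, height);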
Error Checking Macros#
CUDA API calls return error codes that should always be checked. Silent failures
lead to hard-to-debug issues where kernels produce wrong results or memory
operations fail without notice. These macros wrap API calls with automatic error
checking, printing the file, line number, and error description before exiting.
The LAUNCH_KERNEL macro uses cudaLaunchKernelEx which provides more
control over launch configuration than the <<<>>> syntax.
#define CUDA_CHECK(exp) \
do { \
cudaError_t err = (exp); \
if (err != cudaSuccess) { \
fprintf(stderr, "[%s:%d] %s failed: %s\n", __FILE__, __LINE__, #exp, cudaGetErrorString(err)); \
exit(1); \
} \
} while (0)
#define LAUNCH_KERNEL(cfg, kernel, ...) CUDA_CHECK(cudaLaunchKernelEx(cfg, kernel, ##__VA_ARGS__))
// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
cudaLaunchConfig_t cfg = {};
cfg.gridDim = dim3(blocks, 1, 1);
cfg.blockDim = dim3(threads, 1, 1);
cfg.stream = stream;
LAUNCH_KERNEL(&cfg, my_kernel, d_data, n);
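Kernel launches with the <<<>>> syntax return void, so launch errors only surface through cudaGetLastError, and errors raised during execution surface at the next synchronizing call. A sketch of the usual check around a plain <<<>>> launch:
my_kernel<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize());   // surfaces errors that occur during kernel execution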
Hello World#
The __global__ keyword declares a kernel function that runs on the GPU but
is callable from host code. The <<<blocks, threads>>> execution configuration
specifies how many thread blocks to launch and how many threads per block. Each
thread computes its unique global index using blockIdx.x * blockDim.x + threadIdx.x,
which maps threads to data elements. Without cudaDeviceSynchronize() or stream
synchronization, the host continues execution immediately after launching the
kernel, potentially accessing results before they’re ready.
__global__ void hello_kernel(int* output) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
output[idx] = idx;
}
// Launch: 2 blocks, 4 threads each = 8 total threads
hello_kernel<<<2, 4>>>(d_output);
cudaDeviceSynchronize(); // Wait for kernel to complete
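A complete host-side driver for this kernel might look like the following sketch (the buffer names and n = 8 match the 2 x 4 launch above; error checks are omitted for brevity):
#include <cstdio>
// hello_kernel as defined above
int main() {
    const int n = 8;
    int h_output[n];
    int* d_output;
    cudaMalloc(&d_output, n * sizeof(int));

    hello_kernel<<<2, 4>>>(d_output);
    cudaDeviceSynchronize();               // wait before reading results

    cudaMemcpy(h_output, d_output, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf("output[%d] = %d\n", i, h_output[i]);
    }
    cudaFree(d_output);
    return 0;
}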
Function Qualifiers#
CUDA provides function qualifiers to specify where functions execute and where they can be called from. Understanding these is essential for organizing code between host and device.
// __device__ - Device function: called from device, runs on device
__device__ float square(float x) { return x * x; }
// __host__ __device__ - Compiles for both host and device
__host__ __device__ float add(float a, float b) { return a + b; }
// __global__ - Kernel function: called from host, runs on device
__global__ void compute(float* out, float a, float b) {
out[0] = add(a, b); // Call __host__ __device__ function
out[1] = square(a); // Call __device__ function
}
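Because add is compiled for both targets, the same source runs on the CPU and inside a kernel; a small illustrative usage (d_out is an assumed device buffer of two floats):
float on_cpu = add(1.0f, 2.0f);        // ordinary host call
compute<<<1, 1>>>(d_out, 1.0f, 2.0f);  // the same add() runs on the device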
Vector Addition#
Vector addition demonstrates the most common CUDA pattern: one thread per data
element. Each thread independently computes one output value, achieving massive
parallelism with thousands of threads running simultaneously. The bounds check
if (i < n) is critical because the total number of threads (blocks × threads
per block) is typically rounded up to a multiple of the block size, creating
extra threads that would access memory out of bounds. The ceiling division
formula (n + threads - 1) / threads ensures enough blocks are launched to
cover all elements.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) { // Bounds check is critical
c[i] = a[i] + b[i];
}
}
int threads = 256;
int blocks = (n + threads - 1) / threads; // Ceiling division
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
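Combining this with the error-checking macros from above, a full host-side driver might look like this sketch (h_a, h_b, and h_c are assumed host arrays of n floats):
size_t bytes = n * sizeof(float);
float *d_a, *d_b, *d_c;
CUDA_CHECK(cudaMalloc(&d_a, bytes));
CUDA_CHECK(cudaMalloc(&d_b, bytes));
CUDA_CHECK(cudaMalloc(&d_c, bytes));
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
CUDA_CHECK(cudaGetLastError());

CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));  // cudaMemcpy synchronizes implicitly
CUDA_CHECK(cudaFree(d_a));
CUDA_CHECK(cudaFree(d_b));
CUDA_CHECK(cudaFree(d_c));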
Memory Management#
CUDA provides several memory allocation methods with different performance characteristics. Choosing the right one depends on access patterns and transfer requirements.
Device Memory - Standard GPU memory. Fastest for GPU computation but requires explicit transfers. Use for data that stays on GPU.
float* d_data;
cudaMalloc(&d_data, size);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
cudaFree(d_data);
Pinned (Page-Locked) Memory - Host memory that cannot be paged out.
Required for cudaMemcpyAsync to truly overlap transfers with computation. Transfers are
noticeably faster than from pageable memory (see the benchmark table below), but pinned
allocations reduce the memory available to the rest of the system.
float* h_pinned;
cudaMallocHost(&h_pinned, size);
cudaMemcpyAsync(d_data, h_pinned, size, cudaMemcpyHostToDevice, stream);
cudaFreeHost(h_pinned);
Host Registered Memory - Pins existing malloc’d memory. Useful when you can’t change allocation code but need async transfers. Has registration overhead, so avoid for short-lived buffers.
float* h_data = (float*)malloc(size);
cudaHostRegister(h_data, size, cudaHostRegisterDefault);
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream); // now behaves like pinned memory
cudaHostUnregister(h_data);
free(h_data);
Unified Memory - Single pointer accessible from both CPU and GPU. Driver automatically migrates pages on demand. Simplifies code but may have performance overhead from page faults. Best for prototyping or irregular access patterns.
float* data;
cudaMallocManaged(&data, size);
kernel<<<blocks, threads>>>(data); // GPU access
cudaDeviceSynchronize();
data[0] = 1.0f; // CPU access
cudaFree(data);
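When the access pattern is known ahead of time, the page-fault cost can often be reduced by prefetching managed memory before use; a sketch using cudaMemPrefetchAsync (device 0 and the default stream are assumptions):
int device = 0;
cudaMemPrefetchAsync(data, size, device, 0);            // migrate pages to the GPU up front
kernel<<<blocks, threads>>>(data);
cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, 0);   // migrate back before CPU access
cudaDeviceSynchronize();
data[0] = 1.0f;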
Zero-Copy Memory - GPU directly reads/writes host memory over PCIe. No explicit transfers needed, but each access has PCIe latency. Good for data accessed once or when GPU memory is limited. Poor for repeated access.
float* h_mapped;
float* d_mapped;
cudaHostAlloc(&h_mapped, size, cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);
kernel<<<blocks, threads>>>(d_mapped);
cudaFreeHost(h_mapped);
Summary (64 MB benchmark, run cuda-memory-benchmark to test on your system):
| Method | H2D Speed | D2H Speed | Use Case |
|---|---|---|---|
| Pageable (malloc) | ~10.6 GB/s | ~1.3 GB/s | Simple, non-critical |
| cudaMallocHost | ~13.3 GB/s | ~2.0 GB/s | Async transfers |
| cudaHostRegister | ~13.2 GB/s | ~2.1 GB/s | Pin existing memory |
| cudaMallocManaged | ~0.8 GB/s | ~0.4 GB/s | Prototyping, irregular |
RAII Pattern - Ensures proper cleanup:
struct CudaBuffer {
float* h_data;
float* d_data;
cudaStream_t stream;
CudaBuffer(size_t size) {
cudaStreamCreate(&stream);
cudaMallocHost(&h_data, size);
cudaMalloc(&d_data, size);
}
    ~CudaBuffer() {
        cudaFree(d_data);
        cudaFreeHost(h_data);
        cudaStreamDestroy(stream);
    }
    // Non-copyable: a shallow copy would double-free in the destructor
    CudaBuffer(const CudaBuffer&) = delete;
    CudaBuffer& operator=(const CudaBuffer&) = delete;
};
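Illustrative usage (the kernel, bytes, and launch dimensions are placeholders); all three resources are released automatically when buf leaves scope:
{
    CudaBuffer buf(bytes);
    // ... fill buf.h_data on the host ...
    cudaMemcpyAsync(buf.d_data, buf.h_data, bytes, cudaMemcpyHostToDevice, buf.stream);
    kernel<<<blocks, threads, 0, buf.stream>>>(buf.d_data);
    cudaStreamSynchronize(buf.stream);
}   // ~CudaBuffer frees device memory, pinned memory, and the stream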
Parallel Reduction#
Reduction is a fundamental parallel pattern for computing aggregates like sum,
min, max, or product across large arrays. The tree-based approach repeatedly
halves the number of active threads: in each iteration, thread i combines
its value with thread i + stride. Shared memory holds intermediate results
within each block, and __syncthreads() ensures all threads complete each
level before proceeding to the next. After the kernel, each block has produced
one partial result, which can be combined on the CPU or with a second reduction
kernel. The block size should be a power of two for this simple implementation.
#define BLOCK_SIZE 256
__global__ void reduce_sum(const float* input, float* output, int n) {
__shared__ float sdata[BLOCK_SIZE];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = (i < n) ? input[i] : 0.0f;
__syncthreads();
// Tree reduction: 128, 64, 32, 16, 8, 4, 2, 1
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
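As noted above, each block writes one partial sum; a sketch of the host side that finishes the reduction on the CPU (d_input and n are assumed to be set up already):
int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
float* d_partial;
CUDA_CHECK(cudaMalloc(&d_partial, blocks * sizeof(float)));
reduce_sum<<<blocks, BLOCK_SIZE>>>(d_input, d_partial, n);
CUDA_CHECK(cudaGetLastError());

float* h_partial = (float*)malloc(blocks * sizeof(float));
CUDA_CHECK(cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost));
float total = 0.0f;
for (int b = 0; b < blocks; b++) total += h_partial[b];   // combine per-block partial sums
free(h_partial);
CUDA_CHECK(cudaFree(d_partial));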
Streams#
CUDA streams are independent execution queues that enable concurrent operations.
Operations within a stream execute in order, but operations in different streams
can overlap. This allows hiding memory transfer latency by pipelining: while one
chunk of data is being processed by a kernel, the next chunk can be transferred
to the GPU, and the previous chunk’s results can be copied back to the host.
Pinned memory (cudaMallocHost) is mandatory for true asynchronous transfers;
pageable memory forces synchronous behavior even with cudaMemcpyAsync. The
optimal number of streams depends on the workload and hardware, but 2-4 streams
typically provide good overlap without excessive scheduling overhead.
constexpr int num_streams = 4;
cudaStream_t streams[num_streams];
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
}
// Pipeline: each stream handles a chunk independently
for (int i = 0; i < num_streams; i++) {
int offset = i * chunk_size;
cudaMemcpyAsync(d_data + offset, h_data + offset, chunk_bytes, cudaMemcpyHostToDevice, streams[i]);
kernel<<<blocks, threads, 0, streams[i]>>>(d_data + offset, chunk_size, i);
cudaMemcpyAsync(h_data + offset, d_data + offset, chunk_bytes, cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < num_streams; i++) {
cudaStreamSynchronize(streams[i]);
cudaStreamDestroy(streams[i]);
}
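The loop above assumes n splits evenly across the streams; when it does not, the last chunk can simply be clamped. A sketch of that adjustment (this_chunk and this_bytes are illustrative names):
for (int i = 0; i < num_streams; i++) {
    int offset = i * chunk_size;
    if (offset >= n) break;                                // nothing left for this stream
    int this_chunk = (n - offset < chunk_size) ? (n - offset) : chunk_size;
    size_t this_bytes = this_chunk * sizeof(float);
    cudaMemcpyAsync(d_data + offset, h_data + offset, this_bytes, cudaMemcpyHostToDevice, streams[i]);
    int blocks_i = (this_chunk + threads - 1) / threads;   // size the grid to the clamped chunk
    kernel<<<blocks_i, threads, 0, streams[i]>>>(d_data + offset, this_chunk, i);
    cudaMemcpyAsync(h_data + offset, d_data + offset, this_bytes, cudaMemcpyDeviceToHost, streams[i]);
}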