from numba import jit, cuda
import math
import numpy as np
@cuda.jit
def multiply_kernel(x, y, out):
= cuda.grid(1)
i if i < x.shape[0]:
# Perform the addition
= x[i] * y[i]
result = result out[i]
3 Python and CUDA
One of the most popular libraries for writing CUDA code in Python is Numba, a just-in-time, type-specific, function compiler that runs on either CPU or GPU.
There is also pyCUDA, which requires writing C code within Python and generally requires more modifications than Numba. We will focus on Numba since it is easier to begin with.
= 3200
n = np.arange(n).astype(np.int32)
x print(x)
[ 0 1 2 ... 3197 3198 3199]
= np.full_like(x, 10)
y print(y)
[10 10 10 ... 10 10 10]
= cuda.to_device(x)
d_x = cuda.to_device(y)
d_y = cuda.device_array_like(d_x) d_out
= 256
threads_per_block # math.ceil rounds up to the nearest integer, ensuring we have enough blocks to cover all elements.
= math.ceil(n / threads_per_block)
blocks_per_grid print(f"threads per block: {threads_per_block}, blocks per grid: {blocks_per_grid}")
threads per block: 256, blocks per grid: 13
multiply_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)
cuda.synchronize()
# Copy the result back to the host
= d_out.copy_to_host()
result
print("First 10 results:", result[:10])
print("Last 10 results:", result[-10:])
First 10 results: [ 0 10 20 30 40 50 60 70 80 90]
Last 10 results: [31900 31910 31920 31930 31940 31950 31960 31970 31980 31990]
/usr/local/lib/python3.10/dist-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 13 will likely result in GPU under-utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
3.1 Mandelbrot set
The Mandelbrot set is a two-dimensional, geometric representation of a fractal and can create visually stunning images.
In the following code, we see
c_real = min_x + (max_x - min_x) * col / width
c_imag = min_y + (max_y - min_y) * row / height
These lines map pixel coordinates to a point in a complex plane.
We iterate using the function f(z) = z^2 + c
, where z and c are complex numbers.
x = 0 # Real part of z
y = 0 # Imaginary part of z
for i in range(max_iters):
if x*x + y*y > 4.0:
break
x_new = x*x - y*y + c_real
y = 2*x*y + c_imag
x = x_new
By checking if the absolute value of z is greater than 2. If |z| > 2, we know it will escape to infinity.
With, x_new = x*x - y*y + c_real
we calculate the real part of z^2 + c. For a complex number z = a + bi, the real part of z^2 is a^2 - b^2.
y = 2*x*y + c_imag
: This calculates the imaginary part of z^2 + c.
The imaginary part of z^2 is 2ab.
Convergence Test:
The Mandelbrot set is defined as the set of points c in the complex plane for which the function f(z) = z^2 + c does not diverge when iterated from z = 0. In practice, we use a finite number of iterations (max_iters) and check if |z| <=
If the iteration reaches max_iters without |z| exceeding 2, we consider the point to be in the Mandelbrot set. The number of iterations before |z| > 2 determines the color of points outside the set, creating the characteristic fractal patterns.
output[row, col] = i
stores the number of iterations it took for the point to escape – this value is used for coloring the plot.
The Mandelbrot set can be useful for learning CUDA concepts since we can parallelize across threads and use memory access patterns, as well as GPU-CPU interaction. The calculations are done on the GPU, then sent back to the CPU for the display plot.
import numpy as np
from numba import cuda
import matplotlib.pyplot as plt
# CUDA kernel function
@cuda.jit
def mandelbrot_kernel(min_x, max_x, min_y, max_y, width, height, max_iters, output):
# Get the 2D thread position within the grid
= cuda.grid(2)
row, col
# Check if the thread is within the image bounds
if row < height and col < width:
= min_x + (max_x - min_x) * col / width
c_real = min_y + (max_y - min_y) * row / height
c_imag
# Mandelbrot set iteration
= 0
x = 0
y for i in range(max_iters):
# Check if the point has escaped
if x*x + y*y > 4.0:
break
# Update x and y
= x*x - y*y + c_real
x_new = 2*x*y + c_imag
y = x_new
x
# Store the number of iterations in the output array
= i
output[row, col]
# Set up and launch kernel
def plot_mandelbrot(min_x, max_x, min_y, max_y, width, height, max_iters):
# Output array
= np.zeros((height, width), dtype=np.uint32)
output # Thread block and grid dimensions
= (16, 16)
threads_per_block = ((width + threads_per_block[0] - 1) // threads_per_block[0],
blocks_per_grid + threads_per_block[1] - 1) // threads_per_block[1])
(height
# Launch kernel
mandelbrot_kernel[blocks_per_grid, threads_per_block](
min_x, max_x, min_y, max_y, width, height, max_iters, output
)
# Plot image
='hot', extent=[min_x, max_x, min_y, max_y])
plt.imshow(output, cmap
plt.colorbar()'Mandelbrot Set')
plt.title(
plt.show()
# Example usage
-2, 1, -1.5, 1.5, 1000, 1000, 100)
plot_mandelbrot(
# Things to try: Modify the code to zoom into an interesting part of the Mandelbrot set
# Hint: Adjust the min_x, max_x, min_y, max_y parameters
/usr/local/lib/python3.10/dist-packages/numba/cuda/cudadrv/devicearray.py:888: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
warn(NumbaPerformanceWarning(msg))