GPUs

Overview

Teaching: 35 min
Exercises: 25 min
Questions
  • What are GPUs and how do we access them?

  • How can we use a GPU with Numba?

  • How can we use a GPU in Pandas, Numpy or SciKit Learn?

Objectives
  • Understand the difference between a GPU and a CPU and its performance implications

  • Apply Numba to use a GPU

  • Understand that there are GPU enabled replacements for many popular Python libraries

  • Recall that NVIDIA GPUs can be programmed directly in CUDA, but this is very low-level programming

What are GPUs and why should we use them?

GPUs (Graphics Processing Units) contain large numbers of very simple processing cores. Each core is far less capable than a CPU core, but together they can work through highly parallel workloads, such as array operations and machine learning, much faster than a CPU.

How can you access a GPU if your PC doesn’t have one?

Many laptops and desktops won’t have a very powerful GPU, so instead we’ll want to use HPC or cloud systems to access one. If you don’t have access to any services which offer a GPU then you can use Google Colab (https://colab.research.google.com). This offers a Jupyter notebook interface with free GPUs, although they aren’t very powerful. You can also pay for Google Colab to get access to faster GPUs.

Orchid

JASMIN has a cluster called Orchid which has 16 nodes with 72 NVIDIA A100 GPUs between them. These are accessed via the Slurm batch scheduler. For more experimental work there are some A100 GPUs attached to the JASMIN notebook service. To use these you must be granted access to Orchid and select the GPU option when connecting.

Checking what GPUs are available to us

Systems with NVIDIA GPUs usually have a command called nvidia-smi installed, which will tell us some information about the GPUs attached to the system. We can invoke this either from a Jupyter terminal or in a notebook cell with the ! prefix.

nvidia-smi
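In a notebook cell the same command can be run by prefixing it with !:

!nvidia-smi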

On the JASMIN notebooks service this will return something similar to this.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:41:00.0 Off |                   On |
| N/A   34C    P0             88W /  400W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off |   00000000:C1:00.0 Off |                   On |
| N/A   30C    P0             85W /  400W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    5   0   0  |              13MiB /  9856MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    5   0   0  |              13MiB /  9856MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

There are two key parts to the output here. The top part shows we have two NVIDIA A100 cards with 40GB of GPU RAM each. However we don’t have exclusive use of these: they have been partitioned into smaller virtual GPUs known as MIGs (Multi-Instance GPUs). Each of these only has 16GB of GPU RAM and we’re restricted to just 10GB of that. This is still more than most desktop GPUs and is sufficient for many tasks. Where we might need more memory we will have to move our code over to a regular Python script running on Orchid’s batch system.
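For reference, submitting such a script as a batch job on Orchid would look something like the sketch below. This is only an illustration: the partition and account names, the resource requests and the script name (my_gpu_script.py) are assumptions here, so check the JASMIN Orchid documentation for the exact values to use.

#!/bin/bash
#SBATCH --partition=orchid   # assumed partition name - check the JASMIN Orchid docs
#SBATCH --account=orchid     # assumed account name granted with Orchid access
#SBATCH --gres=gpu:1         # request a single GPU
#SBATCH --time=00:10:00

# activate the environment containing cupy/numba (the exact command depends on your setup)
source activate esces

python my_gpu_script.py      # hypothetical script containing the GPU code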

Checking the GPUs available to use from Python

The numba library provides an interface to CUDA, which is NVIDIA’s low-level library for programming its GPUs. To get a list of the available GPUs we can just call cuda.detect().

from numba import cuda

cuda.detect()

Check what GPUs you have access to

Ensure that you have CUDA installed; this can be installed by adding the cupy and cudatoolkit packages to your Conda/Mamba environment. Use Numba’s cuda module to check what version of CUDA you have installed and what GPUs you have available.

Solution

mamba install -p ~/.conda/envs/esces cupy cudatoolkit

from numba import cuda
print(cuda.runtime.get_version())  # the CUDA runtime version as a (major, minor) tuple
cuda.detect()                      # list the GPUs available to us

Using GPUs

NVIDIA have GPU-accelerated, drop-in replacements for Pandas, NumPy and scikit-learn. The replacement for NumPy and SciPy is known as CuPy, while the Pandas and scikit-learn equivalents are called cuDF and cuML.
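As a quick illustration of what “drop-in” means, here is a minimal sketch using cuDF in place of Pandas. This assumes the cudf package is installed (for example from the RAPIDS conda channels); it is not part of the environment used in this lesson.

import cudf

# build a small DataFrame directly on the GPU and compute a grouped mean
df = cudf.DataFrame({"value": [1.0, 2.0, 3.0, 4.0],
                     "group": ["a", "a", "b", "b"]})
print(df.groupby("group")["value"].mean())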

Let’s do a calculation using NumPy.

import numpy as np
a = np.random.random(100_000_000)
result_np = np.mean(a)

Now let’s try to do the same thing with CuPy. We’ll use the same array we just created and copy it to the GPU: CuPy’s asarray function takes in a NumPy array and converts it to a CuPy array on the GPU.

import cupy as cp
b = cp.asarray(a)
result_cp = cp.mean(b)

Let’s time how long this takes. For the NumPy code we can use %time or %timeit. Unfortunately %time and %timeit don’t work properly with GPUs, as calling a GPU function returns immediately while the code continues to run on the GPU. So we’ll have to take a different approach to measuring the time taken and use CuPy’s built-in profiler, which includes a function called benchmark. %timeit will automatically decide how many runs to do, defaulting to 7 repeats, whereas benchmark needs to be told how many times to repeat with the n_repeat parameter.

from cupyx.profiler import benchmark
gpu_times = benchmark(cp.mean, (b,), n_repeat=7)
print(gpu_times)

To ensure we make a comparable run using %timeit we can use its -n and -r options to control how many times it runs too.

%timeit -n 1 -r 7 result_np = np.mean(a)

On the JASMIN notebook service this gives the following output for CuPy:

mean                :    CPU:    54.157 us   +/-  3.656 (min:    50.439 / max:    61.728) us     GPU-0:  2158.267 us   +/-  3.639 (min:  2154.080 / max:  2165.024) us

and for NumPy:

66.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)

So we have used about 2ms of GPU time (plus 54us of CPU time) to take the mean of 100,000,000 numbers on the GPU, compared with 66ms on the CPU, so that’s roughly a 33-fold speedup! However, this does not include the time spent copying the array to the GPU.

Measuring the time taken to copy data to the GPU memory

As previously mentioned we can’t use %timeit or %time to measure how long GPU operations take. We could use benchmark to do this, but benchmark discards the function’s return value, so we would need a subsequent call to asarray to actually get the copied array. An alternative is to record the system clock time before and after the asarray call, but for this to work we must synchronize with the GPU after the asarray call to ensure it has really finished.

import time

t0 = time.time()
b = cp.asarray(a)  # copy the NumPy array to the GPU
cp.cuda.stream.get_current_stream().synchronize()  # wait for the copy to finish
print(str((time.time() - t0) * 1_000_000) + "us")

On JASMIN this takes around 100,000us, or 100ms. So we need to add this to the computation time, which was just 2ms. As the CPU version only took 66ms, it is actually faster to do this calculation on the CPU. But this is a very simple example where we’ve only done one very simple operation on quite a large amount of data.

Create random numbers with CuPy

So far we have created random numbers using NumPy on the CPU and copied these to the GPU. A more efficient way to do this might be to make the random numbers on the GPU. Adjust the code to use CuPy to create 100,000,000 random numbers directly on the GPU. Use the time library (or cupyx’s benchmark) to measure how long this takes. Is this quicker than making the random numbers on the CPU?

Solution

import cupy as cp
import time

t0 = time.time()
b = cp.random.random(100_000_000)  # generate the random numbers directly on the GPU
cp.cuda.stream.get_current_stream().synchronize()  # wait for the GPU to finish
print(str((time.time() - t0) * 1_000) + "ms")

# alternative using benchmark
from cupyx.profiler import benchmark
print(benchmark(cp.random.random, (100_000_000,), n_repeat=7))

cp.mean(b)

Using GPUs with Numba

Numba code can be converted to run on a GPU using the @cuda.jit decorator, which is similar to the @jit decorator we saw earlier on. However, there are a few alterations the code might need first due to the way GPUs operate. Firstly, the functions we run on the GPU (known as kernels) can’t return anything; instead we must pass an extra parameter containing an array into which we will save any results.

Here is an example similar to the function we JIT compiled for the CPU earlier on.

@cuda.jit
def sum_arr_example(a, r):
    # sum the diagonal elements, writing the result into the output array r
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i, i]
    r[0] = total

a = cp.arange(10_000).reshape(100, 100)
result_gpu = cuda.device_array((1,), np.float64)  # output array allocated on the GPU

sum_arr_example[1, 1](a, result_gpu)  # launch with 1 block of 1 thread

result_host = result_gpu.copy_to_host()  # copy the result back to the CPU
print(result_host)
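The example above launches a single thread ([1, 1]), so it doesn’t actually exploit the GPU’s parallelism. To do that we normally launch many threads and let each one handle a single element, using cuda.grid to find each thread’s global index. The following is a minimal sketch along those lines; it is not part of the original lesson, and the kernel and array names are just illustrative.

@cuda.jit
def add_one(a, out):
    i = cuda.grid(1)          # this thread's global index
    if i < a.shape[0]:        # guard against the extra threads in the last block
        out[i] = a[i] + 1.0

a_host = np.arange(1_000_000, dtype=np.float64)
a_dev = cuda.to_device(a_host)             # copy the input to the GPU
out_dev = cuda.device_array_like(a_dev)    # allocate the output on the GPU

threads_per_block = 256
blocks = (a_host.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](a_dev, out_dev)

print(out_dev.copy_to_host()[:5])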

Further Reading

This has only been a very quick introduction to GPUs, but hopefully it has shown you some of the potential they offer and some simple ways to use them.

Key Points

  • GPUs are Graphics Processing Units; they have large numbers of very simple processing cores and are suited to some parallel tasks like machine learning and array operations

  • Many laptops and desktops won’t have a very powerful GPU, so instead we’ll want to use HPC or cloud systems to access one.

  • Google’s Colab provides free access to GPUs with a Jupyter notebook interface.

  • Numba can use GPUs with minor modifications to the code.

  • NVIDIA have drop-in replacements for Pandas, NumPy and scikit-learn that are GPU accelerated.

  • For a GPU to access data it must be copied into the GPU’s memory. This can sometimes be a major bottleneck to GPU operations.