PyCUDA, Google Colab and the GPU

Vitality Learning
5 min read · Mar 18, 2021


In this post, which follows The evolution of a GPU: from gaming to computing and Understanding the architecture of a GPU, we provide an introduction to the PyCUDA library and to the Google Colaboratory environment, along with a short PyCUDA sample that can also be run on Google Colab.

What is PyCUDA?

PyCUDA is a library developed by Andreas Klöckner et al. that allows writing CUDA code and compiling, optimizing and using it as ordinary Python functions, in a way that is totally transparent to the User. The User does not need to deal with the CUDA compiler unless this is explicitly requested.

PyCUDA uses the concept of GPU run-time code generation (RTCG), enabling the execution of low-level code launched from the high-level scripting language offered by Python. The use of RTCG increases the User’s productivity from different points of view.

A first advantage of RTCG is the possibility of low-level programming: CUDA kernels are written only for the portions of the code to be accelerated, while all the functionality of a high-level language, such as graphics or I/O, remains available for the remaining ones. Furthermore, RTCG enables a run-time code optimization instead of a compile-time one. The former occurs at a more favorable time, when all the information on the machine on which the code must be executed is available. Also, the result of the compilation process is cached and reused when possible, with recompilation initiated only when necessary. This is illustrated in the figure below, where the compilation and caching operations in the gray box are performed transparently to the User.

PyCUDA GPU program compilation workflow.
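As a taste of how this workflow looks in practice, the minimal sketch below (the scaleArray kernel and the scale factor are just illustrative choices of ours) builds the kernel source as a Python string at run time, baking in a value known only at that moment, and hands it to pycuda.compiler.SourceModule for compilation:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# The scale factor becomes known only at run time and is baked
# directly into the kernel source before compilation (RTCG)
scale = 3.0
kernel_source = """
__global__ void scaleArray(float *d_a, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) d_a[tid] = %f * d_a[tid];
}
""" % scale

# SourceModule invokes the CUDA compiler behind the scenes; the
# compiled binary is cached and reused on later runs with the same source
mod = SourceModule(kernel_source)
scaleArray = mod.get_function("scaleArray")

# Allocate and fill an array on the GPU, then launch the kernel
N = 1024
a = np.random.randn(N).astype(np.float32)
d_a = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(d_a, a)
scaleArray(d_a, np.int32(N), block=(256, 1, 1), grid=((N + 255) // 256, 1, 1))

# Copy the result back and check it against the CPU computation
result = np.empty_like(a)
cuda.memcpy_dtoh(result, d_a)
assert np.allclose(result, scale * a)

Note that running such a cell a second time returns almost immediately: the compiled binary has been cached, as described above.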

Finally, it is possible to fully exploit the potential of the CUDA libraries, either thanks to the many wrappers available in publicly available libraries or by constructing such wrappers oneself.

A second advantage of RTCG is the possibility of using, within certain limits, a high-level, mathematical-like syntax for GPU execution.
The potential of the PyCUDA library is illustrated with simple examples in the post “Five different ways to sum vectors in PyCUDA”.
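As a taste of this syntax, the following minimal sketch (the array size and values are arbitrary) combines vectors on the GPU with NumPy-like expressions, without writing a single CUDA kernel:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# gpuarray mimics the NumPy syntax: the expression below is evaluated
# element-wise on the GPU
d_x = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
d_y = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
d_z = 2 * d_x + d_y    # computed on the GPU
z = d_z.get()          # copied back to the host as a NumPy array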

Let us now spend a few words on the Google Colaboratory environment.

Google Colaboratory

Google Colaboratory, or Colab, is a totally free development environment based on Jupyter Notebook.

Jupyter Notebook is a free, open-source web application that permits creating and sharing documents containing code, equations, text, plots, tables and images, and that enables sharing code on the GitHub platform. In particular, the code can be modified and executed in real time at any moment. The work can later be exported as Python or ipynb source, where ipynb is a format capable of hosting all the content of a Jupyter Notebook web application session, including the inputs and outputs of the computations, the images and the comments, and which can in turn be exported as html, pdf and LaTeX.

Google Colab supports Python 2.7 and 3.6, requires no configuration and accommodates CPU, GPU or Tensor Processing Unit (TPU) execution, depending on the needs. It hosts libraries like PyTorch, TensorFlow, Keras and OpenCV, so that it is widely used for Machine Learning, Deep Learning and also Computer Vision experiments. It is possible, however, to also install other modules if necessary. In order to exploit Google Colab, it is enough to have a Google account, and all the work can be saved on Google Drive.

The first screen displayed when Google Colab is launched is a welcome project presenting the different possibilities offered by the platform.

Google Colab welcome page.

It is possible to create a new notebook, for example in Python 3.6, by selecting New notebook from the File menu. To enable the current session to use the GPU, it is enough to click on Change runtime type in the Runtime menu and select GPU from the Hardware accelerator drop-down menu. From now on, it is understood that this selection has been made, as it is required to correctly run the example shown below.
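Once the GPU runtime has been selected, it is possible to check which GPU card has been assigned to the session by running the nvidia-smi command in a notebook cell:

!nvidia-smi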

Dumping the GPU properties

A first, very simple example dumps the properties of the GPU card in use. The example is shown in full below.

import pycuda.driver as cuda
import pycuda.autoinit

# Count the available devices and get a handle to device number 0
print("%d device(s) found." % cuda.Device.count())
dev = cuda.Device(0)

# Dump name, compute capability and total memory of the device
print("Device: %s" % dev.name())
print(" Compute Capability: %d.%d" % dev.compute_capability())
print(" Total Memory: %s KB" % (dev.total_memory() // 1024))

# Collect all the device attributes and sort them alphabetically
atts = [(str(att), value)
        for att, value in dev.get_attributes().items()]
atts.sort()

for att, value in atts:
    print(" %s: %s" % (att, value))

However, before launching the code, it is necessary to install PyCUDA in the Google Colab environment. This can be done with the following snippet:

!pip install pycuda

Going back to the code above, it illustrates the typical workflow of a PyCUDA code.

In particular, the first step is to load the libraries as in a standard Python code. In the above example, two libraries are imported:

  1. pycuda.driver: it contains functions for memory handling (allocation, deallocation, transfers), for dumping information on the GPU card, etc.; in the example, pycuda.driver is given the cuda shorthand;
  2. pycuda.autoinit: it is not given a shorthand; importing it takes care of device initialization, context creation and memory cleanup (a manual equivalent is sketched below).
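For reference, the following minimal sketch shows roughly what importing pycuda.autoinit does under the hood (it additionally registers a cleanup handler that releases the context at exit):

import pycuda.driver as cuda

# Manual equivalent of importing pycuda.autoinit
cuda.init()                # initialize the CUDA driver API
dev = cuda.Device(0)       # pick the first available GPU
ctx = dev.make_context()   # create a context and make it current
# ... GPU work goes here ...
ctx.pop()                  # release the context when done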

The first operation performed in the above listing is counting the number of available devices by means of cuda.Device.count(). The second is getting a handle to the only available GPU, namely, GPU number 0, in the dev variable. Later on, the GPU name returned by dev.name(), the compute capability returned by dev.compute_capability() and the total memory in bytes returned by dev.total_memory() (displayed in KB after division by 1024) are shown. Finally, all the GPU attributes are sorted alphabetically and displayed on screen. An example of output is shown below

1 device(s) found.
Device: Tesla P100-PCIE-16GB
Compute Capability: 6.0
Total Memory: 16671616 KB
ASYNC_ENGINE_COUNT: 2
CAN_MAP_HOST_MEMORY: 1
CLOCK_RATE: 1328500
COMPUTE_CAPABILITY_MAJOR: 6
COMPUTE_CAPABILITY_MINOR: 0
COMPUTE_MODE: DEFAULT
CONCURRENT_KERNELS: 1
ECC_ENABLED: 1
GLOBAL_L1_CACHE_SUPPORTED: 1
GLOBAL_MEMORY_BUS_WIDTH: 4096
GPU_OVERLAP: 1
INTEGRATED: 0
KERNEL_EXEC_TIMEOUT: 0
L2_CACHE_SIZE: 4194304
LOCAL_L1_CACHE_SUPPORTED: 1
MANAGED_MEMORY: 1
MAXIMUM_SURFACE1D_LAYERED_LAYERS: 2048
MAXIMUM_SURFACE1D_LAYERED_WIDTH: 32768
MAXIMUM_SURFACE1D_WIDTH: 32768
MAXIMUM_SURFACE2D_HEIGHT: 65536
MAXIMUM_SURFACE2D_LAYERED_HEIGHT: 32768
MAXIMUM_SURFACE2D_LAYERED_LAYERS: 2048
MAXIMUM_SURFACE2D_LAYERED_WIDTH: 32768
MAXIMUM_SURFACE2D_WIDTH: 131072
MAXIMUM_SURFACE3D_DEPTH: 16384
MAXIMUM_SURFACE3D_HEIGHT: 16384
MAXIMUM_SURFACE3D_WIDTH: 16384
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS: 2046
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH: 32768
MAXIMUM_SURFACECUBEMAP_WIDTH: 32768
MAXIMUM_TEXTURE1D_LAYERED_LAYERS: 2048
MAXIMUM_TEXTURE1D_LAYERED_WIDTH: 32768
MAXIMUM_TEXTURE1D_LINEAR_WIDTH: 134217728
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH: 16384
MAXIMUM_TEXTURE1D_WIDTH: 131072
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT: 32768
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES: 2048
MAXIMUM_TEXTURE2D_ARRAY_WIDTH: 32768
MAXIMUM_TEXTURE2D_GATHER_HEIGHT: 32768
MAXIMUM_TEXTURE2D_GATHER_WIDTH: 32768
MAXIMUM_TEXTURE2D_HEIGHT: 65536
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT: 65000
MAXIMUM_TEXTURE2D_LINEAR_PITCH: 2097120
MAXIMUM_TEXTURE2D_LINEAR_WIDTH: 131072
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT: 32768
MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH: 32768
MAXIMUM_TEXTURE2D_WIDTH: 131072
MAXIMUM_TEXTURE3D_DEPTH: 16384
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE: 32768
MAXIMUM_TEXTURE3D_HEIGHT: 16384
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE: 8192
MAXIMUM_TEXTURE3D_WIDTH: 16384
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE: 8192
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS: 2046
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH: 32768
MAXIMUM_TEXTURECUBEMAP_WIDTH: 32768
MAX_BLOCK_DIM_X: 1024
MAX_BLOCK_DIM_Y: 1024
MAX_BLOCK_DIM_Z: 64
MAX_GRID_DIM_X: 2147483647
MAX_GRID_DIM_Y: 65535
MAX_GRID_DIM_Z: 65535
MAX_PITCH: 2147483647
MAX_REGISTERS_PER_BLOCK: 65536
MAX_REGISTERS_PER_MULTIPROCESSOR: 65536
MAX_SHARED_MEMORY_PER_BLOCK: 49152
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
MAX_THREADS_PER_BLOCK: 1024
MAX_THREADS_PER_MULTIPROCESSOR: 2048
MEMORY_CLOCK_RATE: 715000
MULTIPROCESSOR_COUNT: 56
MULTI_GPU_BOARD: 0
MULTI_GPU_BOARD_GROUP_ID: 0
PCI_BUS_ID: 0
PCI_DEVICE_ID: 4
PCI_DOMAIN_ID: 0
STREAM_PRIORITIES_SUPPORTED: 1
SURFACE_ALIGNMENT: 512
TCC_DRIVER: 0
TEXTURE_ALIGNMENT: 512
TEXTURE_PITCH_ALIGNMENT: 32
TOTAL_CONSTANT_MEMORY: 65536
UNIFIED_ADDRESSING: 1
WARP_SIZE: 32
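As a side note, each of the attributes listed above can also be queried individually through get_attribute and the device_attribute enumeration, without dumping the whole dictionary; a minimal sketch:

import pycuda.driver as cuda
import pycuda.autoinit

dev = cuda.Device(0)
# device_attribute enumerates the same keys listed in the dump above
max_threads = dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK)
print("MAX_THREADS_PER_BLOCK: %d" % max_threads)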

In “Five different ways to sum vectors in PyCUDA”, different examples of how to sum arrays in parallel with PyCUDA will be shown.


Written by Vitality Learning

We have been teaching, researching and consulting on parallel programming on Graphics Processing Units (GPUs) since the advent of CUDA. We also play with Matlab and Python.