Using C/C++ and CUDA functions as regular Python functions

Vitality Learning
Nov 1, 2023

This post follows the Running CUDA in Google Colab post, in which we illustrated how Google Colab can be used to run CUDA code. It also follows Five different ways to sum vectors in PyCUDA, in which we showed how to exploit the PyCUDA library to write GPU code in a simple way, employing Python-like primitives and compiling and linking CUDA kernels within an essentially Pythonic program.

Here, we want to show how to compile C/C++ and CUDA functions from the command line into shared libraries, which can then be loaded in Python thanks to the ctypes library and invoked as regular Python functions.

The code for this post is available at the link.

C/C++ functions

Let us begin by exemplifying the integration of C/C++ functionalities into a Python environment through a shared library.

The sumArrays C++ function performs an elementwise summation of two arrays and is written to the sumArrayCPU.cpp file via the %%writefile sumArrayCPU.cpp cell magic. It is declared with the extern "C" syntax to ensure compatibility and linkage across language boundaries, and it is compiled from the command line as shown below.
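What follows is a minimal sketch of what sumArrayCPU.cpp might contain; the function name and the extern "C" linkage are from the text, while the (int *, int *, int *, int) signature is an assumption:

```cpp
// sumArrayCPU.cpp -- sketch; the exact signature is assumed
extern "C" void sumArrays(int *a, int *b, int *c, int n) {
    // Elementwise sum of the two input arrays
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The shared library can then be built, for instance, with g++ (the exact flags are an assumption; -shared and -fPIC are the standard options for producing a shared object):

```bash
g++ -shared -fPIC -o sumArrayCPU.so sumArrayCPU.cpp
```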

Moving to the Python code, the ctypes module is used to load the C++ shared library.
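Loading the library might look as follows (the library path is an assumption consistent with the compile command above):

```python
import ctypes

# Load the shared library built above (path assumed)
lib = ctypes.CDLL("./sumArrayCPU.so")
```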

The lines below define the argument types and return value of the sumArrays function, enabling the interfacing of C++ and Python.
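A sketch consistent with the signature assumed above:

```python
# Declare the prototype: three int pointers and the array length
lib.sumArrays.argtypes = [ctypes.POINTER(ctypes.c_int),
                          ctypes.POINTER(ctypes.c_int),
                          ctypes.POINTER(ctypes.c_int),
                          ctypes.c_int]
lib.sumArrays.restype = None  # the C++ function returns void
```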

The Python function sum_arrays receives Python lists as inputs and performs their conversion into C arrays, leveraging the ctypes mechanism to convert data structures between the two languages:
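A possible implementation (a sketch; the variable names a_c, b_c, and c_c are ours):

```python
def sum_arrays(a, b):
    n = len(a)
    # Convert the Python lists into ctypes int arrays
    a_c = (ctypes.c_int * n)(*a)
    b_c = (ctypes.c_int * n)(*b)
    c_c = (ctypes.c_int * n)()  # output buffer

    # Invoke the C++ function
    lib.sumArrays(a_c, b_c, c_c, n)

    # Convert the C result back into a Python list
    return list(c_c)
```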

Then, it invokes the sumArrays C++ function (the lib.sumArrays(a_c, b_c, c_c, n) line in the sketch above).

The C++ result finally undergoes a transition back into Python (the list(c_c) conversion), so that it can be used as an ordinary Python list:
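A hypothetical usage example:

```python
result = sum_arrays([1, 2, 3], [4, 5, 6])
print(result)  # expected: [5, 7, 9]
```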

CUDA functions without PyCUDA

Let’s now see an example in which a CUDA function is integrated without the use of PyCUDA libraries. The idea is to build a C/C++ wrapper that takes input arrays to operate on, proceeds to allocate space on the GPU, moves data from the CPU to the GPU, invokes the CUDA kernel, and finally transfers the results from the GPU to the CPU, handling memory deallocation.

Let’s now break down the code.

The CUDA C++ code is sketched below. It defines a simple CUDA kernel (my_cuda_kernel) that doubles each element of an input array. The my_cuda_function acts as the mentioned wrapper, allocating device memory, copying data to the device, launching the CUDA kernel, copying the result back, and freeing device memory.
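A minimal sketch of the file, say sumArrayGPU.cu (the kernel and wrapper names are from the text; the signatures and the launch configuration are assumptions):

```cuda
// sumArrayGPU.cu -- sketch; signatures and launch configuration are assumed
#include <cuda_runtime.h>

// Kernel doubling each element of the input array
__global__ void my_cuda_kernel(int *d_array, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_array[i] *= 2;
}

// Wrapper callable from Python: handles device memory and transfers
extern "C" void my_cuda_function(int *h_array, int n) {
    int *d_array;
    cudaMalloc(&d_array, n * sizeof(int));
    cudaMemcpy(d_array, h_array, n * sizeof(int), cudaMemcpyHostToDevice);

    int block = 256;
    int grid  = (n + block - 1) / block;
    my_cuda_kernel<<<grid, block>>>(d_array, n);

    cudaMemcpy(h_array, d_array, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
}
```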

The shell command below compiles the CUDA code into a shared library (sumArrayGPU.so) using the NVIDIA CUDA Compiler (nvcc). It specifies the target architecture (sm_75), the output filename (-o), and the options for creating a shared library.
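A possible form of the command (the source filename sumArrayGPU.cu follows the sketch above; the architecture flag is from the text):

```bash
nvcc -arch=sm_75 -Xcompiler -fPIC -shared -o sumArrayGPU.so sumArrayGPU.cu
```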

Finally, the Python script below loads the compiled CUDA library using ctypes. It defines the prototype of the CUDA function, specifying the argument types and return type. It then prepares the input data, converts the Python list to a ctypes array, calls the CUDA function, and prints the result.
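A sketch (the input data and variable names are assumptions):

```python
import ctypes

# Load the compiled CUDA library (path assumed)
cuda_lib = ctypes.CDLL("./sumArrayGPU.so")

# Prototype of the wrapper: a host int pointer and the array length
cuda_lib.my_cuda_function.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.c_int]
cuda_lib.my_cuda_function.restype = None

# Prepare the input data and convert the Python list to a ctypes array
data = [1, 2, 3, 4]
n = len(data)
data_c = (ctypes.c_int * n)(*data)

# The wrapper doubles the array in place on the GPU
cuda_lib.my_cuda_function(data_c, n)

print(list(data_c))  # expected: [2, 4, 6, 8]
```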

CUDA functions with PyCUDA

Let’s now see an example similar to the previous one, but where memory allocations and transfers are done using the PyCUDA library.

We now go through this last example, highlighting the differences compared to the previous one:

The CUDA C++ code, sketched below, is similar to the previous version, but it now has a more straightforward my_cuda_function that does not handle memory allocations or transfers. These tasks are left to PyCUDA in the Python code.
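A minimal sketch under the same assumptions as before; the wrapper now receives a device pointer and only launches the kernel:

```cuda
#include <cuda_runtime.h>

// Same kernel as before: doubles each element of the array
__global__ void my_cuda_kernel(int *d_array, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_array[i] *= 2;
}

// The wrapper only launches the kernel: allocations and transfers
// are handled by PyCUDA on the Python side
extern "C" void my_cuda_function(int *d_array, int n) {
    int block = 256;
    int grid  = (n + block - 1) / block;
    my_cuda_kernel<<<grid, block>>>(d_array, n);
    cudaDeviceSynchronize();
}
```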

The compilation string is the same as before, producing a shared library for use in Python.

This Python script is similar to the previous one but, as mentioned, with a few key differences (see the sketch after this list):

  1. PyCUDA Integration: it uses PyCUDA for GPU memory management; PyCUDA provides a seamless way to allocate GPU memory and copy data to it (gpuarray.to_gpu) and to work with GPU arrays;
  2. Memory Passing: it passes GPU array pointers to the CUDA function; ctypes.cast is used to convert the raw device pointer of a GPU array (its gpudata attribute) to ctypes.POINTER(ctypes.c_int) for compatibility with the function prototype;
  3. Print GPU Arrays: it prints the GPU arrays directly; PyCUDA GPU arrays can indeed be printed to display their contents.
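A sketch of the script (the input data and variable names are assumptions; pycuda.autoinit creates the CUDA context):

```python
import ctypes
import numpy as np
import pycuda.autoinit  # creates and manages a CUDA context
import pycuda.gpuarray as gpuarray

# Load the shared library and declare the wrapper prototype
cuda_lib = ctypes.CDLL("./sumArrayGPU.so")
cuda_lib.my_cuda_function.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.c_int]
cuda_lib.my_cuda_function.restype = None

# PyCUDA allocates device memory and copies the data in a single call
data_gpu = gpuarray.to_gpu(np.array([1, 2, 3, 4], dtype=np.int32))

# Cast the raw device pointer (gpudata) to the type in the prototype
ptr = ctypes.cast(int(data_gpu.gpudata), ctypes.POINTER(ctypes.c_int))
cuda_lib.my_cuda_function(ptr, data_gpu.size)

# GPU arrays can be printed directly; the data is fetched from the device
print(data_gpu)  # expected: [2 4 6 8]
```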

In summary, using PyCUDA simplifies GPU memory management and seamlessly integrates with the existing CUDA code, making it easier to work with GPU arrays in Python.


Vitality Learning

We have been teaching, researching, and consulting on parallel programming on Graphics Processing Units (GPUs) since the release of CUDA. We also play with Matlab and Python.