Working out CUDA Just-In-Time (JIT) compilation

Vitality Learning
2 min read · Apr 16, 2021

The CUDA workflow followed by many programmers consists of distributing the code across various .cpp and .cu files: the .cu files contain the __global__ functions, while the .cpp files contain the allocations of GPU memory performed by cudaMalloc, the memory movements from host to device and vice versa performed by cudaMemcpy, and the __global__ function invocations executed with the

<<<...>>>

syntax. These files are compiled and then executed.

In some cases, it is convenient or necessary to compile the __global__ functions at run-time, using so-called Just-In-Time (JIT) compilation, instead of doing it in advance at compile-time. JIT compilation can improve performance, since run-time compilation occurs at a moment when all the hardware information is available. It is necessary in applications, like OptiX, that require the __global__ functions performing the ray tracing to be provided in PTX (Parallel Thread eXecution), a parallel GPU language closer to machine code. In particular, such applications require, as input, the PTX code in a character string.

To illustrate how JIT compilation can be implemented in CUDA with the NVIDIA Runtime Compilation (NVRTC) library, we present a code in which two __global__ functions, namely kernel1 and kernel2, are defined as character strings. They can be defined either as properly initialized arrays of char or by loading the strings from an ASCII file.

The compilation and the return of a string containing the PTX translation of the __global__ functions occur through the following NVRTC functions:

- nvrtcCreateProgram creates a “program” by associating it with the string containing the kernels to be compiled;
- nvrtcCompileProgram performs the JIT compilation;
- nvrtcGetProgramLogSize determines the size of the compilation log, so that a char array can be defined and the log assigned to it;
- nvrtcGetProgramLog assigns the log to the char array;
- nvrtcGetPTXSize determines the size of the PTX code, so that a char array can be defined and the PTX code assigned to it;
- nvrtcGetPTX assigns the PTX code to the char array;
- nvrtcDestroyProgram destroys the “program”.
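The NVRTC call sequence can be sketched as follows. This is an illustrative program, not the article's exact code; it assumes the CUDA toolkit is installed (compile with -lnvrtc) and uses a one-kernel source string for brevity:

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>

// Abort on any NVRTC error, printing its description.
#define NVRTC_CHECK(call)                                              \
    do {                                                               \
        nvrtcResult res_ = (call);                                     \
        if (res_ != NVRTC_SUCCESS) {                                   \
            fprintf(stderr, "NVRTC error: %s\n",                       \
                    nvrtcGetErrorString(res_));                        \
            exit(1);                                                   \
        }                                                              \
    } while (0)

int main() {
    const char *source =
        "extern \"C\" __global__ void kernel1(float *d) {"
        "    d[threadIdx.x] = (float)threadIdx.x;"
        "}";

    // Create the "program" from the kernel string.
    nvrtcProgram prog;
    NVRTC_CHECK(nvrtcCreateProgram(&prog, source, "kernel1.cu",
                                   0, NULL, NULL));

    // Perform the JIT compilation (the architecture option is illustrative).
    const char *opts[] = { "--gpu-architecture=compute_52" };
    nvrtcResult compileRes = nvrtcCompileProgram(prog, 1, opts);

    // Size the log, allocate a char array, and retrieve it.
    size_t logSize;
    NVRTC_CHECK(nvrtcGetProgramLogSize(prog, &logSize));
    char *log = (char *)malloc(logSize);
    NVRTC_CHECK(nvrtcGetProgramLog(prog, log));
    printf("compilation log: %s\n", log);
    free(log);
    if (compileRes != NVRTC_SUCCESS) exit(1);

    // Size the PTX, allocate a char array, and retrieve the PTX string.
    size_t ptxSize;
    NVRTC_CHECK(nvrtcGetPTXSize(prog, &ptxSize));
    char *ptx = (char *)malloc(ptxSize);
    NVRTC_CHECK(nvrtcGetPTX(prog, ptx));

    // Destroy the "program"; the PTX string remains for the driver API.
    NVRTC_CHECK(nvrtcDestroyProgram(&prog));
    printf("%s", ptx);
    free(ptx);
    return 0;
}
```

Note the two-step size/get pattern for both the log and the PTX: the caller owns the buffers, so their sizes must be queried first.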

The generated PTX string is then used by the low-level CUDA driver.

To use the low-level CUDA driver, it is necessary to initialize the GPU with cuInit and select the device to be used with cuDeviceGet. Afterwards, it is necessary to create a context with cuCtxCreate, load the PTX code into a module with cuModuleLoadDataEx and extract the kernels to be used in the computation with cuModuleGetFunction.
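These setup steps can be sketched as below. This is an assumed fragment (link with -lcuda), where ptxString holds the PTX produced by NVRTC and the kernel was compiled as extern "C" so its unmangled name "kernel1" can be looked up:

```cpp
#include <cuda.h>
#include <stddef.h>

// Initialize the driver API, load a PTX string into a module and
// retrieve a kernel from it. Error checking is omitted for brevity.
CUfunction loadKernel1(const char *ptxString) {
    CUdevice  device;
    CUcontext context;
    CUmodule  module;
    CUfunction kernel1;

    cuInit(0);                              // initialize the driver API
    cuDeviceGet(&device, 0);                // select device 0
    cuCtxCreate(&context, 0, device);       // create a context on it
    // Load the PTX string into a module; the last three arguments
    // could pass JIT options, none are used here.
    cuModuleLoadDataEx(&module, ptxString, 0, NULL, NULL);
    // Extract the kernel by its (unmangled) name.
    cuModuleGetFunction(&kernel1, module, "kernel1");
    return kernel1;
}
```

In production code each cu* call returns a CUresult that should be checked before proceeding.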

When using the low-level CUDA driver, the memory allocations are performed by cuMemAlloc instead of cudaMalloc, the memory movements by cuMemcpyHtoD or cuMemcpyDtoH instead of cudaMemcpy, the kernel executions by cuLaunchKernel instead of the <<<...>>> syntax and the memory deallocations by cuMemFree instead of cudaFree.
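The driver-API counterparts of the runtime calls can be sketched as follows. This is an illustrative helper, assuming kernel1 was obtained via cuModuleGetFunction and takes (float *d_out, const float *d_in, int N) as in the earlier hypothetical example:

```cpp
#include <cuda.h>
#include <stddef.h>

void runKernel1(CUfunction kernel1, const float *h_in, float *h_out, int N) {
    CUdeviceptr d_in, d_out;
    size_t bytes = (size_t)N * sizeof(float);

    cuMemAlloc(&d_in, bytes);                // instead of cudaMalloc
    cuMemAlloc(&d_out, bytes);
    cuMemcpyHtoD(d_in, h_in, bytes);         // instead of cudaMemcpy (H->D)

    // Kernel arguments are passed as an array of pointers to each argument.
    void *args[] = { &d_out, &d_in, &N };
    int block = 256, grid = (N + block - 1) / block;
    cuLaunchKernel(kernel1,                  // instead of <<<...>>>
                   grid, 1, 1,               // grid dimensions
                   block, 1, 1,              // block dimensions
                   0, NULL,                  // shared memory bytes, stream
                   args, NULL);
    cuCtxSynchronize();                      // wait for the kernel to finish

    cuMemcpyDtoH(h_out, d_out, bytes);       // instead of cudaMemcpy (D->H)
    cuMemFree(d_in);                         // instead of cudaFree
    cuMemFree(d_out);
}
```

Note how the launch configuration that the <<<...>>> syntax expresses implicitly becomes explicit grid and block arguments to cuLaunchKernel.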


Vitality Learning

We have been teaching, researching and consulting on parallel programming on Graphics Processing Units (GPUs) since the delivery of CUDA. We also work with Matlab and Python.