Working out CUDA Just-In-Time (JIT) compilation

The CUDA workflow followed by many programmers consists of splitting the code across various .cpp and .cu files: the .cu files contain the __global__ functions, while the .cpp files contain the GPU memory allocations performed by cudaMalloc, the memory transfers from host to device and vice versa performed by cudaMemcpy, and the __global__ function invocations expressed with the

<<<.,.>>>

syntax. These files are compiled in advance and then executed.

In some cases, it is convenient or necessary to compile the __global__ functions at run-time, using so-called Just-In-Time (JIT) compilation, instead of doing so in advance at compile-time. JIT compilation can improve performance, since run-time compilation occurs at a moment when all the hardware information is available. It is necessary in applications, like OptiX, that require the __global__ functions performing the ray tracing to be provided in PTX (Parallel Thread eXecution), a low-level parallel GPU language closer to machine code. In particular, such applications require, as input, the PTX code in a character string.

To illustrate how JIT compilation can be implemented in CUDA with the NVIDIA Runtime Compilation (NVRTC) library, we present a code in which two __global__ functions, namely kernel1 and kernel2, are defined as character strings. They can be defined either by properly initialized char arrays or by loading the strings from an ASCII file.

The compilation and the retrieval of a string containing the PTX translation of the __global__ functions are carried out by the following NVRTC functions: nvrtcCreateProgram (creates a “program” by associating it with the string containing the kernels to be compiled), nvrtcCompileProgram (performs the JIT compilation), nvrtcGetProgramLogSize (returns the size of the compilation log, so that a char array can be allocated and the log assigned to it), nvrtcGetProgramLog (assigns the log to the char array), nvrtcGetPTXSize (returns the size of the PTX code, so that a char array can be allocated and the PTX code assigned to it), nvrtcGetPTX (assigns the PTX code to the char array) and nvrtcDestroyProgram (destroys the “program”).
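The sequence of NVRTC calls above can be sketched as follows; this is a minimal illustration, not the exact code of the post: error checking is omitted, and the kernelString variable and the compute-capability option are assumptions.

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>

// kernelString: char array holding the __global__ functions (assumed defined elsewhere).
// Returns the PTX translation as a heap-allocated character string.
char *compileToPTX(const char *kernelString)
{
    nvrtcProgram prog;
    // Create the "program" from the kernel source string.
    nvrtcCreateProgram(&prog, kernelString, "kernels.cu", 0, NULL, NULL);

    // Perform the JIT compilation (the architecture option is illustrative).
    const char *opts[] = { "--gpu-architecture=compute_52" };
    nvrtcCompileProgram(prog, 1, opts);

    // Retrieve and print the compilation log.
    size_t logSize;
    nvrtcGetProgramLogSize(prog, &logSize);
    char *log = (char *)malloc(logSize);
    nvrtcGetProgramLog(prog, log);
    printf("%s\n", log);
    free(log);

    // Retrieve the PTX code in a character string.
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char *ptx = (char *)malloc(ptxSize);
    nvrtcGetPTX(prog, ptx);

    // Destroy the "program".
    nvrtcDestroyProgram(&prog);
    return ptx;
}
```

The log is worth printing even on success, since NVRTC reports warnings there as well.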

The generated PTX string is then consumed by the low-level CUDA driver API.

To use the CUDA driver API, it is first necessary to initialize the GPU with cuInit and select the device to be used with cuDeviceGet. Then, one creates a context with cuCtxCreate, loads the PTX code into a module with cuModuleLoadDataEx and extracts the kernels to be used in the computation with cuModuleGetFunction.
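A minimal sketch of these driver API steps might look as follows (error checking is omitted; the ptx string is the output of the NVRTC stage, and the kernel name "kernel1" is an assumption for illustration):

```cpp
#include <cuda.h>

// ptx: the character string produced by nvrtcGetPTX.
CUfunction loadKernelFromPTX(const char *ptx)
{
    cuInit(0);                                        // initialize the driver API
    CUdevice dev;
    cuDeviceGet(&dev, 0);                             // select device 0
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);                        // create a context on that device
    CUmodule module;
    cuModuleLoadDataEx(&module, ptx, 0, NULL, NULL);  // load the PTX code into a module
    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "kernel1");  // extract the kernel by name
    return kernel;
}
```

Note that cuModuleGetFunction looks the kernel up by its PTX entry name, which is why declaring the kernel extern "C" in the source string is convenient.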

When using the CUDA driver API, memory allocations are performed by cuMemAlloc instead of cudaMalloc, memory transfers by cuMemcpyHtoD or cuMemcpyDtoH instead of cudaMemcpy, kernel launches by cuLaunchKernel instead of the <<<.,.>>> syntax, and memory is freed by cuMemFree instead of cudaFree.
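Putting these driver API counterparts together, a sketch of the allocation, transfer, launch and deallocation steps might read as follows (the kernel signature, array size and launch configuration are illustrative assumptions):

```cpp
#include <cuda.h>

// kernel: a CUfunction extracted with cuModuleGetFunction;
// h_data: a host array of N floats. Both are assumed for illustration.
void runKernel(CUfunction kernel, float *h_data, int N)
{
    CUdeviceptr d_data;
    cuMemAlloc(&d_data, N * sizeof(float));            // instead of cudaMalloc
    cuMemcpyHtoD(d_data, h_data, N * sizeof(float));   // instead of cudaMemcpy (H -> D)

    void *args[] = { &d_data, &N };                    // kernel arguments, by address
    int block = 256, grid = (N + block - 1) / block;
    cuLaunchKernel(kernel, grid, 1, 1,                 // instead of the <<<.,.>>> syntax
                   block, 1, 1, 0, NULL, args, NULL);

    cuMemcpyDtoH(h_data, d_data, N * sizeof(float));   // instead of cudaMemcpy (D -> H)
    cuMemFree(d_data);                                 // instead of cudaFree
}
```

Unlike the <<<.,.>>> syntax, cuLaunchKernel takes the kernel arguments as an array of pointers to the argument values.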


We have been teaching, researching and consulting on parallel programming on Graphics Processing Units (GPUs) since the release of CUDA. We also work with Matlab and Python.

Vitality Learning
