Differential Evolution — Part 2 — PyCUDA Implementation

An effective and easy-to-build/easy-to-use tool for global optimization.

Vitality Learning
Aug 1, 2022 · 6 min read
Source: https://upload.wikimedia.org/wikipedia/commons/f/f5/AiryAi_Real_Surface.png

We recently published the first part of this post on the Differential Evolution (DE) optimization algorithm. In this second part, we give some details of its PyCUDA implementation.

The code structure and the nomenclature reflect the formalism introduced in the previous post. The PyCUDA code consists of CUDA kernels, each implementing a different step of the algorithm. The only operations delegated to library routines, rather than to custom CUDA kernels, are the generation of random numbers and the determination of the minimum of an array.

To keep the code compact and easy to understand, here and there we will privilege simplicity over advanced optimization.

In the image below, we summarize the steps of the algorithm. Of course, refer to Part 1 for further details.

Algorithm snippet.

The full PyCUDA code is available at our GitHub page.

Let’s proceed in order.

The first step requires an initial random generation of the population. This, in turn, requires the ability to generate random numbers in parallel (because we want to exploit the capabilities of parallel computing for this operation as well). A short premise on this topic is therefore in order.

Parallel random number generation

Truly random numbers are generated by real-world events, like an atom decaying or not within a certain time interval. The best we can do with a deterministic machine, like a computer, is to produce pseudorandom numbers.

Pseudorandom number sequences approximate random number sequences, but:

  1. they repeat themselves after a certain sequence length;
  2. they depend on input parameters, typically a seed; given the same seed, a pseudorandom number generator will always produce the same sequence; although such a property is useful when testing the correctness of an algorithm, it is much less useful in applications.

To mimic randomness, we thus need to randomize the seed. But how do we do that?

The classical way is to hook the seed to the machine time. For example, in C the time(NULL) call, which returns the time since the Epoch (00:00:00 UTC, January 1, 1970) measured in seconds, is commonly used as the seed. With this solution, randomness is obtained from the randomness of the time instant at which the instruction is launched.

In our implementation, however, we will use the Python secrets library.
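
As a minimal illustration (the exact call used in the repository may differ), a seed can be drawn from the operating system's entropy source as follows:

import secrets

# Illustrative sketch: draw a 32-bit seed from the OS entropy source;
# the repository may use a different secrets call or seed width.
seed = secrets.randbits(32)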

To generate random numbers, we will use the CUDA cuRAND library. For a given seed, the overall sequence generated by the employed cuRAND generator has a period greater than 2¹⁹⁰. This sequence is subdivided into a number of independent subsequences, each approximately 2⁶⁷ numbers long.

Different sequences for different seeds.

To distribute the 2¹⁹⁰-period sequence among the various threads and allow the generation of random numbers in parallel, each thread running the parallel random number generation routine has access to a different subsequence only.

Generating random numbers with the cuRAND library requires an initialization, which is done via the curand_init() function. This function initializes a state array, allocated by the user, having the same size as the random array to be generated. The state array stores all the information needed to produce the pseudorandom sequence. Initialization requires a seed, a sequence number selecting the subsequence (generally coinciding with the thread ID) and an offset selecting the starting position within the subsequence.

Once the state is initialized, it is possible to generate random numbers evenly distributed between 0 and 1 thanks to the curand_uniform() function:
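
As a minimal, self-contained PyCUDA sketch (kernel and variable names here are illustrative and not those of the repository), state initialization and uniform generation can be put together as follows:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Each thread initializes its own cuRAND state (one subsequence per thread)
# and draws a single uniform number.
mod = SourceModule("""
#include <curand_kernel.h>

extern "C" __global__ void rand_uniform_GPU(curandState *state, float *out,
                                            unsigned long long seed, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;
    // arguments: seed, subsequence number (= thread index), offset, state
    curand_init(seed, tid, 0, &state[tid]);
    // curand_uniform updates the state and returns a float in (0, 1]
    out[tid] = curand_uniform(&state[tid]);
}
""", no_extern_c=True)

N = 256
state_gpu = drv.mem_alloc(N * 64)   # over-allocated: sizeof(curandState) is implementation-dependent
out_gpu   = drv.mem_alloc(N * 4)    # N float32 values
rand_uniform_GPU = mod.get_function("rand_uniform_GPU")
rand_uniform_GPU(state_gpu, out_gpu, np.uint64(123456), np.int32(N),
                 block=(128, 1, 1), grid=((N + 127) // 128, 1))

out = np.empty(N, dtype=np.float32)
drv.memcpy_dtoh(out, out_gpu)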

As you can see, the curand_uniform() function takes, as input, a pointer to the state so that the state can be updated and uncorrelated numbers can be generated at subsequent calls. For more details, please refer to the cuRAND User’s Manual.

Population initialization

The first step towards optimization with the parallel DE algorithm is the initial random generation of the population. This consists of defining a search domain for the solution, typically a hyperbox, and generating the population members by distributing them uniformly in the search space. Conceptually, this is done by generating random numbers evenly distributed between 0 and 1, as pointed out above, and then “stretching” them to the search domain of interest. Below, we look at the CUDA kernels developed to generate the initial population.
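
A minimal sketch of what they may look like (POP_SIZE, DIM and the kernel names below are illustrative, and the per-member loop over the dimensions is a simplicity-driven choice; see the repository for the actual code):

#define POP_SIZE 128   // number of population members, fixed at compile time
#define DIM        8   // problem dimensionality, fixed at compile time

// cuRAND states, one per population member, kept in a static __device__ array
__device__ curandState devStates[POP_SIZE];

// Initialize the cuRAND states: one subsequence per thread
__global__ void initialize_states_GPU(const unsigned long long seed)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < POP_SIZE) curand_init(seed, j, 0, &devStates[j]);
}

// Generate the initial population uniformly inside the search hyperbox.
// pop is a flattened (POP_SIZE x DIM), row-major array; lo/hi hold the per-dimension bounds.
__global__ void initialize_population_GPU(float * __restrict__ pop,
                                          const float * __restrict__ lo,
                                          const float * __restrict__ hi)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= POP_SIZE) return;
    for (int i = 0; i < DIM; i++) {
        float r = curand_uniform(&devStates[j]);           // r in (0, 1]
        pop[j * DIM + i] = lo[i] + r * (hi[i] - lo[i]);    // "stretch" r to [lo, hi]
    }
}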

The state vector is declared, within the same SourceModule containing all the kernels of the approach, as a static __device__ array whose dimensions are fixed at compile time. See the full code in the GitHub repository.

Initial population evaluation

Once the initial population members have been generated, it is necessary to evaluate how much they cost. For this reason, the evaluation_GPU kernel has been developed; it evaluates the cost functional in parallel on the various members of the population:
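
In sketch form (reusing the POP_SIZE and DIM constants introduced above and calling a __device__ cost function, discussed next; the repository version may differ in the details):

__global__ void evaluation_GPU(const float * __restrict__ pop, float * __restrict__ cost)
{
    // one thread per population member: evaluate the cost of the j-th member
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < POP_SIZE) cost[j] = functional(&pop[j * DIM]);
}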

The cost functional is coded through a __device__ function:
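
A minimal sketch of such a function, with one benchmark functional active and another left as a comment (the actual code collects several more):

__device__ float functional(const float *x)
{
    float val = 0.0f;

    // Sphere function: f(x) = sum_i x_i^2, global minimum 0 at the origin
    for (int i = 0; i < DIM; i++) val += x[i] * x[i];

    // Rastrigin function (commented out): f(x) = 10 * DIM + sum_i (x_i^2 - 10 cos(2 pi x_i))
    // val = 10.0f * DIM;
    // for (int i = 0; i < DIM; i++)
    //     val += x[i] * x[i] - 10.0f * cosf(2.0f * 3.14159265f * x[i]);

    return val;
}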

In the full code, various functionals often used to evaluate the performance of global optimization algorithms are reported and commented on. These can obviously be used also by the user to test the algorithm at hand, i.e., to verify how the parallel DE algorithm behaves in different situations.

Generation of mutation indices

Once the initial population assessment is done, the iterations begin. The first function we encounter is the generate_mutation_indices_GPU function. This is necessary to generate the indices aj, bj and cj used to carry out the subsequent crossover, by defining a mutant vector and creating a new population (see Part 1).
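
In sketch form (again reusing the devStates array and the POP_SIZE constant introduced above; the repository version may differ):

__global__ void generate_mutation_indices_GPU(int * __restrict__ mutation)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= POP_SIZE) return;

    int a, b, c;
    // draw until the three indices are mutually distinct and different from j;
    // curand_uniform returns a value in (0, 1], hence the extra >= POP_SIZE guard
    do { a = (int)(curand_uniform(&devStates[j]) * POP_SIZE); } while (a == j || a >= POP_SIZE);
    do { b = (int)(curand_uniform(&devStates[j]) * POP_SIZE); } while (b == j || b == a || b >= POP_SIZE);
    do { c = (int)(curand_uniform(&devStates[j]) * POP_SIZE); } while (c == j || c == a || c == b || c >= POP_SIZE);

    // store as one row of a (POP_SIZE x 3) index array
    mutation[j * 3]     = a;
    mutation[j * 3 + 1] = b;
    mutation[j * 3 + 2] = c;
}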

As can be seen, the three indices aj, bj and cj are randomly generated using the curand_uniform function. Mutual exclusivity is achieved through a do loop. Please note that this choice may not be the most convenient from a computational point of view and is dictated only by considerations of simplicity. The mutation indices are then stored inside a two-dimensional array with 3 columns.

Generation of crossover values

Once the mutation indices have been generated, it is necessary to generate the crossover values in order to physically perform the crossover. This is implemented with the very simple generate_crossover_values_GPU kernel and doesn’t require much comment:
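
A sketch of what it may look like (illustrative, with the same conventions as above):

__global__ void generate_crossover_values_GPU(float * __restrict__ crossvals)
{
    // one thread per population member, one uniform value per (member, dimension) pair;
    // crossvals is a flattened (POP_SIZE x DIM), row-major array
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= POP_SIZE) return;
    for (int i = 0; i < DIM; i++)
        crossvals[j * DIM + i] = curand_uniform(&devStates[j]);
}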

Once again, the curand_uniform function is used.

Generation of the trial vectors

Generating trial vectors is also relatively simple. This operation is implemented by the following kernel:
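
A minimal sketch of one of the possible variants (binomial crossover of the DE/rand/1 mutant with the current member; the handling of the mandatory crossover component and the other variants discussed in Part 1 are omitted here, and the parameter names are illustrative):

__global__ void generate_trial_vectors_GPU(float * __restrict__ trial,
                                           const float * __restrict__ pop,
                                           const int   * __restrict__ mutation,
                                           const float * __restrict__ crossvals,
                                           const float * __restrict__ lo,
                                           const float * __restrict__ hi,
                                           const float F, const float CR)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // problem dimension
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // population member
    if (i >= DIM || j >= POP_SIZE) return;

    int a = mutation[j * 3], b = mutation[j * 3 + 1], c = mutation[j * 3 + 2];

    // binomial crossover between the mutant vector and the current member
    float u = (crossvals[j * DIM + i] <= CR) ?
              pop[a * DIM + i] + F * (pop[b * DIM + i] - pop[c * DIM + i]) :
              pop[j * DIM + i];

    // "saturate" the trial component to the search domain
    if (u < lo[i]) u = lo[i];
    if (u > hi[i]) u = hi[i];

    trial[j * DIM + i] = u;
}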

As you can see, a two-dimensional grid of threads is launched in which the y dimension runs over the population elements, while the x dimension runs over the problem dimensions. The trial vectors are then generated using the previously drawn crossover values as a test. The last operations concern the “saturation” of the trial vectors to the search domain of interest.

The kernel reports different options for the generation of the trial vectors, corresponding to the different possibilities identified in Part 1 of this post.

Selection of the new population

The penultimate step consists of comparing the trial vectors with the corresponding population elements of the previous generation. If the functional of the trial vector is lower, the trial vector is selected for the next generation; otherwise, the population member of the previous generation is kept. Again, the CUDA kernel that implements this step is relatively simple:
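
A sketch of what it may look like (trialCost is assumed to hold the functional values of the trial vectors, computed with the evaluation kernel above):

__global__ void selection_GPU(float * __restrict__ pop, float * __restrict__ cost,
                              const float * __restrict__ trial,
                              const float * __restrict__ trialCost)
{
    // one thread per population member: keep the cheaper between the trial vector
    // and the corresponding member of the previous generation
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= POP_SIZE) return;

    if (trialCost[j] < cost[j]) {
        for (int i = 0; i < DIM; i++) pop[j * DIM + i] = trial[j * DIM + i];
        cost[j] = trialCost[j];
    }
}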

Determining the best population member

The last operation performed inside the loop is the determination of the best element of the population at the current iterative step. This is done through the cublasIsamin routine of the cuBLAS library, invoked thanks to the wrapper offered by the Python skcuda (scikit-cuda) library.
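
A minimal host-side sketch of this step (variable names are illustrative; in the actual code the cost array is the one produced by the evaluation kernel):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.cublas as cublas

# Illustrative cost array; in the real code this holds the functional values of the population
cost_gpu = gpuarray.to_gpu(np.random.rand(128).astype(np.float32))

handle = cublas.cublasCreate()
# cuBLAS returns the (1-based) index of the minimum-magnitude element;
# scikit-cuda's wrapper converts it to a 0-based index (check your version)
best_idx = cublas.cublasIsamin(handle, cost_gpu.size, cost_gpu.gpudata, 1)
cublas.cublasDestroy(handle)

best_cost = cost_gpu.get()[best_idx]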

References

P. Krömer, V. Snasel, A. Abraham, “Many-threaded implementation of differential evolution for the CUDA platform,” Proc. of the 13th Annual Conf. on Genetic and Evolutionary Comp. (GECCO11), Jul. 2011, pp. 1595–1602.

L. de P. Veronese, R.A. Krohling, “Differential evolution algorithm on the GPU with C-CUDA,” Proc. of the IEEE Congress on Evolutionary Comp., Jul. 18–23, 2010, pp. 1–7.

F. Fabris, R.A. Krohling, “A co-evolutionary differential evolution algorithm for solving min–max optimization problems implemented on GPU using C-CUDA,” Expert Syst. with Appl., vol. 39, n. 12, Sept. 15, 2012, pp. 10324–10333.

A.K. Qin, F. Raimondo, F. Forbes, Y.S. Ong, “An improved CUDA-based implementation of differential evolution on GPU,” Proc. of the 14th Annual Conf. on Genetic and Evolutionary Comp. (GECCO12), Jul. 2012, pp. 991–998.

W. Zhu, “Massively parallel differential evolution — pattern search optimization with graphics hardware acceleration: an investigation on bound constrained optimization problems,” J. Glob. Optim., vol. 50, 2011, pp. 417–437.

W. Zhu, Y. Li, “GPU-Accelerated Differential Evolutionary Markov Chain Monte Carlo Method for Multi-Objective Optimization over Continuous Space,” Proc. of the 2nd Workshop on Bio-Insp. Algorithms for Distribut. Syst. (BADS10), Jun. 2010, pp. 1–8.

P. Krömer, J. Platoš, V. Snášel, A. Abraham, “Many-Threaded Differential Evolution on the GPU,” Massively Parallel Evolutionary Computation on GPGPUs, Springer, pp. 121–147.


Vitality Learning

We have been teaching, researching and consulting on parallel programming on Graphics Processing Units (GPUs) since the release of CUDA. We also play with Matlab and Python.