Differential Evolution — Part 1 — Theory
An effective and easy-to-build/easy-to-use tool for global optimization.
In many practical applications, it is necessary to solve multimodal optimization problems in which the functional to be minimized has several trapping local minima. From the implementation point of view, and given the hectic times we live in, it is often necessary to set up the optimization algorithm quickly and obtain a result in a short time. We therefore need an optimization algorithm that is simple to implement and easy to use, yet effective at locating the global optimum. Differential Evolution (DE) is one of the algorithms that fits these needs.
This is the first of a two-part post. Here, the theory is summarized with a certain level of detail. In Part 2, a parallel PyCUDA implementation is discussed; PyCUDA is a Python library that allows you to easily graft parallel CUDA code for GPUs into Python scripts.
Differential Evolution (DE) was introduced by Storn and Price in the 1990s [1] and is an evolutionary algorithm aiming at optimizing a cost function by iteratively improving candidate solutions. DE is also referred to as a metaheuristic, since it makes few or no assumptions about the problem being optimized.
DE is a population-based, direct search method in that it maintains a population of candidate solutions and creates a new generation by combining population members according to simple formulae. Each generation keeps the candidate solutions having the best cost functional values.
DE does not use the gradient of the cost functional, which means that the latter is not required to be differentiable. DE can thus be used on optimization problems that are not even continuous. As a second advantage, the DE behavior is controlled by a few control parameters that are also easy to choose. Another advantage of DE is that it shows good convergence properties, i.e., it converges to the same global minimum in consecutive independent runs. Finally, DE can be parallelized. As mentioned, parallelization of DE will be the topic of the second part of this post. Some words about the working principles of DE are now in order.
We denote by

$$f(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}$$

the cost functional to be minimized. The number of optimization variables, or the problem dimension, is D.
We also indicate by

$$\mathbf{x}_j^g, \quad j = 0, 1, \ldots, N_p - 1,$$

the population at generation g, where Np is the number of population members. Notice that the number of population members does not change through the iterations.
Each population member is

$$\mathbf{x}_j^g = \left( x_{j0}^g, x_{j1}^g, \ldots, x_{j(D-1)}^g \right),$$

where we have used the convention that index j addresses the population member while index i addresses the member component. Henceforth, we assume that the number of population members is

$$N_p \geq 4.$$
The functional values at generation g are denoted as

$$f_j^g = f\left(\mathbf{x}_j^g\right),$$

where

$$j = 0, 1, \ldots, N_p - 1.$$
The initial population is chosen randomly and should cover the entire search space. Typically, a uniform probability distribution is assumed for all random parameters of the first generation. If a preliminary solution is available, the initial population can be generated by adding normally distributed random deviations to the preliminary solution.
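A minimal sketch of this uniform initialization in NumPy; the bounds, the problem dimension and the population size below are illustrative placeholders, not values from the text:

```python
import numpy as np

D = 2                                  # problem dimension (illustrative)
Np = 40                                # number of population members (illustrative)
lower = np.array([-5.0, -5.0])         # hypothetical search-space bounds
upper = np.array([5.0, 5.0])

rng = np.random.default_rng(0)

# Uniform random initialization covering the entire search space.
pop = lower + rng.random((Np, D)) * (upper - lower)
```

If a preliminary solution `x0` were available, `pop = x0 + sigma * rng.standard_normal((Np, D))` would realize the normally distributed deviations mentioned above.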
Mutation
In order to create a new generation g+1, the components of each population member are randomly mutated. For the j-th population member, the mutated components are defined according to those of a mutant vector:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_{a_j}^g + F \left( \mathbf{x}_{b_j}^g - \mathbf{x}_{c_j}^g \right),$$

where $a_j$, $b_j$ and $c_j$ are integer random indices, mutually different and different from j,

$$a_j, b_j, c_j \in \{0, 1, \ldots, N_p - 1\},$$

and F is the mutation factor,

$$F \in (0, m],$$

with typically m=2. The mutually different random indices $a_j$, $b_j$ and $c_j$ select three different population members which can be regarded as the parents of the j-th individual at the new generation. The mutation then occurs on the basis of the $a_j$-th population member by perturbing it with a quantity related to the difference between the other two parents $b_j$ and $c_j$, see the figure below.
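The mutation step above can be sketched as follows; the population array `pop`, its sizes and the mutation factor `F` are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
Np, D, F = 40, 2, 0.8                  # illustrative sizes and mutation factor
pop = rng.random((Np, D))              # stand-in population at generation g

mutants = np.empty_like(pop)
for j in range(Np):
    # Three mutually different random indices, all different from j.
    a, b, c = rng.choice([k for k in range(Np) if k != j], size=3, replace=False)
    # Perturb the a-th parent with the scaled difference of the other two.
    mutants[j] = pop[a] + F * (pop[b] - pop[c])
```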
Crossover
Not all the components of a population member are mutated. Once the mutant vector has been defined for the j-th population member, a random variable $r_{ji}^g$ is uniformly drawn in (0,1) for each component i. If the random variable is larger than a crossover constant or crossover probability CR, the corresponding component of the original population member is retained; otherwise, the corresponding component of the mutant vector is taken.
A trial vector $\mathbf{u}_j^{g+1}$ is then defined, whose i-th component is given by:

$$u_{ji}^{g+1} = \begin{cases} v_{ji}^{g+1}, & r_{ji}^g \leq CR, \\ x_{ji}^g, & r_{ji}^g > CR. \end{cases}$$
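The component-wise crossover can be sketched with a vectorized `np.where`; the arrays and the value of `CR` below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Np, D, CR = 40, 2, 0.5                 # illustrative sizes and crossover probability
pop = rng.random((Np, D))              # stand-in members x_j at generation g
mutants = rng.random((Np, D))          # stand-in mutant vectors v_j

# One uniform draw per component: take the mutant component where
# r <= CR, keep the original component otherwise.
r = rng.random((Np, D))
trial = np.where(r <= CR, mutants, pop)
```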
New generation
Finally, to decide whether or not the trial vector should become a member of generation g+1, the trial vector itself is compared with the target vector $\mathbf{x}_j^g$. If

$$f\left(\mathbf{u}_j^{g+1}\right) < f\left(\mathbf{x}_j^g\right),$$

i.e., the trial vector yields a smaller cost functional value than the target vector, then we set

$$\mathbf{x}_j^{g+1} = \mathbf{u}_j^{g+1},$$

otherwise we set

$$\mathbf{x}_j^{g+1} = \mathbf{x}_j^g.$$
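This greedy selection can be sketched as follows; the sphere function is used as a stand-in cost functional, and the arrays are illustrative:

```python
import numpy as np

def cost(x):
    # Illustrative cost functional: the sphere function.
    return np.sum(x**2, axis=-1)

rng = np.random.default_rng(0)
Np, D = 40, 2
pop = rng.random((Np, D))              # stand-in targets x_j
trial = rng.random((Np, D))            # stand-in trial vectors u_j

# Keep, member by member, whichever of target and trial has the smaller cost.
better = cost(trial) < cost(pop)
new_pop = np.where(better[:, None], trial, pop)
```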
Other approaches
Other mutation schemes, different from the above one, have been proposed [2].
The naming convention of the proposed approaches is DE/x/y, where DE stands for Differential Evolution, x denotes the vector to be mutated and y the number of difference vectors considered for the mutation of x. For instance:
DE/rand/1:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_{a_j}^g + F_1 \left( \mathbf{x}_{b_j}^g - \mathbf{x}_{c_j}^g \right)$$

DE/rand to best/1:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_{a_j}^g + F_1 \left( \mathbf{x}_{best}^g - \mathbf{x}_{a_j}^g \right) + F_2 \left( \mathbf{x}_{b_j}^g - \mathbf{x}_{c_j}^g \right)$$

DE/current to best/1:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_j^g + F_1 \left( \mathbf{x}_{best}^g - \mathbf{x}_j^g \right) + F_2 \left( \mathbf{x}_{a_j}^g - \mathbf{x}_{b_j}^g \right)$$

DE/rand/2:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_{a_j}^g + F_1 \left( \mathbf{x}_{b_j}^g - \mathbf{x}_{c_j}^g \right) + F_2 \left( \mathbf{x}_{d_j}^g - \mathbf{x}_{e_j}^g \right)$$

DE/best/2:

$$\mathbf{v}_j^{g+1} = \mathbf{x}_{best}^g + F_1 \left( \mathbf{x}_{a_j}^g - \mathbf{x}_{b_j}^g \right) + F_2 \left( \mathbf{x}_{c_j}^g - \mathbf{x}_{d_j}^g \right)$$

where $\mathbf{x}_{best}^g$ is the best population member at generation g.
In all the above formulas, $a_j$, $b_j$, $c_j$, $d_j$ and $e_j$ are mutually different and different from j. Furthermore, F1 and F2 can be equal to or different from each other.
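As an example, the DE/best/2 mutation can be sketched as below; the population array, its sizes, the mutation factors and the sphere cost used to pick the best member are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Np, D = 40, 2
F1, F2 = 0.8, 0.8                      # illustrative mutation factors
pop = rng.random((Np, D))              # stand-in population at generation g
best = int(np.argmin(np.sum(pop**2, axis=-1)))  # best member (sphere cost)

mutants = np.empty_like(pop)
for j in range(Np):
    # Four mutually different random indices, all different from j.
    a, b, c, d = rng.choice([k for k in range(Np) if k != j],
                            size=4, replace=False)
    # DE/best/2: mutate around the current best with two difference vectors.
    mutants[j] = pop[best] + F1 * (pop[a] - pop[b]) + F2 * (pop[c] - pop[d])
```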
Adaptive DEs have also been proposed, in which the mutation factor F can differ from member to member, that is, F = Fj [3].
In adaptive DEs, CR can likewise depend on the population member, namely, CR = CRj.
The algorithm
The following snippet summarizes the algorithm:
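A minimal Python sketch of the full loop, assuming the DE/rand/1 scheme described above; the `cost` functional, the bounds and the control-parameter values are illustrative, and the `find_minimum` signature (returning the index and cost of the best member) is an assumption:

```python
import numpy as np

def cost(x):
    # Illustrative cost functional (sphere); replace with the problem at hand.
    return np.sum(x**2, axis=-1)

def find_minimum(pop):
    # Return the index and cost value of the best population member.
    f = cost(pop)
    j = int(np.argmin(f))
    return j, f[j]

def differential_evolution(cost, lower, upper, Np=40, F=0.8, CR=0.5,
                           generations=200, seed=0):
    # Minimal DE loop following the steps in the text.
    rng = np.random.default_rng(seed)
    D = lower.size
    pop = lower + rng.random((Np, D)) * (upper - lower)      # initialization
    for _ in range(generations):
        for j in range(Np):
            a, b, c = rng.choice([k for k in range(Np) if k != j],
                                 size=3, replace=False)
            mutant = pop[a] + F * (pop[b] - pop[c])          # mutation
            r = rng.random(D)
            trial = np.where(r <= CR, mutant, pop[j])        # crossover
            if cost(trial) < cost(pop[j]):                   # selection
                pop[j] = trial
    return pop

lower, upper = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
pop = differential_evolution(cost, lower, upper)
j_best, f_best = find_minimum(pop)
```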
The find_minimum function is needed to control the evolution of DE by finding the best population member at each iteration.
References
[1] R. Storn, K. Price, “Differential Evolution — A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces,” J. Global Opt., vol. 11, pp. 341–359, 1997.
[2] M. Georgioudakis, V. Plevris, “A Comparative Study of Differential Evolution Variants in Constrained Structural Optimization,” Frontiers in Built Env., vol. 6, article 102, Jul. 2020.
[3] C. Lin, A. Qing, Q. Feng, “A Comparative Study of Crossover in Differential Evolution,” J. Heuristics, vol. 17, pp. 675–703, 2011.