The 1D fftshift in CUDA by chessboard multiplication
The convention typically adopted by FFT libraries in the 1D case is that the 0 frequency is positioned at the first element of the output sequence. The successive elements correspond to the increasing "positive" frequencies and, finally, the last elements of the output array contain the increasing negative frequencies. In many cases, however, it is necessary to reposition the 0 frequency at the center of the transformed sequence. In this case, one needs to operate the so-called `fftshift`.
Most common programming environments, such as Matlab and Python, have `fftshift` routines. In Python, an `fftshift` routine is contained in the `numpy` library.
In the case of parallel programming on GPUs carried out using the CUDA language, `fftshift` routines are not available, so it is necessary to build a handcrafted one.
The `fftshift` implicitly involves memory movements. However, memory operations on graphics cards are notoriously slow. Nevertheless, thanks to the properties of the Discrete Fourier Transform (DFT), it is possible to obtain the same result as an `fftshift` performed with memory movements through floating-point operations alone.
The purpose of this post is to compare routines implemented following the two possible approaches, memory movements and floating-point operations, showing how the latter is more convenient.
Below we illustrate what an `fftshift` is, how it can be implemented following the two approaches above, and the results of the comparison.
What is fftshift?
`fftshift` rearranges a multidimensional discrete Fourier transform, represented by a multidimensional array `X`, by shifting the zero-frequency component to the center of `X`.
The following figure illustrates the `fftshift` in one dimension.
The 1D fftshift in CUDA
The code at FFTShift_1D.cu compares 5 different ways to perform the `fftshift` in the case when the number of elements `N` is even.
1D CUDA fftshift using memory movements
The first two approaches use memory movements, as in
[1] https://github.com/marwan-abdellah/cufftShift.
[2] M. Abdellah, S. Saleh, A Eldeib, A. Shaarawi, “High performance multi-dimensional (2D/3D) FFT-shift implementation on Graphics Processing Units (GPUs)”, Proc. of the 6th Cairo Int. Biomed. Eng. Conf., Cairo, Egypt, Dec. 20–21, 2012, pp. 171–174.
In particular, the first of the memory-movement approaches uses in-place operations, while the second exploits out-of-place operations. The only computational difference between the two is that the in-place operations need a temporary variable to swap the two halves of the DFT sequence, while the out-of-place operations do not.
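As a rough host-side C++ sketch of the two memory-movement variants (the function names are ours; the actual CUDA kernels of [1] parallelize the same swaps and copies over threads):

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place fftshift for even N: swap the two halves of the sequence.
// Each swap needs a temporary, here hidden inside std::swap.
void fftshift_inplace(std::vector<std::complex<double>>& x) {
    const std::size_t half = x.size() / 2;
    for (std::size_t i = 0; i < half; ++i)
        std::swap(x[i], x[i + half]);
}

// Out-of-place fftshift: each output element is a plain copy,
// so no temporary swap variable is needed.
std::vector<std::complex<double>> fftshift_outofplace(
        const std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size(), half = n / 2;
    std::vector<std::complex<double>> y(n);
    for (std::size_t i = 0; i < n; ++i)
        y[i] = x[(i + half) % n];
    return y;
}
```

The in-place version avoids allocating a second array at the cost of the per-swap temporary.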
1D CUDA fftshift using chessboard multiplication
The other three approaches are based on the following observation.
The DFT relation is given by

X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N}, \quad k = 0, \ldots, N-1.

The DFT is thus a periodic sequence of period equal to N and has the zero frequency at the k=0 index. Furthermore, by the shift (modulation) property of the DFT,

\sum_{n=0}^{N-1} x(n) e^{j 2\pi M n / N} e^{-j 2\pi k n / N} = X(k-M).

Accordingly, if we want to move the zero frequency to k=N/2, then we have to choose M=N/2, so that

e^{j 2\pi (N/2) n / N} = e^{j \pi n} = (-1)^n.
In other words, the zero frequency can be moved to the middle of the sequence through a multiplication of the signal by a chessboard of 1s and -1s in the case when the signal has an even number of elements.
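The equivalence can be checked numerically with a naive host-side DFT. The following C++ sketch uses function names of our own choosing and is meant only to verify the identity, not to be fast:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(N^2) DFT: X(k) = sum_n x(n) exp(-j 2 pi k n / N).
std::vector<cd> dft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * pi * double(k * n) / double(N));
    return X;
}

// Chessboard trick for even N: flipping the sign of the odd samples
// yields the DFT with the zero frequency moved to k = N/2.
std::vector<cd> dft_centered_even(std::vector<cd> x) {
    for (std::size_t n = 1; n < x.size(); n += 2) x[n] = -x[n];
    return dft(x);
}
```

For every k, `dft_centered_even(x)[k]` matches `dft(x)[(k + N/2) % N]`, i.e. the chessboard multiplication reproduces the `fftshift` of the spectrum without any memory movement.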
It should be noticed that, in the case of an odd number of elements, M should be chosen equal to (N-1)/2, so that

e^{j 2\pi M n / N} = e^{j \pi n (N-1) / N},

which is no longer a sequence of plus and minus ones. Accordingly, in this case, a full multiplication by a complex exponential is needed.
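A host-side C++ sketch of the odd-N case (again with names of our own choosing; the sign of the exponent is the convention that places the zero frequency at the middle index k = (N-1)/2):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(N^2) DFT, used only to verify the identity.
std::vector<cd> dft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * pi * double(k * n) / double(N));
    return X;
}

// Odd N: multiply each sample by exp(+j 2 pi M n / N) with M = (N-1)/2,
// which shifts the zero frequency to the middle index k = (N-1)/2.
std::vector<cd> dft_centered_odd(std::vector<cd> x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    const double M = double(N - 1) / 2.0;
    for (std::size_t n = 0; n < N; ++n)
        x[n] *= std::polar(1.0, 2.0 * pi * M * double(n) / double(N));
    return dft(x);
}
```

Here the modulating factor is a genuine complex exponential, so the load step costs a complex multiplication per element instead of a sign flip.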
The chessboard multiplication is performed by using cuFFT callback routines.
cuFFT callback routines
Quoting the cuFFT library User’s guide,
Callback routines are user-supplied kernel routines that cuFFT will call when loading or storing data. They allow the user to do data pre- or post- processing without additional kernel calls.
Kernel calls indeed have a cost, as underlined in this post, and avoiding them as much as possible is a recommendable programming habit.
Notice that, quoting the static library section of the cuFFT library User's guide:
"For cuFFT and cufftw in version 9.0 or later any supported architecture can be used to do the device linking."
Static cuFFT compilation command:
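As a sketch (the exact flags should be checked against the cuFFT User's guide for your CUDA version), static linking looks along these lines:

```shell
# Link against the static cuFFT library and its culibos dependency
# (sketch; verify against the cuFFT User's Guide for your toolkit version).
nvcc myapp.cu -lcufft_static -lculibos -o myapp
```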
Since the chessboard multiplication is needed before the FFT, the callback routine is of the load type.
Note that, again quoting the cuFFT library User's guide, the parameters for all of the `__device__` load callbacks are defined as below:
- `offset`: offset of the input element from the start of output data. This is not a byte offset; rather, it is the number of elements from the start of data.
- `dataIn`: device pointer to the start of the input array that was passed in the `cufftExecute` call.
- `callerInfo`: device pointer to the optional caller-specified data passed in the `cufftXtSetCallback` call.
- `sharedPointer`: pointer to shared memory, valid only if the user has called `cufftXtSetCallbackSharedSize()`.
`callerInfo` and `sharedPointer` are not used in the considered example.
Performance of the 5 versions of the 1D CUDA fftshift
Two versions of the `fftshift` following the memory-movements approach, returning in-place and out-of-place results, and three versions using different ways of implementing the chessboard multiplication are now compared. The comparison refers to the case of an even number `N` of elements.
The CUDA code, executed within a Google Colab notebook, is available at our GitHub page.
The testing has been performed on both an old card (GTX 980, compute capability 5.2) and a newer card (Tesla T4, compute capability 7.5). In the following tables, the test results are shown (timing in ms):
As can be seen, a performance gain of about 15% can be achieved by the chessboard multiplication approach over the memory-movements one. Note that the second version of the chessboard multiplication approach is very slow due to the use of the `pow` function.
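The gap can be appreciated already on the host side: `pow` goes through a general power routine, while the chessboard sign only needs a parity test. A small C++ sketch, with names of our own choosing:

```cpp
#include <cmath>

// Chessboard sign (-1)^n computed two ways.
double sign_pow(int n)    { return std::pow(-1.0, double(n)); } // general power routine
double sign_parity(int n) { return (n & 1) ? -1.0 : 1.0; }      // a couple of integer ops
```

Both return the same values, but the parity test avoids the transcendental-function machinery entirely.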
PyCUDA version
In order to complete the discussion, a PyCUDA implementation of the `fftshift` in one dimension is also proposed on our GitHub page. Such an implementation exploits one of the two best performing CUDA kernels examined above, implementing the `fftshift` by the chessboard multiplication for the case of an even number of elements. In such an implementation, the case of an odd number of elements is also handled by a purposely devised CUDA kernel based on the above calculations.