The 1D fftshift in CUDA by chessboard multiplication

Vitality Learning
Jul 8, 2022 · 5 min read

The convention typically adopted by FFT libraries in the 1D case is that the zero frequency is positioned at the first element of the output sequence. The successive elements correspond to the increasing “positive” frequencies and, finally, the last elements of the output array contain the negative frequencies, ordered from the most negative towards zero. In many cases, however, it is necessary to reposition the zero frequency at the center of the transformed sequence. In this case, one needs to perform the so-called fftshift.

Most common programming environments, such as Matlab and Python, have fftshift routines. A Python fftshift routine is contained in the numpy library.

In the case of parallel programming on GPUs carried out using the CUDA language, fftshift routines are not available, so it is necessary to build a handcrafted one.

The fftshift implicitly involves memory movements. However, memory operations on graphics cards are notoriously slow. Nevertheless, thanks to the properties of the Discrete Fourier Transform (DFT), it is possible to obtain the same result as an fftshift performed with memory movements by using floating point operations instead.

The purpose of this post is to compare routines implemented following the two possible approaches, memory movements and floating point operations, and to show that the latter is more convenient.

Below we illustrate what an fftshift is, how it can be implemented following the two approaches above, and the results of the comparison.

What is fftshift?

fftshift rearranges a multidimensional discrete Fourier transform, represented by a multidimensional array X, by shifting the zero-frequency component to the center of X.

The following Figure illustrates the fftshift in one dimension.

The 1D fftshift.

The 1D fftshift in CUDA

The code at FFTShift_1D.cu compares 5 different ways to perform the fftshift in the case when the number of elements N is even.

1D CUDA fftshift using memory movements

The first two approaches use memory movements, as in

[1] https://github.com/marwan-abdellah/cufftShift.
[2] M. Abdellah, S. Saleh, A. Eldeib, A. Shaarawi, “High performance multi-dimensional (2D/3D) FFT-shift implementation on Graphics Processing Units (GPUs)”, Proc. of the 6th Cairo Int. Biomed. Eng. Conf., Cairo, Egypt, Dec. 20–21, 2012, pp. 171–174.

In particular, the first of the memory-movement approaches uses in-place operations, while the second exploits out-of-place operations. The only computational difference between the two is that the in-place operations need a temporary variable to swap the two halves of the DFT sequence, while the out-of-place operations do not.
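As a rough illustration, minimal sketches of the two variants could look as follows (kernel names and details are illustrative rather than the exact code at FFTShift_1D.cu; an even N and one thread per swapped pair are assumed):

```
#include <cufft.h>

// In-place variant: a temporary variable is needed to swap the two halves.
__global__ void fftshiftInPlace(cufftComplex * __restrict__ d_data, const int N)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N / 2) {
        const cufftComplex temp = d_data[tid];
        d_data[tid]         = d_data[tid + N / 2];
        d_data[tid + N / 2] = temp;
    }
}

// Out-of-place variant: no temporary variable, but a second array is required.
__global__ void fftshiftOutOfPlace(const cufftComplex * __restrict__ d_in,
                                   cufftComplex * __restrict__ d_out, const int N)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N / 2) {
        d_out[tid]         = d_in[tid + N / 2];
        d_out[tid + N / 2] = d_in[tid];
    }
}
```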

1D CUDA fftshift using chessboard multiplication

The other three approaches are based on the following observation.

The DFT relation is given by
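$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi}{N} n k}, \qquad k = 0, 1, \ldots, N-1.$$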

The DFT is thus a periodic sequence of period equal to N and has zero frequency at the k=0 index. Furthermore,
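$$X(k-M) = \sum_{n=0}^{N-1} \left[ x(n)\, e^{\,j \frac{2\pi}{N} n M} \right] e^{-j \frac{2\pi}{N} n k},$$

so that multiplying the input samples by the complex exponential $e^{\,j \frac{2\pi}{N} n M}$ shifts the DFT by M positions.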

Accordingly, if we want to move the zero frequency to k=N/2, then we have to choose M=N/2, so that
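$$x(n)\, e^{\,j \frac{2\pi}{N} n \frac{N}{2}} = x(n)\, e^{\,j \pi n} = (-1)^{n}\, x(n).$$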

In other words, the zero frequency can be moved to the middle of the sequence through a multiplication of the signal by a chessboard of 1s and -1s in the case when the signal has an even number of elements.
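Written as a standalone kernel, the operation amounts to something like the following sketch (names are illustrative only; in the code discussed below, the multiplication is actually folded into the FFT through a cuFFT callback):

```
#include <cufft.h>

// Chessboard multiplication for even N: each element is multiplied by (-1)^n
// before the forward FFT (no pow call is needed).
__global__ void chessboardMultiply(cufftComplex * __restrict__ d_data, const int N)
{
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N) {
        const float sign = (n & 1) ? -1.0f : 1.0f;
        d_data[n].x *= sign;
        d_data[n].y *= sign;
    }
}
```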

It should be noticed that, in the case of an odd number of elements, M should be chosen equal to (N-1)/2 so that
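$$x(n)\, e^{\,j \frac{2\pi}{N} n \frac{N-1}{2}} = x(n)\, e^{\,j \pi n \frac{N-1}{N}},$$

which is no longer a simple sequence of $\pm 1$ values.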

Accordingly, in this case, a full multiplication by a complex exponential is needed.
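A minimal sketch of this pre-multiplication for odd N could be the following (kernel name and launch details are illustrative only):

```
#include <cufft.h>

// Odd N: each input sample is multiplied by exp(j * pi * n * (N - 1) / N)
// before the forward FFT.
__global__ void oddPreMultiply(cufftComplex * __restrict__ d_data, const int N)
{
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N) {
        float s, c;
        sincosf(3.14159265358979f * n * (N - 1) / (float)N, &s, &c);
        const cufftComplex v = d_data[n];
        d_data[n].x = v.x * c - v.y * s;   // real part of v * (cos + j * sin)
        d_data[n].y = v.x * s + v.y * c;   // imaginary part
    }
}
```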

The chessboard multiplication is performed by using cuFFT callback routines.

cuFFT callback routines

Quoting the cuFFT library User’s guide,

Callback routines are user-supplied kernel routines that cuFFT will call when loading or storing data. They allow the user to do data pre- or post- processing without additional kernel calls.

Kernel calls indeed have a cost, as underlined in this post, and avoiding them as much as possible is a recommendable programming habit.

Notice that, quoting the static library section of the cuFFT library User’s guide:

For cuFFT and cufftw in version 9.0 or later any supported architecture can be used to do the device linking:

Static cuFFT compilation command:
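A typical form of the command is the following (illustrative only; the exact options depend on the CUDA toolkit version and on the target architecture, and callbacks additionally require relocatable device code):

```
nvcc -dc -o FFTShift_1D.o FFTShift_1D.cu
nvcc -o FFTShift_1D FFTShift_1D.o -lcufft_static -lculibos
```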

Since the chessboard multiplication is needed before the FFT, the callback routine is of the load type.

Note that, again quoting the cuFFT library User’s guide, the parameters for all of the __device__ load callbacks are defined as below:

  • offset: offset of the input element from the start of output data. This is not a byte offset, rather it is the number of elements from start of data.
  • dataIn: device pointer to the start of the input array that was passed in the cufftExecute call.
  • callerInfo: device pointer to the optional caller specified data passed in the cufftXtSetCallback call.
  • sharedPointer: pointer to shared memory, valid only if the user has called cufftXtSetCallbackSharedSize().

callerInfo and sharedPointer are not used in the considered example.
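To fix ideas, a load callback performing the chessboard multiplication, together with its attachment to the cuFFT plan, could be sketched as follows (names are illustrative and error checking is omitted; the actual code is at FFTShift_1D.cu):

```
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>

// Load callback: each input element is multiplied by (-1)^offset as it is
// loaded by the FFT, so no extra kernel launch is needed.
__device__ cufftComplex chessboardLoad(void *dataIn, size_t offset,
                                       void *callerInfo, void *sharedPointer)
{
    cufftComplex v = ((cufftComplex *)dataIn)[offset];
    const float sign = (offset & 1) ? -1.0f : 1.0f;
    v.x *= sign;
    v.y *= sign;
    return v;
}

// Device-side pointer to the callback routine.
__device__ cufftCallbackLoadC d_loadCallbackPtr = chessboardLoad;

// Host side: copy the callback pointer to the host and attach it to the plan.
void attachChessboardCallback(cufftHandle plan)
{
    cufftCallbackLoadC h_loadCallbackPtr;
    cudaMemcpyFromSymbol(&h_loadCallbackPtr, d_loadCallbackPtr,
                         sizeof(h_loadCallbackPtr));
    cufftXtSetCallback(plan, (void **)&h_loadCallbackPtr,
                       CUFFT_CB_LD_COMPLEX, NULL);
}
```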

Performance of the 5 versions of the 1D CUDA fftshift

Two versions of the fftshift referring to the memory-movement approach and returning in-place and out-of-place results, and three versions using different ways of implementing the chessboard multiplication, are now compared. The comparison refers to the case of an even number N of elements.

The CUDA code, executed within a Google Colab notebook, is available at our GitHub page.

The testing has been performed both on an old card (GTX 980, compute capability 5.2) and on a newer card (Tesla T4, compute capability 7.5). In the following tables, the test results are shown (timing in ms):

Results for the GTX 980.
Results for the Tesla T4.

As can be seen, a performance gain of about 15% can be achieved by the chessboard multiplication approach over the memory-movement approach. Note that the second version of the chessboard multiplication approach is very slow due to the use of the pow function.

PyCUDA version

To complete the discussion, a PyCUDA implementation of the fftshift in one dimension is also proposed on our GitHub page. Such an implementation exploits one of the two best-performing CUDA kernels examined above, implementing the fftshift by the chessboard multiplication for the case of an even number of elements. The case of an odd number of elements is also handled by a purposely devised CUDA kernel based on the above calculations.

