# Improving Performance and Throughput While Implementing FFT Using Tensilica ConnXB20 DSP with TIE

Fast Fourier transform (FFT) is a very useful block in Automotive ADAS radar, communications, robotics, and image processing. The standard FFT architectures require data re-order (Radix-2 approach) or in-between data shuffling (Stockham approach) which provides limited performance. This work represents the implementation of flexible and fast 2^{n}-data points complex FFT architecture using TIE, and uses unique data flow patterns. It uses four dedicated parallel read-write queues (FIFO) to improve the overall performance and throughput. The application specific SIMD and FLIX instructions are created using Tensilica Xtensa, and the FFT architecture is realized using ConnX B20 DSP. This approach requires only 768 cycles to perform complex 4K-FFT when compared to the 2070 cycles requirements in the ConnX B20 (library) itself. It is achieved by a total of 2048-bits read and write per cycle (2 to 4 queues in parallel).

## Introduction

A large size (512 to 2K) FFT is a major requirement and used invariably in applications such as communications, automotive, image processing, autonomous driving, etc. Running efficient FFT in real time has always been a challenging task. For efficient implementation of FFT, 4 FIFO queues are used with a data bandwidth of 32-1024 bits per entry/ cycle and avoids data reordering/ shuffling by using unique data flow pattern. For the implementation of FIFO and results comparisons Tensilica ConnX B20 is used as a reference. In this blog, Prasath Kumaraveeran, Fraunhofer IIS/EAS, describes the challenges of FFT implementations and explains how the usage of Tensilica Instruction Extension (TIE) accelerates the FFT implementation. The video of this talk is available at CadenceLIVE Americas 2021.

## Challenges

There are many issues related to performance of FFT. Data-points are taken from memory and the result of FFT/Butterfly operation is stored back to the memory via shared bus while accessing the common access bus accounting to some delay and this bus-data bandwidth often limit the speed of FFT. Data reordering issue in Radix-2 and Data shuffling in Stockham are other concerns in efficient implementation of FFT. Radix-2 for 8-point FFT is as shown below in Figure 1. This standard implementation involves 2 major issues, the first one being data reordering. Secondly, replicating the implementation from step-2 is not possible in queue.

Figure 1: Radix-2 FFT Implementation issues (a) Reordering (b) replication

So, it is hard to implement Radix-2 FFT using queues. Another, challenge can be examined by evaluating the Stockham implementation of 8-point FFT. It avoids any reordering yet suffers from data reshuffling issue as shown in Figure 2 below and is very complex and not realizable using FIFO (queue).

Figure 2: Stockham Implementation of 8 Point FFT

**Solution **

For the efficient usage of FFT in real time applications, these issues have been addressed in the solution proposed and it avoids shuffle exchange also there is no data reordering and it uses the same butterfly pattern at each stage. The proposed scheme is as shown below in Figure 3:

For the implementation of this complex FFT, the input, output and twiddle factors (32 bits each) have real as well as imaginary components. Also, a separate reusable queue is used to handle the twiddle factors to reduce the computational complexity at the cost of area. So, a trade-off is required between complexity and area. The twiddle factors for the FFT can either be computed during runtime or pre-computed and stored in a separate reusable queue. The latter technique is followed in this work in order to reduce the computational complexity. Four hardware parallelization methods are proposed to achieve better performance based on the FFT sizes as shown in Figure 4 below.

In every cycle, each method reads data-points from two different queues, performs butterfly operation and writes the results not necessarily in the same two queues. The 1st method takes only 2-data points per cycle and performs the butterfly operation, whereas the 2nd, 3rd and 4th methods take 16-, 32- and 64-data points per cycle respectively. Each data point is considered as 16-bits real and 16-bits imaginary fixed-point complex element. The 4th method requires two 32-data points from two respective queues, which corresponds to the data width of 1024-bits per queue. In each cycle, the 1st method reuses 4,32-bit multipliers and 8 16-bit adders that are required to perform a butterfly operation and the 4th method requires reusable 128 multipliers and 256 adders.

The data flow pattern followed in this work ensures the reusability of twiddle factors. Thus, the twiddle factor computation or the twiddle queue read is not required in every cycle. With any fixed hardware parallelization methods, the FFT size can be configured. For instance, with the 1st method, 24 to 212-point FFTs or above can be realized. Similarly, with the 4th method, 29 to 212-point FFTs or above can be realized. With the 1st method, a 16-point complex FFT takes 32 cycles to perform the whole operation. The 4th method computes 512-, 1024-, 2048- and 4096-FFTs within 72, 160, 352 and 768 cycles respectively.

## Outcomes:

The proposed scheme helps in improving the performance and is very effective in resource utilization as it reuses the same butterfly hardware for the computation. The table below, shows the comparison of proposed scheme as compared to simple C language implementation without any TIE file. Hardware implementation method 1 to 4 has shown different speed-ups depending on parallelization effort of Butterfly networks. Results from method 1 to method 4 clearly show the impact of parallelization as it helps in speeding up the operation to 3126 times in method 4 i.e., 3-5 times that of ConnX B20 without parallelization.

With respect to resource utilization, method-1 uses one butterfly network uses 4 multipliers of 32 bits each and 8 adders of 16 bit each. These are reused every time FFT is calculated. Similarly, there are 32, 64 ,128 multipliers of 32 bits each and 64, 128 and 256 adders of 16 bit each in methods 2,3 and 4 respectively. Another, benefit of proposed scheme is that each method can support multiple points of FFT with little adjustments in 2 states/registers only hardware parallelization has to be fixed but the FFT size can be varied during run time.

**Summary: **

FFT has been implemented using Tensilica Instruction Execution (TIE) language and FIFO based system. The parallelization method has been compared with Tensilica ConneXB20 library. The results clearly show the advantages of parallelization scheme as it helps in efficient implementation of FFT for real time applications.