© 2006-2015 Asian Research Publishing Network (ARPN). All rights reserved.



**E** 

www.arpnjournals.com

# DESIGN OF LOW POWER FFT PROCESSORS USING MULTIPLIER LESS ARCHITECTURE

Senthil Sivakumar M.<sup>1</sup>, Gurumekala T.<sup>2</sup>, Sundaram A.<sup>3</sup>, Thandaiah Prabu R.<sup>4</sup>, Arputharaj T.<sup>5</sup> and Banupriya M.<sup>6</sup>

<sup>1</sup>Department of Electronics, Madras Institute of Technology, Chennai, India

<sup>2</sup>Department of Computer Science Engineering, PSR Engineering College, Sivakasi, India

<sup>3</sup>Department of Electronics Engineering, University of Wolkie, Ethiopia

<sup>4</sup>Department of Electronics and Communication Engineering, Prathyusha Institute of Technology and Management, Chennai, India

<sup>5</sup>Department of Electronics and Communication Engineering, St. Joseph College of Engineering and Technology, Tanzania

<sup>6</sup>Department of Computer Science Engineering, National Engineering College, Kovilpatti, India

#### ABSTRACT

In this paper, we present a novel restructured coefficient ordering based 16 point pipelined FFT processor. The projected novel FFT has been designed with the use of fixed radix-4 and single path pipelined architecture. Higher throughput rate is gained from this pipelined architecture when compared to ordinary pipelined architecture. The power consumption issue is fixed by reducing the switching activity with the use of least transition in Hamming distance. Through this, the switching activity of twiddle computation is reduced from 192 to 78 which is consisting 59% of reduction. Introduced multiplier less architecture cuts down the number of computations to realize complex multiplication. The 16-point FFT implementation is done with Verilog HDL and synthesized using 0.18um Cadence RTL compiler. The power evaluation of FFT has been obtained from the circuit net list using a clock frequency of 100MHz.

Keywords: fast fourier transform (FFT), radix-4, parallel architectures, OFDM, Verilog HDL.

#### 1. INTRODUCTION

The FFT processor is a critical block in orthogonal frequency division multiplexing (OFDM) technology. Due to the nature of uncontrollable processing on the same clock frequency of sampling data, most preference is given to pipeline FFT especially for a low power solution or high throughput. The commutator and the complex multiplier blocks at each stage contribute a dominating part of the entire power consumption in the pipelined architecture. This paper proposes an optimal design to minimize one of the significant power consuming factors known as the switching activity. The coefficient ordering method is followed to reduce the amount of switching activity between successive coefficients which are used by complex multipliers. The coefficient ordering requires a consistent data sequence as per new ordering OF coefficients. Thus, we can attain the less hardware complexity and maximum efficiency.

The architecture of digital processing based MC-CDMA receiver consists of FFT block, combiner and Viterbi decoder. In which the logic block FFT is consuming more power compared to other logic blocks. The total power used for the receiver can be reduced significantly by reducing the power consumption of these blocks particularly in FFT block. The logic structure of Multiplier-less based parallel-pipelined FFT architectures for applications of wireless communication is proposed in [7] by Han, Arslan et al. In this paper, the logic block of multiplier is replaced by adders and shift registers. The logic blocks introduced instead of complex multiplier are reducing the total switching activities of FFT and minimizing the count of computations. Through these advancements, it reduced the total power consumption of FFT considerably. Senthil Sivakumar M, et al. introduced the parallel-pipelined FFT architecture for reducing the power consumption of the FFT. The complex logic blocks, i.e., commutator, multiplier, butterfly architectures are modified and introduced for low power consumption. So as to reduce the switching activity, the low power IDR commutator is introduced [1], [2], [3] and [5]. To reduce the power consumption [1], [2] and [5], multiplier less architecture has been implemented. The butterfly architecture is replaced with 2's complement operation and simple adders which diminish the power consumption of FFT notably [1], [2] and [5]. Besides that, these low power architectures are proposed in [4], [6] and [8] with various FFT architectures. The Simulation, design and analysis of low power MIMO-OFDM system and its implementation on FPGA is performed by Bhattacharjee et al. The low power FFT processor's FPGA implementation is suggested in [10] and [11] with an orthogonal frequency division multiplexing method that increase the data transfer rate of transceivers on parallel sub-carriers. This can be used in the multicarrier transceivers to increase the throughput of the device.

In this paper, FFT algorithm is described in Sec II. Next section III describes implementation of coefficient reordered pipelined FFT. Reordering of coefficients and accordingly the reordering of input information are explained in detail with the reference of flow graph of FFT. Sec IV presents the proposed minimum switching activity design and multiplier less approach with the reference of coefficient reordering. The FFT processor has been implemented with 16 bit complex data and is presented in Sec V.

©2006-2015 Asian Research Publishing Network (ARPN). All rights reserved



#### www.arpnjournals.com

#### 2. ALGORITHM

The N point DFT can be expressed as

$$X(K) = \sum_{n=0}^{N-1} x(n) W_N^{-nk}, 0 \le k \le N - 1$$
(1)

The exp(-j2 $\Pi$ nk/N) is usually represented with  $W_N$ <sup>nk</sup>; Where,  $W = exp(-j2\Pi/N)$ . Let, N is a composite number of v integers such that N= r<sub>1</sub> r<sub>2</sub> r<sub>3</sub>.....r<sub>v</sub>, and define

$$N_t = N / r_1 r_2 r_3 \dots r_t$$
, and  $1 \le t \le v-1$  (2)

Where, t is number of stages of the decomposed DFT and  $r_t$  is the radix. With the use of recursive property and the relationship  $W_{Ni \ Nj} \stackrel{Nj \ k}{=} W_{Ni} \stackrel{k}{\stackrel{k}{=}} \text{for radix } r_1$  Equation (1) becomes

$$X(K) = \sum_{q1=0}^{N_r - 1} W_N^{q1k} \sum_{p=0}^{r1-1} X(N1p + q1) W_{r1}^{pk}$$
(3)

The below mentioned equation defines the computation used for the first stage. The final stage is stated as follows.

$$X(r_{1}r_{2}....r_{v-1}m_{v} + ....r_{v}m_{2} + m_{1}) = \sum_{q_{v}-1=0}^{r_{v}-1} W_{r_{v}}^{m_{v}q_{v}-1} x_{v-1}(q_{v-1}, m_{v-1})$$
(4)

The computation used for intermediate stages are given by the following recursive equation,

$$x_{t}(q_{1},m_{t}) = W_{N_{t-1}}^{q_{t}m_{t}} \sum_{p=0}^{r_{t}-1} W_{r_{t}}^{pm_{t}} x_{t-1}(N_{t}p + q_{t},m_{t-1})$$
(5)

Where,

$$\begin{array}{l} 0 \leq q_{t} \leq N_{t} - 1; \quad 2 \leq t \leq v; \quad and \\ N_{t} = \frac{N}{(n_{1}n_{2}, \dots, n_{p})}; \quad 2 \leq t \leq v - 1; \quad 0 \leq m_{t} \leq n - 1 \end{array}$$

For  $r_1$ = 4, based on the above formulation the flow graph of a 16 point FFT is constructed as shown in Figure-1. The corresponding equation is as follows,

$$X(4m2+m1) = \sum_{q1=0}^{3} W_4^{qm2} X_1(q1,m1)$$
(6)

Where,

$$X_1(q1,m1) = W_{16}^{q1m1} \sum_{q1=0}^{3} W_4^{qm1} X(4p+q1)$$

In the Figure-1 each open circle represents a summation, while the dots define boundaries of the stage. The integer outside the open circle is representing the power of FFT twiddle factor  $W_N$  used. The value of m1 (for stage 1) or m2 (for stage 2) are used to denote integer inside each open circle. The FFT coefficient is denoted by the number outside the open circle is the FFT coefficient.

#### 3. REREORDERED PIPELINED FFT STRUCTURE

A pipelined N point radix- 4 FFT Processor is shown in Figure-1 in which the structure is represented based on the previously described algorithm and that will have log<sub>4</sub> N stages. Each stage yields one output within each word cycle where each stage contains a commutator, a butterfly and a complex multiplier. The gained consecutive output of each stage must be well-ordered in accordance with the value of  $m_t$ . For example, in stage 1 of Figure-1, the output related with  $m_1$ =0 is produced in first four word cycles, then those related with  $m_1$ =1 in the next four cycles and so on.



Figure-1. Signal flow graph of Radix-4 16-point FFT.

### 4. IMPLEMENTATION OF FFT

The coefficients of FFT are reordered to minimize the hamming distance for each coefficient transition which is helpful to minimize switching activity between successive coefficients. The hamming distance is stated as the number of 1's present in the XOR operation between two binary coefficients. The 15 bit fixed points are used to encode both the original and ordered coefficient sequence. Transition matrix between each coefficient based on the hamming distance is developed to acquire the minimum switching activity.

 Table-1. The transition matrix of switching activity between two coefficients.

|                | $\mathrm{W}^0$ | $\mathbf{W}^1$ | $W^2$ | $W^3$ | $\mathrm{W}^4$ | $\mathrm{W}^{6}$ | W <sup>9</sup> |
|----------------|----------------|----------------|-------|-------|----------------|------------------|----------------|
| $\mathbf{W}^0$ | 0              | 15             | 17    | 19    | 3              | 21               | 13             |
| $\mathbf{W}^1$ | 15             | 0              | 14    | 16    | 12             | 20               | 24             |
| $W^2$          | 17             | 14             | 0     | 14    | 14             | 16               | 14             |
| $W^3$          | 19             | 16             | 14    | 0     | 16             | 12               | 18             |
| $W^4$          | 3              | 12             | 14    | 16    | 0              | 20               | 16             |
| $W^6$          | 21             | 20             | 16    | 12    | 20             | 0                | 16             |
| W <sup>9</sup> | 13             | 24             | 14    | 18    | 16             | 16               | 0              |

**ARPN** Journal of Engineering and Applied Sciences



#### www.arpnjournals.com

From the transition matrix, we can arrange the twiddle factor FFT to minimize the switching activity easily. By this way switching activity of FFT reduced from 192 to just 78, a reduction of 58% achieved by this approach. To have a change in coefficient reordering corresponding data to be reordered accordingly. This reordered data sequence has to be converted back into normal data sequence for its stage 2.

From the transition matrix it is proved that the switching activity decreases from 192 to just 78, a reduction of 58%. The multiplier less architecture proposed in this paper uses shifter, adders and subtrators to replace the complex multipliers of conventional multiplier using common sub expression sharing method. The proposed shift registers and adders reduce the total power consumption of FFT processor by reducing the switching activity, amount of computations which are existing with the complex multipliers.

**Table-2.** Ordered and conventional coefficient sequence for a 16 point radix -4 FFT.

| Conventional    | Coefficients | Ordered                 | Coefficients |  |
|-----------------|--------------|-------------------------|--------------|--|
| W0              | 4000,0000    | W0                      | 4000,0000    |  |
| W0              | 4000,0000    | W0                      | 4000,0000    |  |
| W0              | 4000,0000    | W0                      | 4000,0000    |  |
| W0              | 4000,0000    | W0                      | 4000,0000    |  |
| W0              | 4000,0000    | W0                      | 4000,0000    |  |
| W1              | 3b20,e782    | W0                      | 4000,0000    |  |
| W2              | 2d41,d2be    | W0                      | 4000,0000    |  |
| W3              | 187d,c4df    | W4                      | 0000,c000    |  |
| W0              | 4000,0000    | W1                      | 3b20,e782    |  |
| W2              | 2d41,d2be    | W2                      | 2d41,d2be    |  |
| W4              | 0000,c000    | W2                      | 2d41,d2be    |  |
| W6              | d2bf,d2bf    | W3                      | 187d,c4df    |  |
| W0              | 4000,0000    | W3                      | 187d,c4df    |  |
| W3              | 187d,c4df    | W6                      | d2bf,d2bf    |  |
| W6              | d2bf,d2bf    | W6                      | d2bf,d2bf    |  |
| W9              | c4e0,187e    | W9                      | c4e0,187e    |  |
| Total number    | of Switching | Total number of         |              |  |
| transitions-192 |              | Switching transitions - |              |  |
|                 |              | 78                      |              |  |



Figure-2. Signal flow graph of Radix-4 ordered 16point FFT.

The change in coefficient ordering involves corresponding data reordering accordingly. The ordering can be generated by a commutator which pipelines the serial input data to a four parallel output for a radix- 4 butterfly computation.

#### a) Multiplier-less R4SDC FFT architecture

The number of multiplications that are used as a key metric for comparing FFT algorithms required additional evaluation time, which impact large on the execution time and power consumption of FFT processor. In this paper, we present 16 point FFT butterfly, which reduces the complexity present in multiplier by using real, constant multiplications. In this approach Canonical Signed-Digit (CSD) and Common sub expression sharing techniques are used to reduce the power consumption of the proposed multiplier. Canonical Signed-Digit (CSD) is one widely used redundancy of signed digit code to replace the conventional multiplier digits. Common sub expression sharing segments the sub expression among several multiplication-accumulation operations to reduce the total number of addition and shift operations [1] and [2].

#### b) Complex multiplication

First, we discuss the effectuations of realizing complex multiplications with real multiplication procedure. The product of complex numbers=A+jB and Y=C+jD is (A+jB) (C+jD)=(AC-BD)+j(AD+BC). The direct computations of complex multiplications implemented by employing four real multiplications and two additions which increase the chip area and power consumption.

Alternatively to reduce the complex multiplications the original computations are modified as follows.

m0=(A+B)(C+D) m1=AC m2=BD (A+jB)(C+jD)=(m1-m2)+j(m0-m1-m2)

Hence the complex multiplications can be reduced to three real multiplications and three additions. The above implementations are applicable for general complex multiplications, i.e., the data and coefficients are variable.

#### 5. RESULTS AND ANALYSIS

The conventional and ordered pipelined FFT% processor architectures have been implemented and then synthesization is done with the use of 0.18um Cadence RTL Compiler. Power evaluation was then carried on the circuit net list using a clock frequency of 16MHz. The switching activity decreases from 192 to 78, a reduction of 58%. The comparative power consumptions for different FFT lengths, FIFO implementation modes and various low power multipliers are given in Table-3.

The FIFO was implemented in two different ways, namely SR (shift register based) and DM (dual port RAM based). A 16 point FFT processor design has been carried ©2006-2015 Asian Research Publishing Network (ARPN). All rights reserved.



#### www.arpnjournals.com

out using three different multiplier namely Wallace tree, carry save array (CSA) and Non booth coded Wallace tree (NBW) types of multipliers. It is proved from Table-3 that the ordered architecture gives power savings for two multiplier types.

**Table-3.** Power consumption table of 16-Point FFT.

| 16 Point FFT processor              |                        |                                    |                 |  |  |  |  |
|-------------------------------------|------------------------|------------------------------------|-----------------|--|--|--|--|
|                                     | Type of multipliers    |                                    |                 |  |  |  |  |
| FFT                                 | Carry<br>save<br>array | Non-booth<br>coded<br>Wallace tree | Wallace<br>tree |  |  |  |  |
| Conventional SR based in (mW)       | 76.65                  | 78.03                              | 75.33           |  |  |  |  |
| Conventional<br>DM based in<br>(mW) | 182.97                 | 170.56                             | 171.16          |  |  |  |  |
| SR based (mW)                       | 70.97                  | 68.45                              | 73.13           |  |  |  |  |
| DM based (mW)                       | 124.47                 | 114.47                             | 121.39          |  |  |  |  |
| Ordered (mW)                        | 65.49                  | 58.24                              | 70.92           |  |  |  |  |
| % saving (SR)                       | 8                      | 14                                 | 3               |  |  |  |  |
| % saving (DM)                       | 47                     | 49                                 | 41              |  |  |  |  |



Figure-3. Power consumptions of ordered and conventional FFT.

#### 6. CONCLUSIONS

In this paper, the modified coefficient ordering technique is applied to 16 point FFT processor. The introduced multiplier less architecture is implemented with three different techniques: Carry save adder, Non-booth coded Wallace tree, Wallace tree. The commutator of FFT is implemented with shift registers and DRAM methods and the power consumptions of different structures are compared with each other. The FFT processor also implemented with conventional and proposed methods using Verilog HDL and synthesized using cadence design tool. The switching activity decreases from 192 to 78 in a whole 16 point cycle, a reduction 61%. It is shown that the switching activity is minimized through the transition matrix.

#### REFERENCES

- [1] Senthil Sivakumar M., Banupriya M. and Arockia Jayadhas. 2012. "Design of Low Power High Performance 16-Point 2-Parallel Pipelined FFT Architecture", International Journal of Electronics, Communication & Instrumentation Engineering Research and Development (IJECIERD), Vol.2, No. 3 Sep 2012, pp. 12-26.
- [2] Senthil Siva Kumar M., Arockia Jayadhas S., Arputharaj T. and Banupriya M. 2013. "Design of adaptive MC-CDMA receiver using low power parallel-pipelined FFT architecture" IEEE proc. on PACT, pp.29-33.
- [3] Wei Han, Ahmet T. Erdogan, Tughrul Arslan and Mohd. Hasan. 2008. "High-Performance Low-Power FFT Cores," ETRI Journal, Vol. 30, No. 3, June.
- [4] W. Han, A.T. Erdogan T. Arslan and M. Hasan. 2004. "A Novel Low Power Pipelined FFT Based on Sub expression Sharing for Wireless LAN Applications," IEEE Signal Processing Systems Workshop (SIPS), October.
- [5] Senthil Sivakumar M., S. A. Jayadhas, Arputharaj T. and Banupriya M. 2013. "Design of Dynamically Reconfigurable Adaptive MC-CDMA Receiver using FFT Architecture", African Journal of Information and Communication Technology (AJICT), Vol.7, No.2, pp.49-61.
- [6] Molisch A. 2011. "Orthogonal Frequency Division Multiplexing (OFDM)", Wiley-IEEE Press eBook, pp. 417-443.
- [7] Han, Arslan, Erdogan and Hasan. 2005. "Multiplierless based parallel-pipelined FFT architectures for wireless communication applications", (ICASSP '05), IEEE, Vol. 5, pp. 45-48.
- [8] Wei Han, Ahmet T. Erdogan, Tughrul Arslan and Mohd. Hasan. 2008. "High-Performance Lo w-Power FFT Cores," ETRI Journal, Vol. 30, No. 3, June.
- [9] Bhattacharjee, Sil, Dey, Chakrabarti. 2011. "Simulation, design and analysis of low power MIMO-OFDM system and its implementation on FPGA", (ReTIS), IEEE, pp. 88-93.
- [10] Zhuo Qian, Nasiri N, Segal O, Margala M, "FPGA implementation of low-power split-radix FFT processors" 24<sup>th</sup> International Conference on Field Programmable Logic and Applications (FPL), 2014, pp.1 – 2.
- [11] In-Gul Jang, Zhe-Yan Piao, Ze-Hua Dong, Jin-Gyun Chung and Kang-Yoon Lee. 2011. "Low-power FFT

## ARPN Journal of Engineering and Applied Sciences

©2006-2015 Asian Research Publishing Network (ARPN). All rights reserved.



#### www.arpnjournals.com

design for NC-OFDM in cognitive radio systems", IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2449–2452.