ARPN Journal of Engineering and Applied Sciences © 2006-2015 Asian Research Publishing Network (ARPN). All rights reserved.

www.arpnjournals.com

# PARALLEL MULTIPLIER-ACCUMULATOR UNIT BASED ON VEDIC MATHEMATICS

Jithin S. and Prabhu E.

Department of Electronics and Communication Engineering, VLSI Design, Testing and Security Group Amrita Vishwa Vidyapeetham University, Coimbatore, India

#### ABSTRACT

In this paper, an efficient parallel multiplier and accumulator (MAC) unit based on Vedic mathematics is presented. The multiplier design uses the Urdhva-tiryagbhyam sutra of Vedic mathematics. The proposed MAC architecture enhances the speed of operation while reducing the gate area and power dissipation. Delay is further improved by using a Vedic encoder and by removing the separate accumulator stage: the intermediate results are fed back in parallel to the input. Such pipelining of the midway results, prior to the final adder, has the effect of combining the accumulator stage with the partial product stage of the multiplier. Further, the overall computation speed of the MAC unit is raised by the efficient use of higher order compressors in the merged partial product compression and accumulator (PPCA) architecture. The area, timing and power reports show that the critical path delay of the proposed design is significantly reduced and that it outperforms the existing designs. We report an improvement of 20-30% and 7-18%, respectively, for the 4-bit and 8-bit Vedic MAC units in terms of total circuit power, critical path delay and cell area. The architecture was synthesized using a standard 90nm CMOS library and implemented on Altera's Cyclone II series FPGA.

Keywords: Vedic mathematics, 4:2 compressor adder, Urdhva-tiryagbhyam sutra, parallel MAC unit, modified Booth multiplier.

### **1. INTRODUCTION**

Multiply-accumulate (MAC) units are widely used in digital signal processing (DSP), cryptography, and real-time audio and multimedia image processing applications [1]. Most signal processing techniques, such as the discrete cosine transform and the discrete wavelet transform, involve continuous multiplication and addition. A high speed MAC unit can therefore complete the entire calculation in a short period of time. The processing speed largely depends on the speed of the multiplier. The higher radix modified Booth algorithm [2] is an efficient algorithm for high speed multiplication. However, it does not completely resolve the issue, because of the higher critical path delay in the multiplication stages. In general, a multiplier is divided into three stages: Booth encoder, partial product addition and final adder.

Research on MAC hardware mainly focuses on enhancing speed to achieve better performance. An advanced high speed MAC unit was proposed by Elguibaly [3]. In that architecture, the overall critical path delay is reduced by combining the accumulator stage with the multiplier's partial product addition stage. More recently, Young-Ho Seo *et al.* [4] proposed another high speed parallel MAC architecture based on the modified Booth algorithm [1] in 1's complement form. It uses a carry lookahead adder (CLA) in the carry save adder (CSA) array architecture, which reduces the number of inputs given to the final adder, thereby reducing the critical path delay.

In this paper, a delay efficient MAC unit based on Vedic mathematics [5] is proposed. The multiplier is designed using the Urdhva-tiryagbhyam (UT) sutra of Vedic mathematics, rediscovered by S. B. K. Tirthaji [5] [6]. In the Vedic parallel MAC hardware, the Booth encoder is replaced by a Vedic encoder [7] with less gate delay. The merged Vedic partial product addition [7] and accumulation stage is optimized with higher order compressors [8].

The paper is structured into five sections. Section 2 discusses the conventional parallel MAC unit. Section 3 gives a detailed description of the proposed MAC unit and PPCA architecture. Section 4 analyses the implementation and comparison results. Finally, Section 5 summarizes the proposed architecture and its future scope.



Figure-1. Modified booth parallel MAC architecture.

## 2. CONVENTIONAL PARALLEL MAC ARCHITECTURE

The architecture of the parallel MAC unit [4] is shown in Figure-1. It uses the 1's complement based radix-2 modified Booth algorithm for the multiplier design. The accumulator is merged with the partial product addition section of the multiplier to form a hybrid carry save adder (CSA) tree architecture. The accumulation and addition of partial products take place on this Wallace tree based CSA. MAC execution is further improved by using an efficient sign extension scheme [9] in the partial product addition design.

If the multiplicand and multiplier are n bits wide, the number of partial products generated by the radix-2 based Booth encoder is reduced to half of n, which in turn reduces the number of steps in the CSA tree. The carry lookahead adder in the CSA tree reduces the number of bits input to the final adder, thereby reducing the overall critical path delay. U, V and W represent the intermediate results, which are used for accumulation during the next multiplication operation. Finally, the most significant bits (MSB) of the MAC result are taken from the final full adder circuit.

#### 3. PROPOSED PARALLEL MAC UNIT

In this section, a new Vedic parallel MAC unit is derived from the conventional designs [3] [4]. In addition, an efficient partial product compression and accumulation (PPCA) architecture is proposed by merging the accumulator with the partial product addition part of the Vedic multiplier.

The three-stage multiplication-accumulation operation of the Vedic MAC is shown in Figure-2. From the given n bit multiplicand and multiplier, the Vedic encoder in stage 1 generates the partial products. Accumulation is carried out by feeding back the intermediate results from stage 2 rather than those from stage 3. Stage 3 is the final addition, and it does not run until the final accumulation operation. During each multiplication, the previous results are fed back to the input of stage 2 without passing through the final adder.



Figure-2. Proposed Vedic parallel MAC architecture.



Figure-3. Vedic multiplier architecture.

The Vedic MAC architecture equivalent to the design of Figure-1 is shown in Figure-2. X and Y are n bit wide MAC inputs, which are converted to partial products by the UT based Vedic encoder. The PPCA stage is built from an array of 4:2 compressors, full adders and half adders. P and Q represent the MSBs and M represents the LSBs of the intermediate result fed back in parallel for the next accumulation. Finally, the MAC result is obtained by combining the R [2n-1: n] and R [n-1: 0] bits shown in Figure-2, where the R [2n-1: n] bits are generated by summing S and C in the final adder circuit.
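The feedback of intermediate sum and carry words, with a single final addition at the end, can be illustrated with a word-level behavioural sketch. This is only an approximation under stated assumptions: the function names, the `guard_bits` parameter and the use of plain shifted AND rows as partial products are hypothetical, and the model does not reproduce the gate-level PPCA wiring of Figure-2.

```python
def carry_save_add(a, b, c, width):
    """Compress three words into a (sum, carry) pair without propagating carries."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                            # bitwise sum
    cy = (((a & b) | (a & c) | (b & c)) << 1) & mask  # bitwise majority, shifted left
    return s, cy


def mac_carry_save(pairs, n, guard_bits=8):
    """Accumulate x*y over all pairs, keeping the running total in carry-save form.

    guard_bits is an assumed margin so that the accumulated value does not overflow.
    """
    width = 2 * n + guard_bits
    mask = (1 << width) - 1
    s, c = 0, 0  # intermediate sum and carry words, fed back on every iteration
    for x, y in pairs:
        for j in range(n):
            # One row of partial products: x AND-ed with bit j of y, weighted by 2**j.
            row = (x << j) if (y >> j) & 1 else 0
            s, c = carry_save_add(s, c, row, width)
    # The carry-propagating final adder runs only once, after the last accumulation.
    return (s + c) & mask


if __name__ == "__main__":
    print(mac_carry_save([(3, 5), (7, 9), (15, 15)], n=4))
```

The point of the sketch is that `s + c` always equals the running total, so the slow carry-propagating addition can be deferred until the last operand pair has been processed.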

#### A. Vedic multiplier

The most widely preferred multiplication algorithm in Vedic mathematics is Urdhva-tiryagbhyam [6]. In Sanskrit, the name means "vertically and crosswise". The same concept can be used for both integer and binary multiplication. A Vedic multiplier consists of an encoder and a partial product addition stage; its hardware architecture is shown in Figure-3.

Compared with the encoder stage of existing multiplication algorithms, the partial products generated by the Vedic encoder [7] incur only a single logical AND gate delay. Since the generation of partial products and their summation occur in parallel, the multiplier is independent of the processor's clock frequency. Partial product summation is accomplished using half adders and full adders. Consider the multiplication of two 2 bit numbers A and B.  $P_0$  to  $P_3$  represent the bits of the final product and are obtained from equations (1) to (4).  $C_1$  and  $C_2$  represent the carry bits produced during the addition of partial products.

$$P_0 = A_0 \times B_0 \tag{1}$$

$$C_1P_1 = (A_0 \times B_1) + (A_1 \times B_0) \tag{2}$$

$$C_2P_2 = (A_1 \times B_1) + C_1 \tag{3}$$

$$P_3 = C_2 \tag{4}$$
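Equations (1) to (4) can be checked with a short bit-level sketch. The function name `ut_multiply_2bit` is hypothetical; the AND operations stand for the single-gate partial products, and each two-bit sum is split into a carry and a product bit as in the equations.

```python
def ut_multiply_2bit(a, b):
    """Multiply two 2 bit numbers following equations (1) to (4)."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1

    p0 = a0 & b0                   # equation (1): vertical product of the LSBs
    t = (a0 & b1) + (a1 & b0)      # equation (2): crosswise products
    p1, c1 = t & 1, t >> 1
    t = (a1 & b1) + c1             # equation (3): vertical product of the MSBs plus C1
    p2, c2 = t & 1, t >> 1
    p3 = c2                        # equation (4)

    return (p3 << 3) | (p2 << 2) | (p1 << 1) | p0


if __name__ == "__main__":
    for a in range(4):
        for b in range(4):
            assert ut_multiply_2bit(a, b) == a * b
    print("all 16 input combinations match a * b")
```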

The partial product addition stage of Vedic multiplier is modified by combining it with accumulator stage, which results in a new PPCA architecture.

#### **B.** Compressor adder

A compressor adder [8] is a digital circuit capable of adding four or more bits at a time, and is used to improve the overall computational speed of processors. The arrays of full adders and half adders in the proposed PPCA architecture are replaced by compressors, thereby achieving a high performance MAC architecture. This paper mainly concentrates on the 4:2 compressor adder [10] shown in Figure-4, which adds four input bits and one carry bit to produce three output bits. In Figure-4, X1, X2, X3 and X4 are the input bits and Cin is the carry input bit.

If we assume an individual gate delay of 1tp, the total critical path delay of a normal 5 bit adder is 4tp and that of a 4:2 compressor adder is 3tp. Therefore, about 20% improvement in speed can be achieved by using 4:2 compressors instead of conventional full adders for the 5 bit addition operation.
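The arithmetic behaviour of a 4:2 compressor can be sketched as follows. This uses the textbook construction from two cascaded full adders, which is an assumption made for illustration; the optimized gate structure of Figure-4 may differ, and the sketch says nothing about its 3tp delay. The function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Add four input bits and a carry-in bit, producing (sum, carry, cout).

    The outputs satisfy x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    cout depends only on x1, x2 and x3, so it does not wait for cin.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        assert sum(bits) == s + 2 * (carry + cout)
    print("all 32 input combinations verified")
```

Because `cout` is independent of `cin`, compressors placed side by side in a row do not form a ripple chain, which is the property that makes them attractive for partial product reduction.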

#### **C. Proposed PPCA circuit**

The PPCA architecture that performs both partial product addition and accumulation for a  $4\times4$ -bit operation is shown in Figure-5. In Figure-5, k[i] is the ith partial product bit generated by the Vedic encoder shown in Figure-3. Since the multiplication is for 4 bits, a total of 16 partial products k [15:0] are generated.
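As a minimal sketch of how the 16 partial products arise, the snippet below generates them with one AND operation each and sums them by column weight. The index mapping k[4*i + j] = A[j] AND B[i] is an assumption made for illustration only; the actual ordering of k[15:0] is defined by Figure-3 and Figure-5 and may differ.

```python
def partial_products_4x4(a, b):
    """Return the 16 AND-gate partial products of two 4 bit operands.

    Assumed ordering: k[4*i + j] = bit j of a AND bit i of b.
    """
    return [((a >> j) & 1) & ((b >> i) & 1) for i in range(4) for j in range(4)]


def sum_by_column(k):
    """Add the partial products according to their binary weight 2**(i + j)."""
    total = 0
    for idx, bit in enumerate(k):
        i, j = divmod(idx, 4)
        total += bit << (i + j)
    return total


if __name__ == "__main__":
    for a in range(16):
        for b in range(16):
            assert sum_by_column(partial_products_4x4(a, b)) == a * b
    print("column sums reproduce a * b for all 4 bit operands")
```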



Figure-4. Compressor adder architecture.



Figure-5. Proposed 4 bit PPCA architecture.

P[i] and Q[i] correspond to the sum and carry of the ith bit, which are fed back to the input for accumulation. M[i] represents the lower bits (R [3:0]) of the final MAC result, produced in advance by adding partial products. After the final multiplication operation is complete, the generated P[i] and Q[i] are given as inputs to the final adder circuit, which outputs the higher bits (R [7:4]) of the final MAC result. p[i], q[i] and m[i] represent the previous intermediate results used for accumulation. The 4:2 compressor adder carries out the 5 bit addition with less gate delay than conventional full adder circuits. Several combinations of full adder circuits in the PPCA architecture are replaced by 4:2 compressors to speed up the MAC operation. If a 4:2 compressor adder is short of one input bit, the missing input is tied to '0' in the PPCA architecture. XOR gates are used at the MSB part of the design to prevent carry propagation.

Since the lower bits of the MAC result are generated ahead of the final adder stage, the number of bits input to the final adder decreases, which reduces the critical path delay. The PPCA architecture that performs the  $8\times8$ -bit operation is shown in Figure-6. It is designed using the same concept as the  $4\times4$ -bit operation shown in Figure-5.

#### D. Final adder

Final adders are logic circuits capable of adding a variable number of binary inputs. Two types of final adder are mostly used to sum the sum and carry bits from the PPCA architecture: the CLA and the Kogge-Stone adder (KSA) [11]. In both circuits the carries are computed in advance from the input bits, so a high speed-up is achieved through the reduced critical path delay.
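The idea of computing carries in advance can be sketched with a word-level model of a Kogge-Stone parallel prefix adder. The function name and the absence of a carry input are assumptions of this sketch; it models the prefix computation on whole words rather than the gate-level final adder used in the proposed design.

```python
def kogge_stone_add(a, b, width):
    """Add two width-bit words with a Kogge-Stone prefix network (no carry-in).

    Returns (sum, carry_out).
    """
    mask = (1 << width) - 1
    g = a & b & mask      # generate: this bit position produces a carry
    p = (a ^ b) & mask    # propagate: this bit position passes a carry along
    p0 = p                # bitwise half sum, needed for the final XOR

    d = 1
    while d < width:
        # Combine each position with the group ending d positions below it.
        g = (g | (p & (g << d))) & mask
        p = p & (p << d)
        d <<= 1

    carries = (g << 1) & mask            # carry into bit i is the group generate of bits i-1..0
    carry_out = (g >> (width - 1)) & 1
    return (p0 ^ carries) & mask, carry_out


if __name__ == "__main__":
    for a in range(16):
        for b in range(16):
            s, cout = kogge_stone_add(a, b, 4)
            assert (cout << 4) | s == a + b
    print("4 bit Kogge-Stone model matches a + b")
```

The loop runs about log2(width) times, which reflects why a prefix adder has a shorter critical path than a ripple carry adder of the same width.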





Figure-6. Proposed 8 bit PPCA architecture.

Table-1. Synthesis report of 4 bit MAC architecture.

| Pre-layout synthesis report       |                        | [4]    | Proposed design |
|-----------------------------------|------------------------|--------|-----------------|
| Critical path delay analysis (ns) |                        | 1.85   | 1.33            |
| Area analysis (µm²)               | Combinational area     | 1308.6 | 940             |
|                                   | Non combinational area | 298.5  | 248.8           |
|                                   | Net interconnect area  | 64.5   | 37.3            |
|                                   | Total area             | 1671.8 | 1226.2          |
| Power analysis (µW)               | Total dynamic power    | 332.8  | 223.9           |
|                                   | Cell leakage power     | 8.56   | 6.23            |

#### 4. IMPLEMENTATION AND EXPERIMENT

In this section, the proposed Vedic MAC unit is implemented, analysed and compared with the existing parallel MAC architecture. The architecture is coded in a hardware description language (HDL) and simulated using Synopsys VCS and Modelsim 6.5. After the register transfer level (RTL) design was verified in VCS, the architectures were synthesized using Synopsys Design Compiler (DC) and gate level netlists were generated.

Table-1 and Table-2 present the RTL level (pre-layout) synthesis reports of the 4 bit and 8 bit MAC units, respectively. Pre-layout area, timing and power analysis reports are generated during compilation of the design in Synopsys DC.



Table-2. Synthesis report of 8 bit MAC architecture.

| Pre-layout synthesis report       |                        | [4]    | Proposed design |
|-----------------------------------|------------------------|--------|-----------------|
| Critical path delay analysis (ns) |                        | 2.56   | 2.4             |
| Area analysis (µm²)               | Combinational area     | 4178.5 | 3205.3          |
|                                   | Non combinational area | 597.2  | 398.13          |
|                                   | Net interconnect area  | 265.4  | 137.4           |
|                                   | Total area             | 4842.1 | 3939.8          |
| Power analysis (mW)               | Total dynamic power    | 1.35   | 1.26            |
|                                   | Cell leakage power     | 0.024  | 0.019           |

Table-3. Post-layout timing and power analysis.

| Architecture | Post-layout synthesis report | [4]   | Proposed design |
|--------------|------------------------------|-------|-----------------|
| 4 bit MAC    | Critical path delay (ns)     | 1.91  | 1.41            |
|              | Total power (µW)             | 368.8 | 245.6           |
| 8 bit MAC    | Critical path delay (ns)     | 2.94  | 2.63            |
|              | Total power (mW)             | 1.41  | 1.32            |

Total cell area is the sum of the sequential (non combinational), combinational and interconnect areas. An overall 25% reduction in total cell area is achieved by the proposed 4 bit Vedic parallel MAC, whereas for 8 bit it is about 19%. The total dynamic power of the circuit is the sum of cell internal power and net switching power; combining this dynamic power with the cell leakage power gives the total circuit power. An overall 27% reduction in total power is attained for the 4 bit architecture, whereas for 8 bit the reduction falls to 7% because of the increase in hardware resources. The critical path analysis shows a 28% improvement in delay for the 4 bit design and an 8% improvement for the 8 bit architecture.

The values recorded in Table-1 and Table-2 largely depend on the design constraints and clock constraints applied to the Synopsys DC tool. The netlist file and Standard Delay Format (SDF) file generated by this tool are used for post-layout analysis. Post-layout timing and power analysis is carried out using Synopsys PrimeTime (PT), with the help of the Standard Parasitic Exchange Format (SPEF) file, the SDF file and the gate level netlists. The SPEF file is generated by the Synopsys IC Compiler tool during the placement and routing stage. Table-3 presents the output produced by Synopsys PT. In general, post-layout analysis gives more precise values, since the design constraints file is taken into account. With these more accurate values, the improvement in critical path delay of the 8 bit proposed architecture rises to 10%.

#### 5. CONCLUSIONS

The Vedic parallel MAC unit designed in this paper carries out an efficient multiply and accumulate process, which forms the basis for digital image processing and signal processing applications. By using the Vedic encoder and the new PPCA architecture, the overall delay, area and power of the MAC unit are reduced, which in turn improves performance.

The proposed architecture is synthesized in a 90nm CMOS library. The synthesis reports from the Synopsys PT and DC tools show that the critical path delay of the proposed design is significantly reduced compared to the existing parallel MAC unit. Considerable improvement in power and area is also achieved by this architecture. As the number of input bits of the MAC unit increases, a remarkable improvement in hardware resources can be observed. We achieved an improvement of 20-30% and 7-18%, respectively, for the 4-bit and 8-bit Vedic MAC units in terms of total circuit power, critical path delay and cell area.

In future, the power dissipation of the proposed MAC unit can be further reduced by using transistor level power optimization techniques. In addition, a well-organized DSP processor can be realized using this Vedic parallel MAC unit.

#### REFERENCES

- [1] J. J. F. Cavanagh, Digital Computer Arithmetic. New York: McGraw-Hill, 1984.
- [2] O. L. MacSorley, "High speed arithmetic in binary computers," Proc. IRE, vol. 49, pp. 67-91, January 1961.
- [3] F. Elguibaly, "A fast parallel multiplier-accumulator using the modified Booth algorithm," IEEE Trans. Circuits Syst., vol. 27, no. 9, pp. 902-908, September 2000.
- [4] Young-Ho Seo and Dong-Wook Kim, "A New VLSI Architecture of Parallel Multiplier Accumulator Based on Radix-2 Modified Booth Algorithm", IEEE Trans. on VLSI Systems, Vol. 18 No. 2, February 2010.

- [5] S. B. K. Tirthaji, Vedic Mathematics or Sixteen Simple Sutras from the Vedas, Motilal Banarsidas, Varanasi (India), 1986.
- [6] H. S. Dhillon and A. Mitra, "A Digital Multiplier Architecture using Urdhva Tiryakbhyam Sutra of Vedic Mathematics", in Sixth International Conf. on Comput., Control and Instrumentation, IEEE 2007, pp. 76-80.
- [7] H. D. Tiwari, G. Gankhuyag, C. M. Kim and Y. B. Cho, "Multiplier design based on ancient indian vedic mathematics," in SoC Design Conf. 2008. ISOCC'08. International, 2008, pp. II-65-II-68.
- [8] Chip-Hong Chang, Jiangmin GU, and Mingyan Zhang, "Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits", IEEE Trans. Circuits Syst. vol. 51, no. 10, October 2004.
- [9] E.de Angel and E. E. Swartzlander, "Low power parallel multipliers," IEEE workshop on VLSI signal process. IX, October 1996, pp. 199-208.
- [10] Peiman Aliparast, Ziaddin Daie Koozehkanani, Abdolhamid Moallemi Khiavi, Ghader Karimian, Hossein Balazadeh Bahar, "A very high-speed CMOS 4-2 compressor using fully differential current-mode circuit techniques", Springer, Analog Integr Circ Sig Process. 66: 235-243, 2011.
- [11] E. Ladner and M. J. Fischer, "Parallel Prefix Computation", ACM J. vol. 27, No. 4, October 1980, 831-838.