## DEVELOPMENT of 4-BIT FASTER ALU BASED ON FPGA

Prof.Dr.Eng. Imad Hussain Al-Hussaini

Dr. Mohammed Najim Abdullah

Iraqi Commission for Computers & Informatics

**University of Technology** 

Falih Salih Alkhafaji\*\*

Ministry of Industry

## **Abstract**

This paper proposes a 4-bit fast Arithmetic Logic Unit (ALU) with two modes (arithmetic and logic) and 48 different operations, built around the Carry Lookahead Adder (CLA) in order to increase processing speed by reducing gate delay. The ripple carry chain in a Ripple Carry Adder (RCA) is a major bottleneck in adder design because every result bit depends on the carry from the preceding stage, so an RCA requires 2n gate delays to add two n-bit words. The proposed technique uses the CLA to solve this problem. CLA structures are among the fastest topologies for addition because they need only $2(\log_2 n + 1)$ gate delays. They replace the ripple carry chain with two signals, Propagate (P) and Generate (G), which are passed between the single-bit adders, so all result bits (Fi) are produced without waiting for a carry to ripple through the chain, and the adder circuit in the ALU becomes faster. Finally, the proposed design is simulated on a Xilinx XC4005E series FPGA, and the results are analyzed in the two modes to obtain the delay of the complete circuit.

## 1. Introduction

The adder cell is the elementary unit of an ALU. The ALU is an important part of the CPU that carries out arithmetic and logic operations on the operands: arithmetic operations such as addition, subtraction, and multiplication of integers, and logical operations such as AND, OR, NOT, XOR, and other Boolean operations.

In the RCA technique, doubling the number of input bits (n) doubles the gate delay, that is, the delay grows linearly. In the CLA technique, doubling n increases the gate delay only logarithmically. The CLA is therefore the fastest adder topology for improving an ALU [1,2].

## 2. Proposed Technique

### Carry Lookahead Adder (CLA)

The carry lookahead adder (CLA) is one of the fastest methods for addition because it generates the carry-in (Cin) of the various full-adder blocks in parallel, using additional circuitry called carry lookahead logic (CLL), which allows a higher clock rate. It expresses the carry-out (Cout) in two terms, carry generation (G) and carry propagation (P). G occurs when both input bits are logic 1; P occurs when either of the input bits is logic 1. In other words, the CLA removes the ripple carry chain effect of the RCA.

This process is highly parallel, so it can be done very fast. If the numbers to be added are n bits long, it takes $2(\log_2 n + 1)$ gate delays, much better than the 2n levels of logic required by ripple calculation in the RCA technique [3]. The technique therefore reduces addition time for large n by replacing the ripple carry chain (Cout) with two terms, propagation (P) and generation (G), as given in equation (2-1) for a full adder.

$$C_{out} = P \cdot C_{in} + G \tag{2-1}$$

The carry lookahead expressions for the Cout of a group of bits are generated using two-level logic and are then applied in a tree. The key to this is the use of the generate and propagate conditions already introduced, as shown in figure (1).

Generate: $G = A \cdot B$

Propagate: $P = A \oplus B$



Figure (1): P and G in a full adder
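
The generate and propagate definitions and equation (2-1) can be checked with a short behavioural model. The following is a minimal Python sketch, not the schematic used in the paper; the function name `full_adder_pg` is hypothetical, and propagate is assumed to be the exclusive OR of the two inputs.

```python
def full_adder_pg(a: int, b: int, cin: int):
    """Behavioural model of one full-adder stage (hypothetical helper).

    Returns (s, cout, g, p) for single-bit inputs a, b and cin.
    """
    g = a & b              # generate: both inputs are 1
    p = a ^ b              # propagate (assumed XOR form): exactly one input is 1
    s = p ^ cin            # sum bit
    cout = g | (p & cin)   # equation (2-1): Cout = P.Cin + G
    return s, cout, g, p


# Compare the model with ordinary integer addition for all 8 input combinations.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout, g, p = full_adder_pg(a, b, cin)
            assert 2 * cout + s == a + b + cin
```

Because G and P depend only on A and B, they are available for every bit position at the same time, before any carry has been computed.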

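For a 4-bit adder, the generate (G) and propagate (P) terms are:

$$G_i = A_i \cdot B_i, \qquad P_i = A_i \oplus B_i, \qquad i = 0, 1, 2, 3$$

while the carries (Cout) from the various stages are:

$$\begin{aligned}
C_0 &= G_0 + P_0 C_{in} \\
C_1 &= G_1 + P_1 C_0 \\
C_2 &= G_2 + P_2 C_1 \\
C_3 &= G_3 + P_3 C_2
\end{aligned}$$

Substituting C0 into the C1 equation leads to:

$$C_1 = G_1 + P_1 G_0 + P_1 P_0 C_{in}$$

The sum for the least significant stage is given by:

$$F_0 = P_0 \oplus C_{in}$$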
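
The 4-bit case can be illustrated with a behavioural sketch in Python. It is only a model under stated assumptions: the name `cla4` is hypothetical, propagate is taken as XOR, and `c0` to `c3` denote the carry out of stages 0 to 3, with each carry written as a flat two-level function of the G, P and Cin signals (the result of repeated substitution).

```python
def cla4(a: int, b: int, cin: int = 0):
    """Add two 4-bit values with lookahead carries; returns (sum, cout).

    Hypothetical behavioural model, not the paper's schematic.
    """
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]   # generate terms
    P = [x ^ y for x, y in zip(A, B)]   # propagate terms (assumed XOR form)

    # Every carry depends only on G, P and cin, never on another carry.
    c0 = G[0] | (P[0] & cin)
    c1 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & cin)
    c2 = (G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
          | (P[2] & P[1] & P[0] & cin))
    c3 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & cin))

    carry_in = [cin, c0, c1, c2]        # carry entering each stage
    bits = [P[i] ^ carry_in[i] for i in range(4)]
    total = sum(bit << i for i, bit in enumerate(bits))
    return total, c3


print(cla4(9, 7))      # 9 + 7 = 16, so sum bits are 0 with carry out 1
print(cla4(5, 3, 1))   # 5 + 3 + 1 = 9, no carry out
```

The number of product terms grows with each stage, which reflects the area cost of wider lookahead logic.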

In practice, it is not possible to use the CLA to realize constant delay for wider adders, since there will be a substantial loading capacitance and hence larger delay and larger power consumption. The CLA has the fastest-growing area requirements with respect to the bit size. Figure (2) shows the block diagram of an n-bit CLA circuit.

Since the input carry ripples only once to the output carry with this design, the n-bit CLA is as fast as the 1-bit RCA [3], [4], [6].



Figure (2): CLA architecture; delay independent of the number of bits

## 3. Xilinx Foundation Series with Schematic Program Entry

One of the most interesting aspects of FPGA technology is that the hardware implementation is determined entirely by its software description. Figure (3) shows the programming sequence of the Xilinx Foundation Series 3.1i CAD software package, which is used to synthesize and implement the architecture of the 4-bit faster ALU on the FPGA chip.



Figure (3): Programming Sequence of Xilinx Foundation Series 3.1i

The first step in implementing the ALU design is Design Entry. In the proposed design, Design Entry was done using schematic entry.

The second step is Design Synthesis, which translates the schematic into a circuit netlist. For this work, the design was targeted to a Xilinx XC4005E series FPGA. The resulting netlist is then used to produce the configuration bitstream that programs the FPGA device.

The third step in the design flow is Simulation, which catches design faults such as incorrect module annotation, problems in the data flow, and incomplete design descriptions.

The fourth step is Design Implementation. It is concerned with the exact selection of the circuit primitives and their placement and routing under given constraints. This step comprises several operations: it completes the hardware design, translates the gate-level design into the hardware primitives available in the XC4005E series FPGA, assigns the design to physical locations on the chip and routes the connections between them, produces timing information about the design, and determines the configuration bits that implement the design. Figure (4) illustrates the XC4005E design flow implementation.

Finally, the Floor Planner can be used to obtain more information about the design connectivity and resource requirements, the target FPGA resource layout, and the design mapping location constraints, as shown in figure (5) [5].



Figure (4): Flow Engine window for the FPGA target device XC4005E



Figure (5): CLBs with 10-pin connection of the 4-bit faster ALU

## 4. Development

The development in this technique is to reduce the time required for addition by breaking the ripple carry chain of the RCA and replacing it with a faster method, the Carry Lookahead Adder (CLA). This improves the computation speed of the ALU by reducing the gate delay from 2n to $2(\log_2 n + 1)$ for an n-bit adder. In other words, each doubling of the number of input bits (n) increases the delay by a constant amount of two gate delays.
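
The two delay formulas can be compared numerically. The sketch below is only an illustration of the formulas quoted above; the function names are hypothetical, and n is assumed to be a power of two so that the logarithm is an integer.

```python
import math


def rca_gate_delays(n: int) -> int:
    """Gate delays of an n-bit ripple carry adder: 2n."""
    return 2 * n


def cla_gate_delays(n: int) -> int:
    """Gate delays of an n-bit carry lookahead adder: 2(log2(n) + 1).

    Assumes n is a power of two.
    """
    return int(2 * (math.log2(n) + 1))


# Print the width and the two delay counts side by side.
for n in (4, 8, 16, 32, 64):
    print(f"n={n:2d}  RCA={rca_gate_delays(n):3d}  CLA={cla_gate_delays(n):2d}")
```

For n = 4 the formulas give 8 gate delays for the RCA and 6 for the CLA, and each doubling of n adds 2 gate delays to the CLA figure while doubling the RCA figure.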

### 4.1 Design Procedure

A 4-bit fast ALU has four stages, each consisting of three parts: (a) input multiplexers, (b) a full adder, and (c) output multiplexers. The ALU performs the following four arithmetic operations: ADD, SUBTRACT, INCREMENT, and DECREMENT. The four logical operations performed are EXOR, EXNOR, AND, and OR. The procedure for constructing the 4-bit fast ALU is as follows:

- 1. Design a 1-bit slice of the ALU using the CLA technique. The design steps for the 1-bit fast ALU are:
  - a. Construct a 1-bit adder using the CLA technique.
  - b. Construct the 48-1 MUXs that select the logic or arithmetic function for each operation.
  - c. Connect the adder circuit to the 48-1 MUXs as shown in figure (6), through six control bits (S3, S2, S1, S0, Cin, M), which allow 48 different operations to be performed on the input bits Ai and Bi.
  - d. Test and verify the functionality and timing characteristics of the 1-bit fast ALU.
- 2. Construct four 1-bit fast ALUs.
- 3. Construct the additional 4-bit carry lookahead logic (CLL) circuitry, as shown in figure (7).
- 4. Connect the four 1-bit fast ALUs to the 4-bit CLL through the (Pi) and (Gi) pins, as shown in figure (10); this can be done conveniently using the carry generate/propagate signals.
- 5. Test and verify the functionality and timing diagram of the complete circuit (4-bit fast ALU), as shown in figures (8) and (9).
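
The structure described in steps 2 to 5 can be mirrored in a small behavioural model: four slices that export Pi and Gi, one CLL block that computes all carries, and an exhaustive functional check. This is a sketch under assumptions, covering only the addition path; the names `slice_pg`, `cll`, `fast_add4` and `verify` are hypothetical, and the multiplexers and the other operations are not modelled.

```python
from itertools import product


def slice_pg(a_i: int, b_i: int):
    """One 1-bit slice: exports (P_i, G_i) to the lookahead logic."""
    return a_i ^ b_i, a_i & b_i


def cll(p, g, cin):
    """Carry lookahead logic: returns [cin, c0, c1, c2, c3].

    Each carry is a flat sum of products of P, G and cin.
    """
    carries = [cin]
    for i in range(4):
        c = g[i]
        prod = p[i]
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]      # term P_i ... P_(j+1) G_j
            prod &= p[j]
        c |= prod & cin           # term P_i ... P_0 Cin
        carries.append(c)
    return carries


def fast_add4(a: int, b: int, cin: int):
    """Four slices connected to the CLL; returns (F, Cout)."""
    pg = [slice_pg((a >> i) & 1, (b >> i) & 1) for i in range(4)]
    p = [t[0] for t in pg]
    g = [t[1] for t in pg]
    carries = cll(p, g, cin)
    f = sum((p[i] ^ carries[i]) << i for i in range(4))
    return f, carries[4]


def verify() -> int:
    """Check every (A, B, Cin) combination against integer addition."""
    checked = 0
    for a, b, cin in product(range(16), range(16), (0, 1)):
        f, cout = fast_add4(a, b, cin)
        assert (cout << 4) | f == a + b + cin
        checked += 1
    return checked


print(verify(), "input combinations checked")
```

The cases F = A + A and F = A + A + 1 shown in figures (8) and (9) correspond to `fast_add4(a, a, 0)` and `fast_add4(a, a, 1)` in this model.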



Figure (6): 1-bit Arithmetic Logic Unit



Figure (7): Carry Lookahead Adder CLL circuit



Figure (8): F = (A + A) arithmetic operation for the 4-bit fast ALU



Figure (9): F = (A + A + 1) arithmetic operation for the 4-bit fast ALU

### 4.2 Implementation

Figure (10) shows the input and output lines of the 4-bit faster ALU as placed on the FPGA schematic sheet.



Figure (10): The schematic sheet of the 4-bit faster ALU

## 5. FPGA Simulation Results and Design Verification

The synthesized circuit was prototyped on a Xilinx XC4005E series FPGA. The technique used in the proposed design to speed up the ALU is a second-level Carry Lookahead Adder (CLA) instead of the older RCA technique. The following parameters are taken from the FPGA synthesis reports of the 4-bit faster ALU:

- 1. The number of Configurable Logic Blocks (CLBs) used is 13 of the total 196, that is, 6% of the total area of the XC4000E series FPGA chip.
- 2. The number of 4-input look-up tables (LUTs) used is 21 of the total 392 (5%), and the number of 3-input LUTs used is 6 of the total 196 (3%).
- 3. The maximum pin delay is 5.690 ns and the maximum net delay is 5.696 ns.
- 4. The delay of an n-bit RCA is 2n gate delays, so the propagation delay of a 4-bit RCA is 8 gate delays.
- 5. The propagation delay of an n-bit CLA is $2(\log_2 n + 1)$ gate delays, so the propagation delay of the two-level 4-bit CLA is 6 gate delays.
- 6. The design of the 4-bit fast ALU requires 15 × 4 + 14 = 74 gates, not counting the (ZT) flag.
- 7. The proposed design, which uses the second-level CLA technique, is faster than the first-level one but requires more gates.
- 8. From the Post Layout Timing report, some important delay times between pairs of pins are:
  - Cin to F3 = 24.725 ns
  - Cin to Cout4 = 23.136 ns
  - A0 to F3 = 34.837 ns
  - B0 to F3 = 31.328 ns
  - A3 to F3 = 27.876 ns
- 9. From the Post Layout Timing report, the maximum pin-to-pin delay of the proposed design is 38.547 ns, between S1 and Cout4, and the minimum pin-to-pin delay is 21.139 ns, between Cin and F0.

## 6. Discussion

Figure (11) shows the delay measurements in nanoseconds from the (Bi) pins to (F3, Cout, Gout, Pout). The maximum delay to F3 occurs at pin B2 (33 ns) and the minimum at pin B3 (28 ns), so the difference between the maximum and minimum delays to F3 is 5 ns, a ratio of 13.8% to the maximum delay time.

The maximum delay to Cout occurs at pin B0 (37 ns) and the minimum at pin B1 (31 ns), so the difference between the maximum and minimum delays to Cout is 6 ns, a ratio of 16.6% to the maximum delay time.

Figure (12) shows the delay in nanoseconds from the (Ai) pins to (F3, Cout, Gout, Pout). The maximum delay to F3 occurs at pin A0 (35 ns) and the minimum at pin A3 (27.5 ns), so the difference between the maximum and minimum delays to F3 is 7.5 ns, a ratio of 20.27% to the maximum delay time.

The maximum delay to Cout occurs at pin A0 (37 ns) and the minimum at pins A1 and A2 (29.5 ns), so the difference between the maximum and minimum delays to Cout is 7.5 ns, a ratio of 20.27% to the maximum delay time.




Figure (11): Delay time in nanoseconds from (Bi) to (F3, Cout, Gout, Pout).



Figure (12): Delay time in nanoseconds from (Ai) to (F3, Cout, Gout, Pout).

## 7. Conclusions

This paper has presented the development of a 4-bit faster ALU using the Carry Lookahead Adder (CLA) technique to increase the speed of the ALU, and its implementation on a Xilinx XC4005E series FPGA using the Foundation Series software (F3.1i) to bring the simulation results as close as possible to real-time results.

The FPGA maintains the advantages of custom functionality while avoiding high development costs and retaining the ability to modify the design after production. The final design was specified using schematic entry only.

#### This work is related to:

- 1. The RCA technique is too slow for fast addition of large operands because its delay passes through all the carry bits. In a multi-bit RCA built from a set of full adders the carry must ripple through every stage, so doubling the number of input bits (n) doubles the gate delay (linear growth).
- 2. To reduce the time required for addition, the ripple carry chain must be broken and replaced with a faster method. One such method is the Carry Lookahead Adder (CLA), which improves computation speed by generating the (P) and (G) signals and computing the carries in parallel using carry lookahead logic (CLL), so the gate delay can be reduced to $2(\log_2 n + 1)$ for an n-bit adder.
- 3. The CLA is usually presented as a large and very complicated circuit that is quite difficult to understand. It can instead be explained through a sequence of transformation steps; at each stage there is a specific technical problem to overcome and a clear strategy for solving it.
- 4. In the RCA technique, an n-bit ALU can be designed by concatenating a number of 1-bit ALUs.
- 5. In the CLA technique, the gate delay is "independent" of the number of bits.
- 6. The number of CLBs used is 13 of the total 196, that is, 6% of the total area of the XC4000E series FPGA chip.
- 7. The number of equivalent gates used is 153.

## 8. References

- [1] A. Chandrakasan, W. Bowhill, F. Fox, "Design of High Performance Microprocessors Circuits", IEEE Press, 2000.
- [2] M. Morris Mano, "Computer System Architecture", Third Edition, New Jersey: Prentice-Hall International, Inc., 1999.
- [3] Fu-Chiung Cheng, Stephen H. Unger, "Self-Timed Carry-Lookahead Adders", IEEE Transactions on Computers, Vol. 49, No. 7, July 2000.
- [4] David J. Grant, Xiuling Wang, "4-bit CMOS Transmission Gate Adder Module", Department of Electrical & Computer Engineering, University of Waterloo, April 14, 2003.
- [5] Daniel C. Hoggar, "Leveraging RTL and Physical Synthesis Integration to Achieve Timing Closure in FPGAs", Mentor Graphics Corporation, October 2003.
- [6] Shahrzad Naraghi, "Reduced Swing Domino Techniques for Low Power and High Performance Arithmetic Circuits", University of Waterloo, Canada, 2004.

# Development of a 4-Bit Fast Arithmetic Logic Unit Using FPGA Technology

Dr. Mohammed Najim Abdullah, University of Technology

Prof. Dr. Eng. Imad Hussain Al-Hussaini
Iraqi Commission for Computers and Informatics

Falih Salih Alkhafaji, Al-Izz Company, Ministry of Industry and Minerals

## Abstract:

This paper proposes a new technique for the fast adder circuit, the CLA adder, as one of the techniques used in the proposed development of a 4-bit fast Arithmetic Logic Unit (ALU), reducing the gate time delay in order to make the ALU faster. The ripple carry chain in the RCA adder technique is one of the main factors that slow the production of results in the adder, because the final results depend on the preceding carry; this technique therefore takes a gate time delay of (2n) when adding two n-bit binary numbers. The proposed technique removes the ripple carry chain and produces the results directly without depending on it, which improves the operating speed of the adder circuit in the arithmetic logic unit, since the gate time delay of the addition equals $2(\log_2 n + 1)$. This improvement in speed results from converting the ripple carry chain into two signals, Propagate (P) and Generate (G), passed through the connections of a set of single-bit adders. The final part of the paper presents a simulation of the proposed design on an XC4005E series FPGA to obtain the results and then analyze them in order to calculate the delay time of the whole circuit.