I. Introduction

This prototype report describes the VLSI implementation of a Discrete Cosine Transform decoder. First, the report reviews the specifications for the chip and the design modifications made since the second interim report. It then covers each sub-cell of the block, giving the implementation details and results for each. At present, all sub-cells have been implemented in MAX and verified in both IRSIM and HSPICE.

II. Chip Overview

Overview of function

This chip will compute an approximation of the one-dimensional Inverse Discrete Cosine Transform (IDCT) on eight 10-bit signed integers. The operation of the chip is summarized as follows:
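As a reference for checking the chip's approximate outputs, a floating-point model of the one-dimensional 8-point IDCT is useful. The sketch below computes it directly from the definition. It assumes the orthonormal (DCT-III) scaling, which this report does not specify, and the function names (idct_1d, dct_1d, fits_10bit_signed) are hypothetical.

```python
import math

N = 8  # the chip transforms eight values at a time


def _scale(k):
    # Orthonormal scale factors: sqrt(1/N) for the DC term, sqrt(2/N) otherwise.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def idct_1d(coeffs):
    """Floating-point 8-point 1-D IDCT (orthonormal DCT-III), straight from the definition."""
    if len(coeffs) != N:
        raise ValueError("expected 8 coefficients")
    return [
        sum(_scale(k) * X * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k, X in enumerate(coeffs))
        for n in range(N)
    ]


def dct_1d(samples):
    """Matching orthonormal forward DCT (DCT-II), used only to check the round trip."""
    if len(samples) != N:
        raise ValueError("expected 8 samples")
    return [
        _scale(k) * sum(x * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                        for n, x in enumerate(samples))
        for k in range(N)
    ]


def fits_10bit_signed(values):
    """True if every value fits in a 10-bit two's complement word (-512..511)."""
    return all(-512 <= v <= 511 for v in values)


if __name__ == "__main__":
    coeffs = [100, -20, 0, 5, 0, 0, 0, 0]
    assert fits_10bit_signed(coeffs)
    print([round(v, 2) for v in idct_1d(coeffs)])
```

The chip's fixed-point outputs can then be compared against these values with a tolerance; since the report does not state the accuracy of its approximation, the tolerance is a choice left to the tester.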



Overview of design

An updated block diagram for the circuit is shown in Figure 1.

The main modification to the block diagram since the last report is the elimination of the second register file by pipelining the datapath. The original block diagram required two register files, because each iteration of the algorithm computes a new 8-integer array from the values stored in another 8-integer array. This design was not ideal because (1) a register file consumes more area than any other sub-cell on the chip, so using two register files nearly doubled the size of the circuit, and (2) eight clock cycles were required between iterations of the algorithm to transfer the contents of the second register file back to the first, which was inefficient.

Fortunately, we noticed a clever reordering of the array computations in the algorithm that eliminates the need to retain all of the items of the old array. Specifically, our reordering allows a given item in the "old" array to be overwritten with its "new" value after four clock cycles. Therefore, we can eliminate the second register file and instead insert three register stages in the datapath. Since the setup time of the registers is about the same as the propagation delay of the shifters, this approach will probably not allow a faster clock rate, which is the usual benefit of pipelining. However, eliminating the second register file gives us other advantages:



Figure 1: Updated block diagram of IDCT system





III. Sub-cell Design and Implementation

Shifter

Description: Our shifter has four modes of operation: shift left by 1 bit, shift right by 1 bit (arithmetic), copy input to output, and output all 0s. The shifter is in effect a 4:1 multiplexer that, for each output bit, chooses between the left neighbor bit, the right neighbor bit, the current bit, and ground, depending on the control signals. It is implemented as a bit-sliced array in which pass transistors connect each output to the left bit, the right bit, the current bit, or ground.
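The four modes can be summarized in a small behavioral model, which is handy for generating expected values for the IRSIM and HSPICE tests. This Python sketch assumes the FUNC encoding used in the IRSIM command file of figure 5 (00 equal, 01 zero, 10 left shift, 11 arithmetic right shift); the function and constant names are hypothetical.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)

# FUNC codes (f1 f2) as used in the IRSIM test file for the shifter.
EQ, ZERO, LSH, RSH = 0b00, 0b01, 0b10, 0b11


def shifter(x, func):
    """Behavioral model of the 10-bit shifter; x is the input bit pattern (bit 9 = sign)."""
    x &= MASK
    if func == EQ:    # every output bit takes its own input bit
        return x
    if func == ZERO:  # every output bit is pulled to ground
        return 0
    if func == LSH:   # each bit takes its lower neighbour; bit 0 receives 0
        return (x << 1) & MASK
    if func == RSH:   # each bit takes its upper neighbour; bit 9 keeps the sign
        return (x >> 1) | (x & SIGN)
    raise ValueError("func must be a 2-bit code")


if __name__ == "__main__":
    for pattern in ("0010100011", "1011100010"):
        x = int(pattern, 2)
        for name, code in (("equal", EQ), ("zero", ZERO), ("lsh", LSH), ("rsh", RSH)):
            print(name, format(shifter(x, code), "010b"))
```

Running it on the two IRSIM input patterns reproduces the outputs listed in the IRSIM results.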

Input Signals:

Output Signals:

Area: 82.44 x 14.5 micron

Cell Design: Complementary PMOS/NMOS pass-transistor pairs (transmission gates) multiplex between the left bit input, the right bit input, and the current bit input. A single NMOS transistor implements the 0 function by pulling the output to ground. This arrangement requires 7 control signals: lsh, lshnot, rsh, rshnot, eq, eqnot, and zero. Standard static CMOS logic converts the FUNC input into these 7 control signals, using the equations:

Figure 2 shows the transistor-level schematic for one bit-sliced cell, while figure 3 shows the MAX layout for a one-bit cell. Finally, figure 4 shows the full 10-bit shifter layout, including the CMOS circuitry that generates the control signals from the FUNC input.
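As an illustration, one set of decode equations consistent with the FUNC encoding in the IRSIM command file (00 equal, 01 zero, 10 left shift, 11 right shift) is sketched below in Python. It may differ in form from the equations actually used in the layout, and decode_func is a hypothetical name. Because the output of each bit slice is driven through pass transistors, the usage code checks that exactly one path conducts for every FUNC value.

```python
def decode_func(f1, f2):
    """Seven shifter control signals for FUNC = (f1, f2).

    Encoding inferred from the IRSIM test file: 00 eq, 01 zero, 10 lsh, 11 rsh.
    """
    f1, f2 = bool(f1), bool(f2)
    eq = not f1 and not f2
    zero = not f1 and f2
    lsh = f1 and not f2
    rsh = f1 and f2
    return {
        "lsh": lsh, "lshnot": not lsh,
        "rsh": rsh, "rshnot": not rsh,
        "eq": eq, "eqnot": not eq,
        "zero": zero,  # drives the single NMOS pull-down, so no complement is needed
    }


if __name__ == "__main__":
    for f1 in (0, 1):
        for f2 in (0, 1):
            s = decode_func(f1, f2)
            # Exactly one path to the output may conduct, or two drivers would fight.
            active = [name for name in ("eq", "zero", "lsh", "rsh") if s[name]]
            print(f"{f1}{f2}", active)
```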


Figure 2: Transistor-level schematic of 1-bit shifter

Figure 3: Layout of 1-bit shifter




Figure 4: Layout of full 10-bit shifter, with control circuitry



IRSIM simulation: Figures 5 and 6 show the IRSIM command file and simulation results, respectively, for the shifter. The command file tests all four functions on two sample 10-bit inputs. As the simulation results show, the shifter performs correctly.


vector in i9 i8 i7 i6 i5 i4 i3 i2 i1 i0
vector out o9 o8 o7 o6 o5 o4 o3 o2 o1 o0
vector func f1 f2

w in out func

set in 0010100011
set func 00
print equal func, out should = in
s 1000
set func 01
print zero func, out should = 0
s 1000
set func 10
print lsh func, out should = in left shift 1
s 1000
set func 11
print rsh func, out should = in arithmetic right shift 1
s 1000

set in 1011100010
set func 00
print equal func, out should = in
s 1000
set func 01
print zero func, out should = 0
s 1000
set func 10
print lsh func, out should = in left shift 1
s 1000
set func 11
print rsh func, out should = in arithmetic right shift 1
s 1000

exit  
Figure 5: IRSIM command file for 10-bit shifter


equal func, out should = in
func=00 out=0010100011 in=0010100011
time = 1000.00ns
zero func, out should = 0
func=01 out=0000000000 in=0010100011
time = 2000.00ns
lsh func, out should = in left shift 1
func=10 out=0101000110 in=0010100011
time = 3000.00ns
rsh func, out should = in arithmetic right shift 1
func=11 out=0001010001 in=0010100011
time = 4000.00ns
equal func, out should = in
func=00 out=1011100010 in=1011100010
time = 5000.00ns
zero func, out should = 0
func=01 out=0000000000 in=1011100010
time = 6000.00ns
lsh func, out should = in left shift 1
func=10 out=0111000100 in=1011100010
time = 7000.00ns
rsh func, out should = in arithmetic right shift 1
func=11 out=1101110001 in=1011100010
time = 8000.00ns  
Figure 6: IRSIM simulation results for 10-bit shifter



HSPICE simulation: Figures 7 and 8 (figure 8 is located in Appendix A) show the HSPICE command file and simulation results, respectively, for the shifter. The command file performs the same operations as the IRSIM command file described above, and the HSPICE results confirm that the shifter works as expected. The worst-case delay is 948 ps. It occurs when the control signals switch to left shift, right shift, or equal and the input bit changes to 1, so that the output node must be charged high.


VDD 3.3
CLK 10.0
RISE 0.5
FALL 0.5
i0 11110000
i1 11111111
i2 00000000
i3 00000000
i4 00000000
i5 11111111
i6 00001111
i7 11111111
i8 00000000
i9 00001111
f1 00110011
f2 01010101 
Figure 7: HSPICE command file for the 10-bit shifter





Adder/Subtractor

Description: Our adder/subtractor performs 2's complement addition and subtraction on 10-bit signed integers. We used a simple ripple-carry adder, since more advanced types of adders would not provide much of a performance boost with only 10-bit arithmetic.

Input Signals:

Output Signals:

Area: 87.47 x 21.3 micron

Cell Design: We used the static CMOS mirror adder presented in Rabaey p. 391; we refer the reader to this reference for a transistor-level diagram of the circuit. To allow subtraction, the B input is XORed with FUNC, and FUNC also drives the carry-in of the adder. When FUNC is 1, this inverts B and adds 1, which adds the 2's complement of B (that is, computes A - B); when FUNC is 0, the unit performs normal addition. Figure 9 shows the layout of a 1-bit full adder/subtractor, and figure 10 shows the layout of the whole 10-bit unit.
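The add/subtract scheme can be checked bit by bit in software. The Python sketch below models the 10-bit ripple-carry chain with each B bit XORed with FUNC and FUNC fed into the carry-in. It is a behavioral illustration, not a model of the mirror-adder transistors, and the function names are hypothetical; the test values are those from the IRSIM command file.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1


def full_adder(a, b, cin):
    """One bit of the adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add_sub(a, b, func):
    """10-bit ripple-carry adder/subtractor: a + b when func = 0, a - b when func = 1."""
    carry = func                      # FUNC drives the carry-in
    result = 0
    for i in range(WIDTH):            # the carry ripples from bit 0 up to bit 9
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ func    # each B bit is XORed with FUNC
        s, carry = full_adder(ai, bi, carry)
        result |= s << i
    return result                     # the carry out of bit 9 is discarded


def to_signed(x):
    """Interpret a 10-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x


if __name__ == "__main__":
    r = add_sub(0b0011001000, 0b0100101100, 1)   # 200 - 300
    print(format(r, "010b"), to_signed(r))       # 1110011100 -100
```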




Figure 9: Layout of 1-bit adder/subtractor



Figure 10: Layout of full 10-bit adder/subtractor



IRSIM simulation: Figures 11 and 12 show the IRSIM command file and simulation results, respectively, for the adder/subtractor. The command file tests the addition and subtraction operations on several sample integers. The results show that the module performs correctly.

vector as a9 a8 a7 a6 a5 a4 a3 a2 a1 a0
vector bs b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
vector os o9 o8 o7 o6 o5 o4 o3 o2 o1 o0

w as bs func funcnot os

print 300 + 200 should equal 500
print (0100101100 + 0011001000 should equal 0111110100)
l func
h funcnot
set as 0100101100
set bs 0011001000
s 1000

print 300 - 200 should equal 100
print (0100101100 - 0011001000 should equal 0001100100)
h func
l funcnot
set as 0100101100
set bs 0011001000
s 1000

print 300 - -200 should equal 500
print (0100101100 - 1100111000 should equal 0111110100)
h func
l funcnot
set as 0100101100
set bs 1100111000
s 1000

print 200 - 300 should equal -100
print (0011001000 - 0100101100 should equal 1110011100)
h func
l funcnot
set as 0011001000
set bs 0100101100
s 1000

print 23 + 456 should equal 479
print (0000010111 + 0111001000 should equal 0111011111
l func
h funcnot
set as 0000010111
set bs 0111001000
s 1000
Figure 11: IRSIM command file for 10-bit adder/subtractor


300 + 200 should equal 500
(0100101100 + 0011001000 should equal 0111110100)
os=0111110100 bs=0011001000 as=0100101100 funcnot=1 func=0
time = 1000.00ns
300 - 200 should equal 100
(0100101100 - 0011001000 should equal 0001100100)
os=0001100100 bs=0011001000 as=0100101100 funcnot=0 func=1
time = 2000.00ns
300 - -200 should equal 500
(0100101100 - 1100111000 should equal 0111110100)
os=0111110100 bs=1100111000 as=0100101100 funcnot=0 func=1
time = 3000.00ns
200 - 300 should equal -100
(0011001000 - 0100101100 should equal 1110011100)
os=1110011100 bs=0100101100 as=0011001000 funcnot=0 func=1
time = 4000.00ns
23 + 456 should equal 479
(0000010111 + 0111001000 should equal 0111011111
os=0111011111 bs=0111001000 as=0000010111 funcnot=1 func=0
time = 5000.00ns        
Figure 12: IRSIM simulation results for 10-bit adder/subtractor

HSPICE simulation: Figures 13 and 14 (figure 14 is located in appendix A) show the HSPICE command file and simulation results, respectively, for the adder/subtractor. The command file performs the same operations as the IRSIM command file described above. The HSPICE results confirm that the adder/subtractor does work as expected.

The worst-case delay for a ripple-carry adder occurs when a carry entering the least significant bit must ripple all the way to the most significant bit. In our case, 1 - 1 is such a worst case: B is inverted to 1111111110 and the carry-in is 1, so the carry propagates through all ten bits. Using HSPICE, we measured this worst-case delay as 3.88 ns.
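To see why 1 - 1 exercises the full carry chain, the sketch below counts how many bit positions a single carry ripples through for a given pair of operands. The definition of "ripple length" used here (a carry that starts at the carry-in or at a generating bit and continues through propagating bits) is an illustrative choice, and the function name is hypothetical; the absolute delay still has to come from HSPICE.

```python
WIDTH = 10


def longest_carry_ripple(a, b, func):
    """Length (in bit positions) of the longest carry ripple for a + b or a - b.

    A carry starts at the carry-in (func = 1 for subtraction) or where a bit
    generates one (a_i = b_i = 1), and keeps rippling while each following
    bit propagates it (a_i XOR b_i = 1).
    """
    mask = (1 << WIDTH) - 1
    b_eff = (b ^ (mask if func else 0)) & mask   # subtraction inverts B
    carry = func                                  # subtraction sets carry-in to 1
    run = 0
    longest = 0
    for i in range(WIDTH):
        ai = (a >> i) & 1
        bi = (b_eff >> i) & 1
        generate = ai & bi
        propagate = ai ^ bi
        if propagate and carry:
            run += 1            # the incoming carry passes through this bit
        elif generate:
            run = 1             # a new carry starts at this bit
        else:
            run = 0             # no carry leaves this bit
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    print(longest_carry_ripple(1, 1, 1))  # 10: the carry-in ripples through every bit
```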

VDD 3.3
CLK 10.0
RISE 0.5
FALL 0.5
a9 0000000
a8 1110000
a7 0001000
a6 0001000
a5 1110000
a4 0000100
a3 1111000
a2 1110100
a1 0000100
a0 0000111
b9 0010011
b8 0011111
b7 1100111
b6 1100111
b5 0011011
b4 0010011
b3 1111111
b2 0001011
b1 0000011
b0 0000011
func 0111001
funcnot 1000110 
Figure 13: HSPICE command file for the 10-bit adder/subtractor





Register File

Description: Our register file consists of eight 10-bit registers. On a given clock cycle, any one register may be written and any two registers may be read. All registers may be cleared to 0 using an asynchronous active-low reset. The register file uses the positive-edge-triggered D flip-flop developed in Lab #3, with some modifications, including an output enable and two output ports.
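The intended behavior of the register file can be captured in a short behavioral model, for example to produce expected values for the IRSIM test. In this Python sketch the class and method names are hypothetical, the write enable is assumed to be active-high (as the IRSIM command file intends with "set wenb 1"), and the asynchronous reset is modeled as an immediate clear.

```python
class RegisterFile:
    """Behavioral sketch of the 8 x 10-bit register file (names are hypothetical)."""

    NUM_REGS = 8
    MASK = (1 << 10) - 1

    def __init__(self):
        self.regs = [0] * self.NUM_REGS

    def reset(self):
        """Asynchronous active-low reset asserted: clear every register at once."""
        self.regs = [0] * self.NUM_REGS

    def rising_edge(self, wenb, waddr, din):
        """Positive clock edge: write din into register waddr if write enable is high."""
        if wenb:
            self.regs[waddr] = din & self.MASK

    def read(self, radda, raddb):
        """Two independent read ports; both may select the same register."""
        return self.regs[radda], self.regs[raddb]


if __name__ == "__main__":
    # Write sequence from the IRSIM command file, followed by the paired reads.
    values = [0b1110100101, 0b1000001010, 0b0111100001, 0b0101010000,
              0b0011111101, 0b0010010010, 0b0100100100, 0b1000001000]
    rf = RegisterFile()
    rf.reset()
    for addr, v in enumerate(values):
        rf.rising_edge(1, addr, v)
    rf.rising_edge(0, 0, 0b1000001000)   # write enable low: nothing changes
    for a in range(0, 8, 2):
        print([format(q, "010b") for q in rf.read(a, a + 1)])
```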

Input Signals:

Output Signals:

Area: 199.6 x 174.9 micron

Cell Design: The basic building block of the register file is the D flip-flop developed in Lab #3, a positive-edge-triggered D flip-flop with active-low reset. It is built from two D latches in a master-slave arrangement. Since we use D latches rather than JK latches, the flip-flop does not suffer from the input glitch problem that occurs with JK latches in master-slave arrangements. The reader is referred to our Lab #3 report for more details about the basic D flip-flop. We modified the D flip-flop to have two output ports, each of which can be enabled or tri-stated by a control input, and we added a write enable input. In both cases, NMOS/PMOS pass transistors implement the new functionality, requiring 6 additional transistors in total. Figure 15 shows the updated transistor diagram for our D flip-flop, while figure 16 shows the MAX layout.

The register file is essentially an 8x10 array of D flip-flops. The enable control signals for each row are tied together, as are the output signals and the data inputs for each column. The clocks and resets of all D flip-flops are connected. Finally, three 3:8 decoders generate the two read enable signals and the write enable signal for each row of the register file. Each decoder is a simple static CMOS circuit built from inverters and NAND gates. Figure 17 shows the MAX layout of the register file: the 8x10 array of D flip-flops is clearly visible, and the decoders occupy the portion on the left.
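The row decoders can be modeled at the gate level: each output is the AND of the three true or complemented address bits that match its row, built here as a 3-input NAND followed by an inverter. Whether the layout uses exactly this NAND-plus-inverter arrangement for every row is an assumption of this Python sketch, and the helper names are hypothetical.

```python
def inv(x):
    """Static CMOS inverter on a 0/1 signal."""
    return 1 - x


def nand(*inputs):
    """Static CMOS NAND gate: 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def decode_3to8(a2, a1, a0):
    """Eight active-high row enables for the 3-bit address a2 a1 a0.

    Row i ANDs the true or complemented address bits that match i, built as a
    3-input NAND followed by an inverter.
    """
    bits = (a2, a1, a0)
    comps = tuple(inv(b) for b in bits)          # the three shared address inverters
    rows = []
    for i in range(8):
        want = ((i >> 2) & 1, (i >> 1) & 1, i & 1)
        literals = [bits[j] if want[j] else comps[j] for j in range(3)]
        rows.append(inv(nand(*literals)))
    return rows


if __name__ == "__main__":
    # One such decoder each serves radda, raddb and the write address.
    for addr in range(8):
        print(format(addr, "03b"), decode_3to8((addr >> 2) & 1, (addr >> 1) & 1, addr & 1))
```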


Figure 15: Schematic of D flip-flop




Figure 16: Layout of D flip-flop




Figure 17: Layout of register file



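IRSIM simulation: Figures 18 and 19 show the IRSIM command file and simulation results, respectively, for the register file. The command file performs a sequence of writes to different registers in the register file, and then reads out the contents of the registers. In the results, registers 1 through 7 read back the values written to them. Two points in the log do not match the intended test, however: the command "set wenb 1" on line 25 of the command file produced the IRSIM error "wenb: No such vector" (wenb is a single node, so it would be driven high with "h wenb"), so wenb remains 0 throughout the run; and register 0 reads back as 0000000000 on port a instead of the expected 1110100101.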




vector din d9 d8 d7 d6 d5 d4 d3 d2 d1 d0
vector qa qa9 qa8 qa7 qa6 qa5 qa4 qa3 qa2 qa1 qa0
vector qb qb9 qb8 qb7 qb6 qb5 qb4 qb3 qb2 qb1 qb0
vector radda ra2 ra1 ra0
vector raddb rb2 rb1 rb0
vector w w2 w1 w0
w din qa qb radda raddb w reset clk wenb

l reset
l clk
l wenb
set din 0000000000
set w 000
set radda 000
set raddb 000

stepsize 1000
clock clk 0 1
c
c
c
h reset
print all registers should be cleared
c
set wenb 1
print write 1110100101 to register 0
set din 1110100101
set w 000
c
print write 1000001010 to register 1
set din 1000001010
set w 001
c 
print write 0111100001 to register 2
set din 0111100001
set w 010
c
print write 0101010000 to register 3
set din 0101010000
set w 011
c
print write 0011111101 to register 4
set din 0011111101
set w 100
c
print write 0010010010 to register 5
set din 0010010010
set w 101
c
print write 0100100100 to register 6
set din 0100100100
set w 110
c
print write 1000001000 to register 7
set din 1000001000
set w 111
c
print don't do anything (test wenb)
l wenb
set w 000
c
print output reg0 on port a, reg1 on port b
print should be a= 1110100101, b= 1000001010
set radda 000
set raddb 001
c
print output reg2 on port a, reg3 on port b
print should be a= 0111100001, b= 0101010000
set radda 010
set raddb 011
c
print output reg4 on port a, reg5 on port b
print should be a= 0011111101, b= 0010010010
set radda 100
set raddb 101
c
print output reg6 on port a, reg7 on port b
print should be a= 0100100100, b= 1000001000
set radda 110
set raddb 111
c   
Figure 18: IRSIM command file for register file

w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0000000000 wenb=0 
clk=1 reset=0 
time = 2000.00ns
w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0000000000 wenb=0 
clk=1 reset=0 
time = 4000.00ns
w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0000000000 wenb=0 
clk=1 reset=0 
time = 6000.00ns
all registers should be cleared 
w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0000000000 wenb=0 
clk=1 reset=1 
time = 8000.00ns
(regfile2.com,25): wenb: No such vector
write 1110100101 to register 0 
w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=1110100101 wenb=0 
clk=1 reset=1 
time = 10000.00ns
write 1000001010 to register 1 
w=001 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=1000001010 wenb=0 
clk=1 reset=1 
time = 12000.00ns
write 0111100001 to register 2 
w=010 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0111100001 wenb=0 
clk=1 reset=1 
time = 14000.00ns
write 0101010000 to register 3 
w=011 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0101010000 wenb=0 
clk=1 reset=1 
time = 16000.00ns
write 0011111101 to register 4 
w=100 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0011111101 wenb=0 
clk=1 reset=1 
time = 18000.00ns
write 0010010010 to register 5 
w=101 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0010010010 wenb=0 
clk=1 reset=1 
time = 20000.00ns
write 0100100100 to register 6 
w=110 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=0100100100 wenb=0 
clk=1 reset=1 
time = 22000.00ns
write 1000001000 to register 7 
w=111 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=1000001000 wenb=0 
clk=1 reset=1 
time = 24000.00ns
don't do anything (test wenb) 
w=000 raddb=000 radda=000 qb=0000000000 qa=0000000000 din=1000001000 wenb=0 
clk=1 reset=1 
time = 26000.00ns
output reg0 on port a, reg1 on port b 
should be a= 1110100101, b= 1000001010 
w=000 raddb=001 radda=000 qb=1000001010 qa=0000000000 din=1000001000 wenb=0 
clk=1 reset=1 
time = 28000.00ns
output reg2 on port a, reg3 on port b 
should be a= 0111100001, b= 0101010000 
w=000 raddb=011 radda=010 qb=0101010000 qa=0111100001 din=1000001000 wenb=0 
clk=1 reset=1 
time = 30000.00ns
output reg4 on port a, reg5 on port b 
should be a= 0011111101, b= 0010010010 
w=000 raddb=101 radda=100 qb=0010010010 qa=0011111101 din=1000001000 wenb=0 
clk=1 reset=1 
time = 32000.00ns
output reg6 on port a, reg7 on port b 
should be a= 0100100100, b= 1000001000 
w=000 raddb=111 radda=110 qb=1000001000 qa=0100100100 din=1000001000 wenb=0 
clk=1 reset=1 
time = 34000.00ns

Figure 19: IRSIM simulation results for register file


HSPICE simulation: Figures 20 and 21 (figure 21 is located in Appendix A) show the HSPICE command file and simulation results, respectively, for the register file. The command file performs approximately the same operations as the IRSIM command file described above, simplified to allow for faster simulation. In the HSPICE command file, registers 0 through 7 are loaded with the same values as in the IRSIM command file, and the registers are then read out on ports a and b in the same order as in the IRSIM test. The results are correct and can be verified by comparing the outputs to the values written to the registers (shown in figures 18 and 19). Note: To promote readability, only the output signals are shown in the NST graph.

The worst-case delay for the register file is 2.76 ns, the time it takes a 1 to propagate to the output after the read address lines change.

VDD 3.3
CLK 10.0
RISE 0.5
FALL 0.5
d9 001100000110000
d8 001011001000000
d7 001010110000000
d6 000011100000000 
d5 001010101000000 
d4 000001110000000 
d3 000100100110000 
d2 001000101000000
d1 000100010000000
d0 001010100000000
ra2 000000000000011
ra1 000000000000101
ra0 000000000000000
rb2 000000000000011
rb1 000000000000101
rb0 000000000001111
w2 000000111100000
w1 000011001100000
w0 000101010100000
reset 001111111111111
wenb 001111111100000
Figure 20: HSPICE command file for the register file







Control Unit

Description: The control unit is a finite state machine which generates the control signals, instructing the various sub-cells to interact correctly. The control unit is composed of a counter and a ROM. The control unit can be reset to its initial state by pulling its RESET input low. As long as RESET is high and clock pulses are provided, the control unit moves through the words of the ROM, outputting one per cycle. Since the IDCT requires the same steps in the same order regardless of input, there is no input to the finite state machine. The machine always moves through its states in the same order.
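The sequencing behavior described above (no data inputs, one ROM word per clock, return to the first word on reset) amounts to a counter addressing a ROM. The Python sketch below models it with placeholder ROM contents; the class name is hypothetical, and the wrap-around of the 6-bit counter at 64 is an assumption rather than a detail taken from the design.

```python
class ControlUnit:
    """Sketch of the control FSM: a 6-bit address counter driving a ROM (names hypothetical)."""

    ADDR_BITS = 6

    def __init__(self, rom_words):
        if len(rom_words) > (1 << self.ADDR_BITS):
            raise ValueError("microcode does not fit in a 6-bit address space")
        self.rom = list(rom_words)
        self.addr = 0

    def reset(self):
        """RESET pulled low: return to the first microcode word."""
        self.addr = 0

    def clock(self):
        """One clock pulse with RESET high: advance to the next word (wraps at 64)."""
        self.addr = (self.addr + 1) % (1 << self.ADDR_BITS)

    def output(self):
        """Control word for the current state (addresses past the last word are not modeled)."""
        return self.rom[self.addr]


if __name__ == "__main__":
    rom = list(range(49))   # placeholder contents standing in for the 49 microcode words
    cu = ControlUnit(rom)
    cu.reset()
    trace = [cu.output()]
    for _ in range(48):
        cu.clock()
        trace.append(cu.output())
    print(trace == rom)     # True: every word is visited once, in order
```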

Input Signals:

Output Signals:

Area: 65.7 x 111.2 micron total (sub-cells: address incrementer 41.39 x 11.69 micron; address register 65.7 x 21.14 micron; ROM 48.94 x 76.33 micron)

Cell Design: The first step in designing the control unit was to write the microcode needed by the circuit. Figure 22 (located in Appendix A) shows the microcode in binary format, with human-readable comments on the right. Storing the microcode requires a 49-word ROM with 18-bit words. We implemented a pseudo-NMOS NOR ROM similar to the one presented in Rabaey pp. 562-563, and for the ROM's address decoder we used a NOR decoder similar to the one presented in Rabaey p. 592. We refer the reader to these pages for transistor schematics and layouts of the individual cells. We implemented pseudo-NMOS versions of both the ROM and the decoder instead of dynamic logic, meaning that the gate of each PMOS pull-up transistor is permanently grounded. Although this is non-ideal from a power perspective (a static path from VDD to ground exists whenever an output is low), we chose it for the sake of simplicity.
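Because each microcode word drives all pipeline stages at once, a cycle-level model is a convenient way to check a microcode sequence before programming the ROM. The Python sketch below takes the field order from the header of Figure 22 and follows the transfer notation in its comments (Rf[RFAWA] <- R2, R2 <- shifted R1, R1 <- R0a +/- R0b, R0a and R0b <- shifted register-file reads). It assumes that the SA, SB and SC codes use the shifter's FUNC encoding, that MUXSEL = 0 selects the chip input, that "x" bits read as 0 (as they appear in the IRSIM trace), and that a register-file read sees the contents from before that cycle's write. All names are hypothetical. The example shows a value read in one cycle being written back three cycles later.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1
FIELDS = ["muxsel", "rfald", "rfawa", "rfara0", "rfara1", "sa", "sb", "sc", "addmd", "ready"]


def shift(x, code):
    """Shifter FUNC encoding: 0 pass, 1 zero, 2 left by 1, 3 arithmetic right by 1."""
    if code == 0:
        return x
    if code == 1:
        return 0
    if code == 2:
        return (x << 1) & MASK
    return (x >> 1) | (x & (1 << (WIDTH - 1)))


def parse_word(line):
    """Turn one Figure 22 line into named integer fields; comments dropped, 'x' read as 0."""
    tokens = line.split("#")[0].split()
    return {name: int(tok.replace("x", "0"), 2) for name, tok in zip(FIELDS, tokens)}


class Datapath:
    """Register file plus the pipeline stages R0a/R0b -> R1 -> R2 (hypothetical model)."""

    def __init__(self):
        self.rf = [0] * 8
        self.r0a = self.r0b = self.r1 = self.r2 = 0

    def step(self, w, chip_input=0):
        """One clock edge: every stage loads from the values held before the edge."""
        wdata = self.r2 if w["muxsel"] else chip_input & MASK   # MUXSEL = 0 selects the chip input
        new_r2 = shift(self.r1, w["sc"])
        total = self.r0a - self.r0b if w["addmd"] else self.r0a + self.r0b
        new_r1 = total & MASK
        new_r0a = shift(self.rf[w["rfara0"]], w["sa"])          # reads see pre-write contents
        new_r0b = shift(self.rf[w["rfara1"]], w["sb"])
        if w["rfald"]:
            self.rf[w["rfawa"]] = wdata
        self.r0a, self.r0b, self.r1, self.r2 = new_r0a, new_r0b, new_r1, new_r2


if __name__ == "__main__":
    dp = Datapath()
    dp.rf[0], dp.rf[1] = 6, 4
    program = [
        "1 0 000 000 001 10 11 00 0 0",  # R0a <- Rf0*2, R0b <- Rf1/2
        "1 0 000 000 000 00 00 00 1 0",  # R1 <- R0a-R0b
        "1 0 000 000 000 00 00 11 0 0",  # R2 <- R1/2
        "1 1 011 000 000 00 00 00 0 0",  # Rf3 <- R2
    ]
    for line in program:
        dp.step(parse_word(line))
    print(dp.rf[3])  # (6*2 - 4/2) / 2 = 5
```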

The ROM was programmed by hand using MAX, by deleting contacts and transistors whenever a 1 was desired. The ROM layout in MAX is shown in figure 23. The left portion of the ROM is the address decoder, and the right portion is the ROM itself.
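The programming rule (remove the transistor wherever a 1 is wanted) can be expressed as a mapping from microcode words to transistor positions. This Python sketch is illustrative and uses hypothetical function names; it numbers bit columns from the left, matching the order in which the microcode is written.

```python
def nor_rom_transistor_map(words, width):
    """For each word, the bit columns (0 = leftmost) that keep a pull-down transistor.

    A selected word line turns on every transistor in its row; a column with a
    transistor is pulled low and reads 0, while a column whose transistor was
    removed stays high through the pseudo-NMOS pull-up and reads 1.
    """
    return [[col for col in range(width) if not (word >> (width - 1 - col)) & 1]
            for word in words]


def read_nor_rom(transistor_map, addr, width):
    """Read one word back from the transistor map: 1 wherever the row has no transistor."""
    present = set(transistor_map[addr])
    return int("".join("0" if col in present else "1" for col in range(width)), 2)


if __name__ == "__main__":
    # Two 18-bit control words from the IRSIM trace of the control unit (addresses 0 and 5).
    words = [0b000100000000000000, 0b011100000000110000]
    tmap = nor_rom_transistor_map(words, 18)
    for addr, word in enumerate(words):
        print(addr, "keep transistors in columns", tmap[addr])
        assert read_nor_rom(tmap, addr, 18) == word
```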

Since we require 49 words in the ROM, we need 6-bit addresses. The 6-bit counter used in the control unit is based on the 4-bit counter implemented for our Lab #3 assignment. It consists of an increment unit, which contains 6 bit-sliced half-adders, and a 6-bit register similar to the ones used in the register file. We refer the reader to our Lab #3 report for more information about the design of the counter.
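The incrementer can be sketched as a chain of six half-adders with the constant 1 entering as the carry into bit 0. This Python version is a behavioral illustration with hypothetical names; the wrap-around from 63 to 0 follows from dropping the final carry and is not discussed in the report. It also confirms that 49 words need 6 address bits.

```python
ADDR_BITS = (49 - 1).bit_length()   # 49 ROM words need addresses 0..48, i.e. 6 bits


def half_adder(a, b):
    """Returns (sum, carry)."""
    return a ^ b, a & b


def increment(addr, width=ADDR_BITS):
    """Add 1 using a chain of half-adders; the constant 1 enters as the carry into bit 0."""
    carry = 1
    result = 0
    for i in range(width):
        s, carry = half_adder((addr >> i) & 1, carry)
        result |= s << i
    return result  # the carry out of the top bit is dropped, so 63 wraps to 0


if __name__ == "__main__":
    print(ADDR_BITS)                           # 6
    print(format(increment(0b000111), "06b"))  # 001000
```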

Figure 24 shows the full layout of the control unit.






Figure 23: Layout of ROM






Figure 24: Layout of control unit



IRSIM simulation: Figures 25 and 26 show the IRSIM command file and simulation results, respectively, for the control unit. The command file is simply a sequence of clock pulses that takes the control module through all 49 of its states. The outputs have been verified as correct by comparing them to the microcode shown in figure 22.


vector add a5 a4 a3 a2 a1 a0
vector out o0 o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16 o17

w add out

stepsize 1000
clock clk 0 1
l reset
c
c

h reset
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c                              
c
c
c
c
c
c
c
c
c
c
c
c   
Figure 25: IRSIM command file for the control module




out=000100000000000000 add=000000 
time = 2000.00ns
out=000100000000000000 add=000000 
time = 4000.00ns
out=001000000000000000 add=000001 
time = 6000.00ns
out=010000000000000000 add=000010 
time = 8000.00ns
out=011000000000000000 add=000011 
time = 10000.00ns
out=000000000000000000 add=000100 
time = 12000.00ns
out=011100000000110000 add=000101 
time = 14000.00ns
out=010100100100110000 add=000110 
time = 16000.00ns
out=001110110000111100 add=000111 
time = 18000.00ns
out=100001101010111110 add=001000 
time = 20000.00ns
out=100101001110111010 add=001001 
time = 22000.00ns
out=101111011100000000 add=001010 
time = 24000.00ns
out=111011111000000000 add=001011 
time = 26000.00ns
out=110010010100111010 add=001100 
time = 28000.00ns
out=110100000100001000 add=001101 
time = 30000.00ns
out=111100100000001000 add=001110 
time = 32000.00ns
out=100101100000011010 add=001111 
time = 34000.00ns
out=100011011100001000 add=010000 
time = 36000.00ns
out=101011011100000010 add=010001 
time = 38000.00ns
out=111000100000011100 add=010010 
time = 40000.00ns
out=100110010100001100 add=010011 
time = 42000.00ns
out=111110010100010000 add=010100 
time = 44000.00ns
out=101000000000011110 add=010101 
time = 46000.00ns
out=100001000000011100 add=010110 
time = 48000.00ns
out=101100000000010000 add=010111 
time = 50000.00ns
out=101000100000010000 add=011000 
time = 52000.00ns
out=110111011100000000 add=011001 
time = 54000.00ns
out=100110101100000000 add=011010 
time = 56000.00ns
out=101010101100001110 add=011011 
time = 58000.00ns
out=111111011100001100 add=011100 
time = 60000.00ns
out=100001010000001110 add=011101 
time = 62000.00ns
out=110001010000001100 add=011110 
time = 64000.00ns
out=111101001100001110 add=011111 
time = 66000.00ns
out=100001001100001100 add=100000 
time = 68000.00ns
out=110000000100001110 add=100001 
time = 70000.00ns
out=100000000100001110 add=100010 
time = 72000.00ns
out=100111011100001100 add=100011 
time = 74000.00ns
out=111111011100001100 add=100100 
time = 76000.00ns
out=101110010100001110 add=100101 
time = 78000.00ns
out=110110010100001110 add=100110 
time = 80000.00ns
out=111000000000011100 add=100111 
time = 82000.00ns
out=101100100000011100 add=101000 
time = 84000.00ns
out=110001000000010001 add=101001 
time = 86000.00ns
out=100001100000010000 add=101010 
time = 88000.00ns
out=100010000000010000 add=101011 
time = 90000.00ns
out=100010100000010000 add=101100 
time = 92000.00ns
out=100011000000010000 add=101101 
time = 94000.00ns
out=100011100000010000 add=101110 
time = 96000.00ns
out=100000000000000000 add=101111 
time = 98000.00ns
out=100000000000000000 add=110000 
time = 100000.00ns
Figure 26: IRSIM simulation results for control module



HSPICE simulation: The HSPICE command file and simulation results for the control module are shown in figures 27 and 28 (figure 28 is located in Appendix A), respectively. The command file simply takes the control unit through its first 10 states, showing the first 10 control outputs. The results have been verified by comparing them to the microcode shown in figure 22.

The maximum propagation delay for the ROM was found to be 684.21 ps, corresponding to a high-to-low transition on a ROM output bit. We also noted that the output swing is reduced: the low level is 1.0 V, while the high level is the full 3.3 V. This is because pseudo-NMOS is a form of ratioed logic, in which the always-on pull-up fights the pull-down network. A 1.0 V low level is acceptable within the ROM, but to avoid noise problems we will add a buffer stage when we connect the ROM outputs to the rest of the circuit.

VDD 3.3
CLK 10.0
RISE 0.5
FALL 0.5
clk 010101010101010101010101010101
reset 0011111111111111111111
Figure 27: HSPICE command file for control module





Multiplexer

Description: Our block diagram calls for a multiplexer, to allow the register file to be written to by either the chip inputs or by the final register of the pipeline. However, an explicit multiplexer is not really needed. The registers have enable inputs, so we can simply insert a pass transistor tri-state buffer on the chip input signals. The multiplexer is then replaced by simply turning on either the enable for the register or the enable for the tri-state buffer.
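The effect of replacing the multiplexer with tri-state drivers can be illustrated by resolving a shared bus in software. In this Python sketch the names are hypothetical, a driver in high impedance is represented by None, and MUXSEL = 0 is assumed to select the chip inputs (as during the input-loading words of the microcode). The sketch raises an error on bus contention, which is the condition the control signals must never produce.

```python
def resolve_bus(drivers):
    """Resolve a shared bus driven by tri-state outputs.

    drivers maps a driver name to its value, or to None when that driver is in
    high impedance. At most one driver may be enabled at a time.
    """
    active = {name: v for name, v in drivers.items() if v is not None}
    if len(active) > 1:
        raise RuntimeError(f"bus contention between {sorted(active)}")
    if not active:
        return None  # floating bus: nothing is driving it
    return next(iter(active.values()))


def rf_write_data(muxsel, chip_input, pipeline_out):
    """Stand-in for the 2:1 multiplexer: enable exactly one of the two drivers."""
    return resolve_bus({
        "input_buffer": None if muxsel else chip_input,  # tri-state buffer on chip inputs
        "pipeline": pipeline_out if muxsel else None,    # enable on the final pipeline register
    })


if __name__ == "__main__":
    print(rf_write_data(0, 0b0000010111, 0b0111001000))  # chip input is written
    print(rf_write_data(1, 0b0000010111, 0b0111001000))  # pipeline output is written
```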

Appendix A: Oversize Figures


Figure 8: HSPICE simulation results for the 10-bit shifter


Figure 14: HSPICE simulation results for the adder/subtractor


Figure 21: HSPICE simulation results for the register file

# Field order (left to right): MUXSEL RFALD RFAWA RFARA0 RFARA1 SA SB SC ADDMD READY

#### load data from input (8 cycles)
0 1 001 xxx xxx xx xx xx x 0
0 1 010 xxx xxx xx xx xx x 0 
0 1 100 xxx xxx xx xx xx x 0 
0 1 110 xxx xxx xx xx xx x 0 
0 1 000 xxx xxx xx xx xx x 0
0 1 111 000 000 00 11 xx x 0         # R0a <- Rf0, R0b <- Rf0/2
0 1 101 001 001 00 11 xx 0 0         # R0a <- Rf1, R0b <- Rf1/2, R1 <- R0a+R0b

0 1 011 101 100 00 11 11 0 0 # s2[3] # R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf5, R0b <- Rf4/2
1 1 000 011 010 10 11 11 1 0 # s2[6] # Rf0 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf3*2, R0b <- Rf2/2 
1 1 001 010 011 10 11 10 1 0 # s2[4] # Rf1 <- R2, R2 <- R1*2, R1 <- R0a-R0b, R0a <- Rf2*2, R0b <- Rf3/2
1 1 011 110 111 00 00 00 0 0 # s2[5] # Rf3 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 110 111 110 00 00 00 0 0 # s2[7] # Rf6 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf7, R0b <- Rf6
1 1 100 100 101 00 11 10 1 0 # s2[1] # Rf4 <- R2, R2 <- R1*2, R1 <- R0a-R0b, R0a <- Rf4, R0b <- Rf5/2
1 1 101 000 001 00 00 10 0 0 # s2[0] # Rf5 <- R2, R2 <- R1*2, R1 <- R0a+R0b, R0a <- Rf0, R0b <- Rf1
1 1 111 001 000 00 00 10 0 0 # s2[2] # Rf7 <- R2, R2 <- R1*2, R1 <- R0a+R0b, R0a <- Rf1, R0b <- Rf0

1 1 001 011 xxx 00 01 10 1 0 # s3[6] # Rf1 <- R2, R2 <- R1*2, R1 <- R0a-R0b, R0a <- Rf3, R0b <- 0
1 1 000 110 111 00 00 10 x 0 # s3[1] # Rf0 <- R2, R2 <- R1*2, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 010 110 111 00 00 00 1 0 # s3[7] # Rf2 <- R2, R2 <- R1, R1 <- R0a-R0b, R0a <- Rf6, R0b <- Rf7
1 1 110 001 xxx 00 01 11 0 0 # s3[4] # Rf6 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf1, R0b <- 0
1 1 001 100 101 00 00 11 x 0 # s3[0] # Rf1 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf4, R0b <- Rf5
1 1 111 100 101 00 01 00 0 0 # s3[3] # Rf7 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf4, R0b <- Rf5
1 1 010 000 xxx 00 01 11 1 0 # s3[2] # Rf2 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf0, R0b <- 0
1 1 000 010 xxx 00 01 11 x 0 # s3[5] # Rf0 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf2, R0b <- 0

1 1 011 000 xxx 00 01 00 x 0 # s5[1] # Rf3 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf0, R0b <- 0
1 1 010 001 xxx 00 01 00 x 0 # s5[5] # Rf2 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf1, R0b <- 0
1 1 101 110 111 00 00 00 x 0 # s5[3] # Rf5 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 001 101 011 00 00 00 0 0 # s5[6] # Rf1 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf5, R0b <- Rf3
1 1 010 101 011 00 00 11 1 0 # s5[2] # Rf5 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf5, R0b <- Rf3
1 1 111 110 111 00 00 11 0 0 # s5[7] # Rf3 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 000 010 100 00 00 11 1 0 # s5[0] # Rf6 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf2, R0b <- Rf4
1 1 100 010 100 00 00 11 0 0 # s5[4] # Rf2 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf2, R0b <- Rf4

1 1 111 010 011 00 00 11 1 0 # s6[2] # Rf7 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf2, R0b <- Rf3
1 1 000 010 011 00 00 11 0 0 # s6[1] # Rf0 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf2, R0b <- Rf3
1 1 100 000 001 00 00 11 1 0 # s6[7] # Rf4 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf0, R0b <- Rf1
1 1 000 000 001 00 00 11 1 0 # s6[0] # Rf2 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf0, R0b <- Rf1
1 1 001 110 111 00 00 11 0 0 # s6[5] # Rf1 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 111 110 111 00 00 11 0 0 # s6[6] # Rf7 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf6, R0b <- Rf7
1 1 011 100 101 00 00 11 1 0 # s6[3] # Rf0 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf4, R0b <- Rf5
1 1 101 100 101 00 00 11 1 0 # s6[4] # Rf5 <- R2, R2 <- R1/2, R1 <- R0a-R0b, R0a <- Rf4, R0b <- Rf5

1 1 110 000 xxx 00 01 11 0 0 #       # Rf6 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf0, R0b <- 0  
1 1 011 001 xxx 00 01 11 x 0 #       # Rf3 <- R2, R2 <- R1/2, R1 <- R0a+R0b, R0a <- Rf1, R0b <- 0  
1 1 100 010 xxx 00 01 00 x 1 #       # Rf4 <- R2, R2 <- R1, R1 <- R0a+R0b, R0a <- Rf2, R0b <- 0  
1 0 000 011 xxx 00 01 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b, R0a <- Rf3, R0b <- 0  
1 0 000 100 xxx 00 01 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b, R0a <- Rf4, R0b <- 0  
1 0 000 101 xxx 00 01 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b, R0a <- Rf5, R0b <- 0  
1 0 000 110 xxx 00 01 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b, R0a <- Rf6, R0b <- 0  
1 0 000 111 xxx 00 01 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b, R0a <- Rf7, R0b <- 0  
1 0 000 xxx xxx xx xx 00 x 0 #       # R2 <- R1, R1 <- R0a+R0b
1 0 000 xxx xxx xx xx 00 x 0 #       # R2 <- R1

Figure 22: Microcode for the control unit


Figure 28: HSPICE simulation results for control module