Gemini Documentation & Lab 2

Indiana University Computer Science Department

B443/543 Computer Architecture Spring 1998

Lab Assignment #2:

This lab assignment will teach you how to microprogram a simple 
microprocessor.  The Gemini microprocessor is intentionally 
designed to have as few performance enhancements as possible.  In 
fact, I believe I have managed to leave all of them out, although 
I could be mistaken. We will use Gemini for the pipelining and 
hazards lab exercise (Assignment #3) coming up.

Gemini is designed to use two BASIC Stamp II chips communicating 
via serial links. This does two things:
	- divides the Gemini simulator into control and data parts
	
	- prepares you for multi-chip simulations of pipelined, VLIW,
	  and superscalar processors, and the 16-node Stamp-based
	  hypercube projects coming later in the semester

The Gemini documentation covers these topics:

	- instruction set architecture (programmer's model)

	- system architecture (microprogrammer's model)

	- datapath (ALU, registers, and memory) and controller
	  (microprogram, microaddress select, and control signals)

	- timing diagrams for latches and memory

	- example microcode for typical instruction

	- explanation of microcode, with diagrams to show
	  how it controls parts of Gemini

	- recommended division of system architecture for simulation

Your assignment is to simulate Gemini using two Stamps, first 
using the example microcode to get your simulator working, then 
writing the rest of the microcode to get the other instructions 
working, and finishing by writing and running the Gemini 
instructions that correspond to this program:

	for x = 1 to 10
	  y := y + x
	next x
	end

You will have questions as you work through this assignment 
because I cannot possibly anticipate the various ways you will 
approach it.  Don't panic!  Just post your questions to the 
newsgroup, and I will reply to them there, so that everybody can 
receive the benefit of the answers.

Gemini Documentation

Programmer's Model

Gemini is an accumulator-based processor with a single bus 
connecting the ALU, the accumulator (ACC) and program counter (PC) 
registers, and main memory.  The bus is used for both address and 
data.  The accumulator is the target of the load, arithmetic and 
logic operations.  It is the source for the store operation. It is 
not affected by branching.  A constant ZERO is used by the load 
instruction, which places the value from memory into the 
accumulator by adding ZERO plus the value into the accumulator. 
ZERO is otherwise inaccessible to the programmer. To clear the 
accumulator to zero, if desired, the programmer must load the 
value zero from memory.

From the programmer's view, Gemini instructions use a 
demultiplexer and a multiplexer to control the target of the 
accumulator, and to select the operand for the machine instruction 
(ZERO for load, the accumulator for arithmetic and logic).

Instruction Set

Gemini has eight instructions:

	- LDA m		Load the accumulator from a memory location
	- STA m		Store the accumulator into a memory location
	- ADD m		Add the value in memory to the accumulator
	- ADC m		Add the value in memory plus the carry to the 
				     accumulator
	- AND m		Logical "and" of memory and accumulator
	- OR  m		Logical "or" of memory and accumulator
	- NOTA			Logical "not" of accumulator
	- BRC m		Branch to memory location if carry set (i.e., 
				     carry = 1)

The instructions have two formats:

	- two bytes (LDA, STA, ADD, ADC, AND, OR, BRC)
	

	- one byte (NOTA)
	

The reserved field in the first byte of the instruction would 
typically be used to extend the Gemini architecture with more 
instructions and/or addressing modes. The temptation of an 
architect to use these bits for unsupported instructions, and the 
ingenuity of users who hunted for and used the unsupported 
instructions, was the bane of early architects, who would drop the 
unsupported instructions only to have the users complain bitterly.

Addressing Modes

Gemini has a single direct-addressing memory mode.  Other 
addressing modes can be constructed by modifying the program in 
memory. This is known as self-modifying code, and was the only way 
to construct many programming constructs in the early days of 
computer architecture that we now take for granted.

Other addressing modes can be encoded in the reserved field, and 
implemented in the microcode.

Memory

Gemini has an eight-bit memory address field.  Thus all code and 
data must fit into a 256 byte main memory.

Microprogrammer's Model

At the microprogrammer's level the programmer's model changes to a 
multiple register machine with the functions of the multiplexer 
and demultiplexer handled by outputing data onto the single bus, 
and latching it into a target register.

The microcontroller and its connection to the ALU, registers, and 
memory are visible.  The architect must write the microcode that 
will cause the eight machine instructions to function correctly.  
As we will see, the Gemini instruction set can be changed by 
revising or adding to the microprogram.

Datapath

The Gemini datapath necessarily includes the ALU, registers, and 
memory because all share a common bus.  Performance improvements 
start by separating the functions common to the bus to parallelize 
the operation of the processor. (This is why the MAR and MDR are 
included in the internal registers, even though they are not 
needed in this most primitive model.) For example, selecting a 
location in memory and reading or writing memory data could be 
done in parallel during a single clock cycle if:

	- the common bus was split into an internal CPU bus and
	  an external memory bus,

	- the external memory bus was in turn split into an address
	  bus and a data bus, and

	- the MAR was connected to the address bus, and the MDR to
	  the data bus

Some additional control, basically some multiplexers and select 
signals, would be needed to specifiy the direction of data 
transfer on the memory data bus, but the increase in performance 
because of parallel operation is worth the cost in complexity and 
floorspace.

ALU

The ALU performs four arithmetic and logic operations:

	- add
	- and
	- or
	- not

The instruction set and all "support" operations, such as 
incrementing the PC, are constructed from these four operations by 
choosing the source of the ALU inputs and the destination of the 
accumulator (ACC) output. Some supporting operations, such as the 
use of the memory address register to select a location in memory, 
do not involve the ALU.

Registers

Gemini has ten internal registers that are implemented as latches, 
three in the ALU group (A, B, and ACC) and seven in the register 
group (ZERO, ONE, PC, MAR, MDR, TEMP, and IR). Their intended 
purpose is listed below:

	- A:	ALU input, target of ACC, TEMP, MAR, MDR, and PC
	- B:	ALU input, target of ZERO or ONE
	- ACC:	ALU output

	- ZERO:constant zero, used for LDA instruction
	- ONE:	constant one, used to increment PC and MAR
	- PC:	program counter
	- MAR:	memory address register
	- MDR:	memory data register
	- TEMP:temporary ACC storage while updating PC and MAR
	- IR:	instruction register, the source of the starting
			address for each machine instruction's microprogram

Each register is "addressed" directly by output (.out) and latch 
enable (.latch) control bits in the current microinstruction. Some 
registers' outputs are always enabled, such as the A and B ALU 
inputs, and the instruction register (IR). In these cases there is 
no output control bit in the microinstruction. As mentioned 
previously, the unoptimized version of Gemini does not need the 
MAR and MDR.  They are included for future use (hint, hint :-)

Memory

The 256 byte main memory has internal address and data latches, 
and address decoding. Since we are not designing the memory, we 
will assume an interface protocol that uses the memory address 
register (MAR) to select the memory location, WRITE/not_read and 
DATA/not_address signals to activate the selected location, and 
the memory data register to either provide the data for a store or 
receive the data during a load.

The operation of the memory interface is summarized in the 
following table:

Microprogrammed Controller

Microcode

The microcode lies at the heart of the microprogrammed controller. 
Control signals for each microinstruction are generated using the 
clock, ø, either for the level-triggered latches in the datapath, 
or the edge-triggered latch in the microcontroller that holds the 
currently executing microinstruction.

The microcode (also known as the microprogram) is a 33-bit wide 
array of 256 microinstructions. It is not at all unusual to have a 
microinstruction width that is not a power of two because there is 
no guarantee that the number of control signals will be a power of 
two, and because floorspace on the CPU is so limited that it would 
be foolish to implement unused bits.

The organization of Gemini and other typical microprograms is 
given in the table below. Is START the beginning or the end of a 
µinstruction sequence?  Is the glass half empty or half full?  :-)

Microinstruction Format

The microinstruction is composed of four groups of control 
signals:

	- ALU group
	- Register group
	- Memory group
	- µController group

The signals in each group cause a latch to put data onto a bus 
(.out), capture data from the bus (.latch), select a function or 
control a multiplexer, or choose the next microinstruction to 
execute. Most control signals are inactive during microinstruction 
execution, that is, most bits in the microprogram are zeros. 
Active signals are conditioned by the clock, ø, to produce the 
output and latch signals.



The format of the Gemini microinstruction is called "horizontal 
microcode" because all control signals are represented by a bit in 
the microinstruction. "Vertical microcode" decodes all control 
signals from a microinstruction (useful if many microinstructions 
are repeated in machine instructions).  "Diagonal microcode" 
encodes some fields in the microinstruction but not others, a 
compromise between space and time.  In the Gemini microinstruction 
the ALU functions could be encoded as a pair of bits.

A microinstruction specifies a limited form of parallelism. As 
many events can occur in parallel as there are functional units 
and buses to process and transmit data without conflict. A good 
microprogrammer searches for ways to reorganize the machine 
instruction's microcode so as to extract as much parallelism as 
possible from the microprogrammer's model.  Experience in 
microprogramming can identify the bottlenecks in instruction 
execution that can be overcome by hardware.  Pipelining is the 
easiest of these to see.

It is possible to output data from two sources simultaneously, but 
because some of the same data bits will be zero and one, damaging 
short circuits will almost certainly occur. Sending data to two 
destinations at the same time is permissible if the source of the 
data is powerful enough (i.e., its "fan-out" is large enough).

Timing: Where Microcode Meets Hardware

The timing diagram below shows the relationship between the clock, 
ø, and the signals that feedback to control the microcontroller, 
or that are sent from the microcontroller to control the datapath.

Clock Generator

The system clock provides the "heartbeat" of the microcontroller. 
The clock speed is determined by the time required for data to 
flow through the longest path in the CPU, or by the time required 
to execute the the most time-consuming ALU operation. This kind of 
design is called "synchronous" because all functions are governed 
by a central clock.  Many instructions are slowed down because the 
clock cycle is the maximum necessary in the worst case.  Another 
type of design called "asynchronous" permits the CPU to operate as 
fast as possible for each operation.  Asynchronous operation is 
similar to dataflow processing, but like it, adds complexity to 
the CPU.

Generating the Next MicroPC

First we will examine the operation of the microcontroller.  The 
most significant function it performs is selecting the start of a 
sequence of microinstructions that implement the function of a 
machine instruction, then sequencing through the microinstructions 
until the machine instruction is complete.

Instruction Register (IR)

The instruction register (IR) is the source of the starting 
address for a sequence of microinstructions. However, since there 
are only three bits in the machine instruction's opcode to specify 
the µaddress, it is necessary to translate the 1-of-8 opcodes to 
one of the 256 µcode addresses.

MicroPC Map Table

The microPC map table translates the opcode in the IR to a 
µinstruction address. Note on the timing diagram that the IR, 
whose output is always enabled, is translated as soon as the data 
is followed by the latch, but is only guaranteed to be valid after 
the data is latched.  This is still more than enough time for the 
next microinstruction to be enabled out of the µcode, and be 
present at the current µinst latch when the rising edge of the 
clock, ø, signals the end of the executing microinstruction by 
latching the next one.

Action of Carry Flag

The BRC machine instruction uses the technique of "bit forcing" or 
"bit jamming" to select between one of two µinstruction addresses 
depending on the state of the carry.  This is accomplished by 
wiring the carry into the least significant bit of the microPC map 
table, but ONLY for the entry that translates the BRC opcode into 
the BRC microinstruction address.  Thus, when the carry is zero 
the first of two adjacent microinstructions is executed.  When the 
carry is one, the second of the two adjacent microinstructions is 
executed. Since every microinstruction includes the address of the 
next µinstruction to be executed, it is easy for the 
microprogrammer to vector one of the microinstructions to the 
appropriate sequence to cause the machine program to branch or to 
fall through to the next instruction.

Selecting the Next µAddress from the µInstruction

As just mentioned, each microinstruction contains the address of 
the next microinstruction to execute.  This is equivalent to 
having a counter register hardwired into the microcontroller. 
While it is space-inefficient, this is also a very fast technique.  
Where the program space is small, as it is in a microprogram, and 
speed is necessary, this is a good design choice.  Larger µcode 
and more complex sequencing in µcode may require a microPC 
implemented as a register, with its own set of µcontroller status 
flags.

Current Microinstruction Latch

The current µinstruction is held in an edge-triggered latch so 
that all control bits are stable throughout the clock cycle. This 
is necessary so that the next microinstruction address will be 
valid when the clock rises at the end of the cycle, latching the 
next microinstruction (see the timing diagram shown earlier).

Generating Control Signals

Most control signals are "and"ed with the clock, ø, to produce 
output/follow and latch signals (see the timing diagram).  Some 
signals are unconditioned, such as the µPC.sel, which is thus 
given as much time as possible to "steer" the correct 
microinstruction address into the microcode ROM.

ALU Timing

The ALU is one of the critical paths in Gemini, determining the 
maximum clock rate by the sum of the time needed to output and 
latch the second input (usually to the B register), compute the 
function of the B register and the previously latched A register, 
and latch the output in the accumulator.  If the ALU operates too 
slowly, then a single operation that performs all these actions 
can be split into two, and the clock rate increased, perhaps even 
doubled. Since there are many more data transfer operations than 
ALU operations, the overall result is a much faster CPU.

Register and Memory Timing

External memory is up to ten times slower than internal memory, 
largely due to the resistance and capacitance of the I/O pads on a 
VLSI circuit.  The Gemini architecture, as used in this 
assignment, ignores differences in register and memory timing, 
assuming that both are not only connected to the same bus, but are 
also on the same chip.

Example Microcode

Drawing on the organization of a typical microprogram given 
previously, an example microprogram with RESET, part of the 
PROLOG, and a machine instruction µcode sequence for the "LDA m" 
instruction is given.

The microcode is illustrated with dataflow diagrams to "animate" 
the operation of the system architecture. The best way to 
understand the microcode, and learn to write your own, is to step 
through the example line-by-line, comparing each line to the 
dataflow diagram to see how output control signals and latch 
control signals are used to route data from source to destination.

For dataflow diagrams where the timing is critical, such as 
decoding the opcode in the IR into a microinstruction, compare the 
dataflow diagram to the timing diagram.  You could, if you want to 
calculate the maximum clock speed for the Gemini, look up the 
timing specification of 74LS373 and 74LS374 latches, and the 
AM2901 Register and ALU slice to get timing data.

Dataflow Diagram Animations of "LDA m" Execution

The processor is reset (exactly how we will not consider), and 
executes the machine instruction at location 0x00 in the Gemini's 
main memory.  THE FOLLOWING ANIMATION ASSUMES THAT THE INSTRUCTION 
IS "LDA m", BUT IT COULD BE ANY INSTRUCTION!

			The critical datapath occurs here,
			when the ONE constant is passed
			through the B latch to be added
			with carry into the accumulator.

				The key step in decoding occurs here,
				when the IR selects an entry in the
				map table, which in turn selects the
				next microinstruction.


The "LDA m" machine-instruction
specific µcode starts here.

			The "LDA m" machine instruction is
			finished, using the point of view
			that the PROLOG must clean up after
			each instruction by incrementing
			the PC if necessary.  This explains
			the branch to START.

			The PROLOG is now finished, and the 
			next Gemini machine instruction's
			address is in the MAR, ready to be
			fetched to start the next sequence
			of µinstructions.