Indiana University Computer Science Department
B443/543 Computer Architecture Spring 1998
Lab Assignment #2:
This lab assignment will teach you how to microprogram a simple microprocessor. The Gemini microprocessor is intentionally designed to have as few performance enhancements as possible. In fact, I believe I have managed to leave all of them out, although I could be mistaken. We will use Gemini for the pipelining and hazards lab exercise (Assignment #3) coming up. Gemini is designed to use two BASIC Stamp II chips communicating via serial links. This does two things: - divides the Gemini simulator into control and data parts - prepares you for multi-chip simulations of pipelined, VLIW, and superscalar processors, and the 16-node Stamp-based hypercube projects coming later in the semester The Gemini documentation covers these topics: - instruction set architecture (programmer's model) - system architecture (microprogrammer's model) - datapath (ALU, registers, and memory) and controller (microprogram, microaddress select, and control signals) - timing diagrams for latches and memory - example microcode for typical instruction - explanation of microcode, with diagrams to show how it controls parts of Gemini - recommended division of system architecture for simulation Your assignment is to simulate Gemini using two Stamps, first using the example microcode to get your simulator working, then writing the rest of the microcode to get the other instructions working, and finishing by writing and running the Gemini instructions that correspond to this program: for x = 1 to 10 y := y + x next x end You will have questions as you work through this assignment because I cannot possibly anticipate the various ways you will approach it. Don't panic! Just post your questions to the newsgroup, and I will reply to them there, so that everybody can receive the benefit of the answers.
Gemini Documentation
Programmer's Model
Gemini is an accumulator-based processor with a single bus connecting the ALU, the accumulator (ACC) and program counter (PC) registers, and main memory. The bus is used for both address and data. The accumulator is the target of the load, arithmetic and logic operations. It is the source for the store operation. It is not affected by branching. A constant ZERO is used by the load instruction, which places the value from memory into the accumulator by adding ZERO plus the value into the accumulator. ZERO is otherwise inaccessible to the programmer. To clear the accumulator to zero, if desired, the programmer must load the value zero from memory. From the programmer's view, Gemini instructions use a demultiplexer and a multiplexer to control the target of the accumulator, and to select the operand for the machine instruction (ZERO for load, the accumulator for arithmetic and logic).Instruction Set
Gemini has eight instructions: - LDA m Load the accumulator from a memory location - STA m Store the accumulator into a memory location - ADD m Add the value in memory to the accumulator - ADC m Add the value in memory plus the carry to the accumulator - AND m Logical "and" of memory and accumulator - OR m Logical "or" of memory and accumulator - NOTA Logical "not" of accumulator - BRC m Branch to memory location if carry set (i.e., carry = 1) The instructions have two formats: - two bytes (LDA, STA, ADD, ADC, AND, OR, BRC) - one byte (NOTA) The reserved field in the first byte of the instruction would typically be used to extend the Gemini architecture with more instructions and/or addressing modes. The temptation of an architect to use these bits for unsupported instructions, and the ingenuity of users who hunted for and used the unsupported instructions, was the bane of early architects, who would drop the unsupported instructions only to have the users complain bitterly.Addressing Modes
Gemini has a single direct-addressing memory mode. Other addressing modes can be constructed by modifying the program in memory. This is known as self-modifying code, and was the only way to construct many programming constructs in the early days of computer architecture that we now take for granted. Other addressing modes can be encoded in the reserved field, and implemented in the microcode.Memory
Gemini has an eight-bit memory address field. Thus all code and data must fit into a 256 byte main memory.
Microprogrammer's Model
At the microprogrammer's level the programmer's model changes to a multiple register machine with the functions of the multiplexer and demultiplexer handled by outputing data onto the single bus, and latching it into a target register. The microcontroller and its connection to the ALU, registers, and memory are visible. The architect must write the microcode that will cause the eight machine instructions to function correctly. As we will see, the Gemini instruction set can be changed by revising or adding to the microprogram.Datapath
The Gemini datapath necessarily includes the ALU, registers, and memory because all share a common bus. Performance improvements start by separating the functions common to the bus to parallelize the operation of the processor. (This is why the MAR and MDR are included in the internal registers, even though they are not needed in this most primitive model.) For example, selecting a location in memory and reading or writing memory data could be done in parallel during a single clock cycle if: - the common bus was split into an internal CPU bus and an external memory bus, - the external memory bus was in turn split into an address bus and a data bus, and - the MAR was connected to the address bus, and the MDR to the data bus Some additional control, basically some multiplexers and select signals, would be needed to specifiy the direction of data transfer on the memory data bus, but the increase in performance because of parallel operation is worth the cost in complexity and floorspace.ALU
The ALU performs four arithmetic and logic operations: - add - and - or - not The instruction set and all "support" operations, such as incrementing the PC, are constructed from these four operations by choosing the source of the ALU inputs and the destination of the accumulator (ACC) output. Some supporting operations, such as the use of the memory address register to select a location in memory, do not involve the ALU.Registers
Gemini has ten internal registers that are implemented as latches, three in the ALU group (A, B, and ACC) and seven in the register group (ZERO, ONE, PC, MAR, MDR, TEMP, and IR). Their intended purpose is listed below: - A: ALU input, target of ACC, TEMP, MAR, MDR, and PC - B: ALU input, target of ZERO or ONE - ACC: ALU output - ZERO:constant zero, used for LDA instruction - ONE: constant one, used to increment PC and MAR - PC: program counter - MAR: memory address register - MDR: memory data register - TEMP:temporary ACC storage while updating PC and MAR - IR: instruction register, the source of the starting address for each machine instruction's microprogram Each register is "addressed" directly by output (.out) and latch enable (.latch) control bits in the current microinstruction. Some registers' outputs are always enabled, such as the A and B ALU inputs, and the instruction register (IR). In these cases there is no output control bit in the microinstruction. As mentioned previously, the unoptimized version of Gemini does not need the MAR and MDR. They are included for future use (hint, hint :-)Memory
The 256 byte main memory has internal address and data latches, and address decoding. Since we are not designing the memory, we will assume an interface protocol that uses the memory address register (MAR) to select the memory location, WRITE/not_read and DATA/not_address signals to activate the selected location, and the memory data register to either provide the data for a store or receive the data during a load. The operation of the memory interface is summarized in the following table:
Microprogrammed Controller
Microcode
The microcode lies at the heart of the microprogrammed controller. Control signals for each microinstruction are generated using the clock, ø, either for the level-triggered latches in the datapath, or the edge-triggered latch in the microcontroller that holds the currently executing microinstruction. The microcode (also known as the microprogram) is a 33-bit wide array of 256 microinstructions. It is not at all unusual to have a microinstruction width that is not a power of two because there is no guarantee that the number of control signals will be a power of two, and because floorspace on the CPU is so limited that it would be foolish to implement unused bits. The organization of Gemini and other typical microprograms is given in the table below. Is START the beginning or the end of a µinstruction sequence? Is the glass half empty or half full? :-)Microinstruction Format
The microinstruction is composed of four groups of control signals: - ALU group - Register group - Memory group - µController group The signals in each group cause a latch to put data onto a bus (.out), capture data from the bus (.latch), select a function or control a multiplexer, or choose the next microinstruction to execute. Most control signals are inactive during microinstruction execution, that is, most bits in the microprogram are zeros. Active signals are conditioned by the clock, ø, to produce the output and latch signals. The format of the Gemini microinstruction is called "horizontal microcode" because all control signals are represented by a bit in the microinstruction. "Vertical microcode" decodes all control signals from a microinstruction (useful if many microinstructions are repeated in machine instructions). "Diagonal microcode" encodes some fields in the microinstruction but not others, a compromise between space and time. In the Gemini microinstruction the ALU functions could be encoded as a pair of bits. A microinstruction specifies a limited form of parallelism. As many events can occur in parallel as there are functional units and buses to process and transmit data without conflict. A good microprogrammer searches for ways to reorganize the machine instruction's microcode so as to extract as much parallelism as possible from the microprogrammer's model. Experience in microprogramming can identify the bottlenecks in instruction execution that can be overcome by hardware. Pipelining is the easiest of these to see. It is possible to output data from two sources simultaneously, but because some of the same data bits will be zero and one, damaging short circuits will almost certainly occur. Sending data to two destinations at the same time is permissible if the source of the data is powerful enough (i.e., its "fan-out" is large enough).
Timing: Where Microcode Meets Hardware
The timing diagram below shows the relationship between the clock, ø, and the signals that feedback to control the microcontroller, or that are sent from the microcontroller to control the datapath.Clock Generator
The system clock provides the "heartbeat" of the microcontroller. The clock speed is determined by the time required for data to flow through the longest path in the CPU, or by the time required to execute the the most time-consuming ALU operation. This kind of design is called "synchronous" because all functions are governed by a central clock. Many instructions are slowed down because the clock cycle is the maximum necessary in the worst case. Another type of design called "asynchronous" permits the CPU to operate as fast as possible for each operation. Asynchronous operation is similar to dataflow processing, but like it, adds complexity to the CPU.Generating the Next MicroPC
First we will examine the operation of the microcontroller. The most significant function it performs is selecting the start of a sequence of microinstructions that implement the function of a machine instruction, then sequencing through the microinstructions until the machine instruction is complete.Instruction Register (IR)
The instruction register (IR) is the source of the starting address for a sequence of microinstructions. However, since there are only three bits in the machine instruction's opcode to specify the µaddress, it is necessary to translate the 1-of-8 opcodes to one of the 256 µcode addresses.MicroPC Map Table
The microPC map table translates the opcode in the IR to a µinstruction address. Note on the timing diagram that the IR, whose output is always enabled, is translated as soon as the data is followed by the latch, but is only guaranteed to be valid after the data is latched. This is still more than enough time for the next microinstruction to be enabled out of the µcode, and be present at the current µinst latch when the rising edge of the clock, ø, signals the end of the executing microinstruction by latching the next one.Action of Carry Flag
The BRC machine instruction uses the technique of "bit forcing" or "bit jamming" to select between one of two µinstruction addresses depending on the state of the carry. This is accomplished by wiring the carry into the least significant bit of the microPC map table, but ONLY for the entry that translates the BRC opcode into the BRC microinstruction address. Thus, when the carry is zero the first of two adjacent microinstructions is executed. When the carry is one, the second of the two adjacent microinstructions is executed. Since every microinstruction includes the address of the next µinstruction to be executed, it is easy for the microprogrammer to vector one of the microinstructions to the appropriate sequence to cause the machine program to branch or to fall through to the next instruction.Selecting the Next µAddress from the µInstruction
As just mentioned, each microinstruction contains the address of the next microinstruction to execute. This is equivalent to having a counter register hardwired into the microcontroller. While it is space-inefficient, this is also a very fast technique. Where the program space is small, as it is in a microprogram, and speed is necessary, this is a good design choice. Larger µcode and more complex sequencing in µcode may require a microPC implemented as a register, with its own set of µcontroller status flags.Current Microinstruction Latch
The current µinstruction is held in an edge-triggered latch so that all control bits are stable throughout the clock cycle. This is necessary so that the next microinstruction address will be valid when the clock rises at the end of the cycle, latching the next microinstruction (see the timing diagram shown earlier).Generating Control Signals
Most control signals are "and"ed with the clock, ø, to produce output/follow and latch signals (see the timing diagram). Some signals are unconditioned, such as the µPC.sel, which is thus given as much time as possible to "steer" the correct microinstruction address into the microcode ROM.ALU Timing
The ALU is one of the critical paths in Gemini, determining the maximum clock rate by the sum of the time needed to output and latch the second input (usually to the B register), compute the function of the B register and the previously latched A register, and latch the output in the accumulator. If the ALU operates too slowly, then a single operation that performs all these actions can be split into two, and the clock rate increased, perhaps even doubled. Since there are many more data transfer operations than ALU operations, the overall result is a much faster CPU.Register and Memory Timing
External memory is up to ten times slower than internal memory, largely due to the resistance and capacitance of the I/O pads on a VLSI circuit. The Gemini architecture, as used in this assignment, ignores differences in register and memory timing, assuming that both are not only connected to the same bus, but are also on the same chip.
Example Microcode
Drawing on the organization of a typical microprogram given previously, an example microprogram with RESET, part of the PROLOG, and a machine instruction µcode sequence for the "LDA m" instruction is given. The microcode is illustrated with dataflow diagrams to "animate" the operation of the system architecture. The best way to understand the microcode, and learn to write your own, is to step through the example line-by-line, comparing each line to the dataflow diagram to see how output control signals and latch control signals are used to route data from source to destination. For dataflow diagrams where the timing is critical, such as decoding the opcode in the IR into a microinstruction, compare the dataflow diagram to the timing diagram. You could, if you want to calculate the maximum clock speed for the Gemini, look up the timing specification of 74LS373 and 74LS374 latches, and the AM2901 Register and ALU slice to get timing data.Dataflow Diagram Animations of "LDA m" Execution
The processor is reset (exactly how we will not consider), and executes the machine instruction at location 0x00 in the Gemini's main memory. THE FOLLOWING ANIMATION ASSUMES THAT THE INSTRUCTION IS "LDA m", BUT IT COULD BE ANY INSTRUCTION!The critical datapath occurs here, when the ONE constant is passed through the B latch to be added with carry into the accumulator.The key step in decoding occurs here, when the IR selects an entry in the map table, which in turn selects the next microinstruction.The "LDA m" machine-instruction specific µcode starts here.The "LDA m" machine instruction is finished, using the point of view that the PROLOG must clean up after each instruction by incrementing the PC if necessary. This explains the branch to START.The PROLOG is now finished, and the next Gemini machine instruction's address is in the MAR, ready to be fetched to start the next sequence of µinstructions.