Microarchitecture

In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch) is a description of the electrical circuitry of a computer, central processing unit, or digital signal processor that is sufficient for completely describing the operation of the hardware.

Scholars use the term "computer organization". People in the computer industry more often say, "microarchitecture". Microarchitecture and instruction set architecture, together, are the field of computer architecture.

Origin of the term

Computers have been using microprogramming of control logic since the 1950s. The CPU decodes the instructions and sends signals down appropriate paths by means of transistor switches. The bits inside the microprogram words controlled the processor at the level of electrical signals.

The term: microarchitecture was used to describe the units that were controlled by the microprogram words. Architecture usually had to be compatible between hardware generations. The underlying microarchitecture could be easily changed.

Relation to instruction set architecture

The microarchitecture is related to, but not the same as, the instruction set architecture. The instruction set architecture is near to the programming model of a processor as seen by an assembly language programmer or compiler writer. It includes the execution model, processor registers, memory address modes, address and data formats, etc. The microarchitecture (or computer organization) is mainly a lower level structure. It manages a large number of details that are hidden in the programming model. It describes the inside parts of the processor and how they work together to implement the architectural specification. ^[1]^[2]^[3]

Microarchitectual elements may be everything from single logic gates, to registers, lookup tables, multiplexers, counters, etc., to complete ALUs, FPUs and even larger elements. The electronic circuitry level can be subdivided into transistor-level details, such as which basic gate-building structures are used and what logic implementation types (static/dynamic, number of phases, etc.) are chosen, in addition to the actual logic design used built them.

A few important points:

A single microarchitecture, especially if it includes microcode, can be used to implement many different instruction sets, by means of changing the control store. This can be quite complicated even when simplified by microcode and/or table structures in ROMs or PLAs.
Two machines may have the same microarchitecture, and so the same block diagram, but completely different hardware implementations.^[4] This manages both the level of the electronic circuitry and even more the physical level of manufacturing (of both ICs and/or discrete components).
Machines with different microarchitectures may have the same instruction set architecture, and so both are capable of executing the same programs. New microarchitectures and/or circuitry solutions, along with advances in semiconductor manufacturing, are what allows newer generations of processors to achieve higher performance.

Simplified descriptions

A very simplified high level description — common in marketing — may show only fairly basic characteristics, such as bus-widths, along with various types of execution units and other large systems, such as branch prediction and cache memories, pictured as simple blocks — perhaps with some important attributes or characteristics noted. Some details regarding pipeline structure (like fetch, decode, assign, execute, write-back) may also be included.

Aspects of microarchitecture

The pipelined datapath is the most commonly used datapath design in microarchitecture today. This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs.^[4] Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central microarchitectural tasks.

Execution units are also essential to microarchitecture. Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, and branch prediction. These units perform the operations or calculations of the processor. The choice of the number of execution units, their latency and throughput are important microarchitectural design tasks. The size, latency, throughput and connectivity of memories within the system are also microarchitectural decisions.

System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the microarchitectural design process. This includes decisions on the performance-level and connectivity of these peripherals.

Unlike architectural design, where a specific performance level is the main goal, microarchitectural design pays closer attention to other constraints. Attention must be paid to such issues as:

Chip area/cost.
Power consumption.
Logic complexity.
Ease of connectivity.
Manufacturability.
Ease of debugging.
Testability.

Micro-Architectural Concepts

In general, all CPUs, single-chip microprocessors or multi-chip implementations run programs by performing the following steps:

Read an instruction and decode it.
Find any associated data that is needed to process the instruction.
Process the instruction.
Write the results out.

Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks, (where the program instructions and data are) has always been slower than the processor itself. Step (2) often introduces a delay (in CPU terms often called a "stall") while the data arrives over the computer bus. A big amount of research has been put into designs that avoid these delays. A central design goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. In the past such techniques could only be implemented on mainframes or supercomputers due to the amount of circuitry needed for these techniques. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip.

What follows is a survey of micro-architectural techniques that are common in modern CPUs.

Instruction Set choice

The choice of which Instruction Set Architecture to use affects the complexity of implementing high performance devices. Computer designers did their best to simplify instruction sets, to enable higher performance implementations by saving designers effort and time for features which improve performance rather than wasting them on the complexity of instruction set.

Instruction set design has progressed from CISC, RISC, VLIW, EPIC types. Architectures that are dealing with data parallelism include SIMD and Vectors.

Instruction pipelining

Main page: Instruction pipelining

One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs performed all of the steps above on one instruction before moving onto the next. Large portions of the processor circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.

Cache

Main page: CPU cache

Improvements in chip manufacturing allowed more circuitry to be placed on the same chip. Designers started looking for ways to use it. One of the most common ways was to add an ever-increasing amount of cache memory on-chip. Cache is a very fast memory that can be accessed in a few cycles as compared to what is needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache, if the data is already in the cache it simply "appears," whereas if it is not the processor is "stalled" while the cache controller reads it in.

RISC designs started adding cache in the mid-to-late 1980s, often only 4KB in total. This number grew over time, and typical CPUs now have about 512KB, while more powerful CPUs come with 1 or 2 or even 4, 6, 8 or 12MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more speed.

Caches and pipelines were a perfect match for each other. It didn't make much sense to build a pipeline that could run faster than the access latency of off-chip cash memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.

Branch Prediction and Speculative execution

Main page: Branch prediction

Main page: Speculative execution

Pipeline stalls and flushes due to branches are the two main things preventing achieving higher performance through instruction level parallelism. From the time that the processor's instruction decoder has found that it has encountered a conditional branch instruction to the time that the deciding jumping register value can be read out, the pipeline might be stalled for several cycles. On average, every fifth instruction executed is a branch, so that's a high amount of stalling. If the branch is taken, its even worse, as then all of the subsequent instructions which were in the pipeline needs to be flushed.

Techniques such as branch prediction and speculative execution are used to reduce these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is executed before it is known whether the branch should be taken or not.

Out-of-order execution

Main page: Out-of-order execution

The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the main memory hierarchy. It does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, then re-orders the results to make it appear that everything happened in the programmed order.

Superscalar

Main page: Superscalar

Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used.

In the outline above the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the integrated circuit (some times called "die") area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the market.

In modern designs it is common to find two load units, one store (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The instruction issue logic grows in complexity by reading in a huge list of instructions from memory and handing them off to the different execution units that are idle at that point. The results are then collected and re-ordered at the end.

Register renaming

Main page: Register renaming

Register renaming refers to a technique used to avoid unnecessary serialized execution of program instructions because of the reuse of the same registers by those instructions. Suppose we have two groups of instruction that will use the same register. One set of instruction is executed first to leave the register to the other set, but if the other set is assigned to a different similar register both sets of instructions can be executed in parallel.

Multiprocessing and multithreading

Due to the growing gap between CPU operating frequencies and DRAM access times, none of the techniques that enhance instruction-level parallelism (ILP) within one program could overcome the long delays when data had to be fetched from main memory. The large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to utilize higher levels of parallelism that exist outside of a single program or program thread.

This trend is sometimes known as "throughput computing". This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with big numbers of transactions at the same time. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.

One technique of how this parallelism is achieved is through multiprocessing systems, computer systems with multiple CPUs. In the past this was reserved for high-end mainframes but now small scale (2-8) multiprocessors servers have become common for the small business market. For large corporations, large scale (16-256) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.

Advances in semiconductor technology reduced transistor size; multicore CPUs have appeared where multiple CPUs are implemented on the same silicon chip. Initially used in-chips targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPUs CMP chips to be manufactured in volume. Some designs, such as UltraSPARC T1 used simpler (scalar, in-order) designs to fit more processors on one piece of silicon.

Another technique that has become more popular is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.

Multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.

A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.