CPU Instructions

A computer chip can do simple arithmetic, compare numbers, and move numbers around in memory. Everything else, from word processing to browsing the Web, is done by programs that use those basic instructions. CPUs get faster in three ways. First, better designs can do the simple operations faster. Second, better designs can do as many as six simple operations at the same time in different areas of the CPU. Third, since a lot of time is lost if the CPU has to wait for data from slower memory, techniques that reduce the memory wait time appear to speed up the CPU.

It's All Numbers [but not Math]

At the hardware level, a computer executes sequences of individual instructions. Each instruction tells the computer to add, subtract, multiply, or divide two numbers, to compare numbers to see if they are equal or which is larger, or to move numbers between the CPU and a location in memory. The rest of the instructions are mostly housekeeping.

Everything in a computer is represented as numbers. Each memory location has a numeric "address" that identifies it. Each I/O device (disk, CD, keyboard, printer) has a range of assigned address numbers. Every key pressed on the keyboard generates numbers. Every dot on the computer monitor has an address, and the color of the point is represented by three numbers that mix the primitive colors of red, green, and blue. Sound is represented as a stream of numbers.

Consider the automatic error correction in a word processor. If you type in " teh " the computer appears to recognize the common misspelling and changes it to " the ". What does this have to do with numbers? Well, every character that you type, including the space bar, transmits a code to the computer. The code is ASCII, and in that code a blank is 32, "a" is 97 and "z" is 122. So the computer sees " teh " as the sequence 32 116 101 104 32. The word processor has been programmed to check for this sequence, and when it sees it, it exchanges the 101 and 104. The CPU chip doesn't know about spelling, but it is very fast and accurate at handling numbers.
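
The " teh " example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real word processor is written; the function name fix_teh is made up for the example.

```python
def fix_teh(text):
    """Hypothetical autocorrect: treat the text as a list of ASCII codes."""
    codes = [ord(ch) for ch in text]       # what the computer actually sees
    pattern = [32, 116, 101, 104, 32]      # blank, "t", "e", "h", blank
    for i in range(len(codes) - len(pattern) + 1):
        if codes[i:i + len(pattern)] == pattern:
            # Exchange the 101 ("e") and the 104 ("h").
            codes[i + 2], codes[i + 3] = codes[i + 3], codes[i + 2]
    return "".join(chr(code) for code in codes)


print([ord(ch) for ch in " teh "])   # [32, 116, 101, 104, 32]
print(fix_teh("I saw teh cat"))      # I saw the cat
```

Nothing in the function knows what a word is. It compares and moves numbers, which is all the CPU ever does.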

You might then think that the speed of the computer is determined by how fast it can add.  People expect this because adding large numbers takes us a long time. Ask someone how much is 2+2 and they will respond immediately 4. Ask how much is 154373 + 382549 and they will stop for a minute and take out a pencil. A computer adds numbers with electronic circuits that work just as fast for large or small numbers. Arithmetic is what computers do best, and they do it almost instantly.

If I ask you to add 2+2, you can do it immediately. Now suppose I put two numbers in different rooms of your house, write the name of the room on a sheet of paper, put the paper in an envelope, and ask you how long it will take to find the numbers and add them. You won't know until you open the paper and find out where the numbers are. It gets worse if some of the numbers are in other houses in the neighborhood and I put the address of the house on the paper instead of just the room name.

The CPU has different places to store numbers. It has 8 or 16 "registers" which require no delay at all. It has "L1 cache" which is almost instantaneous, and L2 Cache which is just a little slower. Then it has main memory. Memory is very fast these days, but it is so slow compared to the speed of the CPU that you waste hundreds of instructions waiting for a response.

The computer processes each instruction through a sequence of steps. First it has to read the next instruction itself, which may be in cache or may be in memory and have to be fetched. Then the computer decodes the instruction to determine what is to be done and, more importantly, where the data that the instruction needs is located. It may be in registers, L1 cache, L2 cache, or memory. The CPU has to fetch the data, then turn the instruction over to one of the processing units. There are many preliminary steps, and then several processing steps. So the CPU processes instructions through a "pipeline" that behaves like an assembly line (where the work comes to the workers) or like a cafeteria line (where the users come to the food).

A CPU is measured by how many instructions it can process in a second, not by how long it takes to process any single instruction. Consider a fast food counter. They have a bunch of lines, several people working the counter, and lots of people in back cooking the food. They measure themselves by how many customers they serve in any period of time. When you come to the front of the line, the item you want may be temporarily unavailable and you have to step aside. It might take you unusually long to get your burger, but lots of other people are being served during the period. To you, the service is slow. To the business, they are moving lots of people through.

In the same way, a CPU is designed to fetch programming, fetch data, and execute instructions. Sometimes a particular instruction needs data that is not immediately available. All modern processors can push the instruction aside and have it wait while subsequent instructions are serviced. Speed is measured by the overall throughput of the chip.

The High School Analogy

The first generation of PC CPU chips was like a one room schoolhouse. A class of students could enter and be seated. The first period would be English. When the bell rings, they switch books and take a period of Math. Then History, a Language, and finally Science. After the last subject, the school day is done. However, in the computer version of the "school" another class of students immediately enters the building and begins its subjects.

If you want the school to educate students more efficiently, you could try to shorten the periods (speed up the clock). However, you can also speed up things by building more classrooms. That is what happened with the 286, 386, and 486 generations of chips. In a school designed like a 486, there is one classroom for each subject. When the bell rings, the students in the English room move to Math, the Math students move to History, and so on. The students in the last class, Science, leave the school. A new class enters and sits down in the English classroom to begin their sequence of subjects.

Each new generation of chips typically triples the number of circuits of the previous generation. So the fifth generation chip, the Pentium, added a complete second set of classrooms. Now two groups of students would take each subject at the same time.

If the first five generations of CPU acted like a grade school and then a high school, processors after the Pentium II act a bit like college. The chip has some larger number of internal instruction processing stations. Some handle integers, and some handle floating point numbers. Instructions enter execution and are given a sequence of operations they need to perform. In a sense, the instructions wander around from station to station with some level of independence. Some instructions get done quickly, some take longer. However, there is a rule that the instructions must end in the order in which they began. So the instructions that get done quickly have to wait at the exit for the slower instructions that entered before them to finish up.

This analogy also explains an important detail about the clock rate. Speeding up the clock doesn't tell the computer to do anything faster. Each circuit performs its operation as fast as it can. The clock tells the circuits when to begin the next set of operations. If the clock is too fast, the next operation begins before the previous operation is complete, the data is corrupted, and the system crashes.

Dependent Instructions

Suppose you want to add three numbers together:

    5 + 22 + 7

A person and a computer program will first add 5 to 22 getting 27. Then adding 27 to 7 gets 34. Two operations are performed. Since the second operation uses the result (27) from the first operation, they have to be done in order.

Now consider adding four numbers together:

    5 + 22 + 7 + 18

A person will accomplish this by appending a third operation that adds the 34 calculated by the first two operations to 18 to get 52. However, a computer can perform more than one numerical operation at the same time, provided that the two operations are independent of each other. So if you want to optimize this for a modern PC, you would arrange the instructions as follows:

  1. Add 5 and 22  (27)
  2. Add 7 and 18 (25)
  3. Add the results of the previous two steps, 27 and 25, together (52).

Since steps 1 and 2 don't depend on each other's results, they can both be run at the same time. Step 3 requires the results of both previous steps, so it runs in the next cycle. As a result, the computer can add four numbers together in the same two cycles it took to add just three numbers together, because the first two operations can both run in the first cycle at the same time.
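
The cycle counts above can be checked with a small Python sketch. It assumes an idealized machine in which every addition takes exactly one cycle and there are enough adders to run every independent addition at once; the function cycles_needed and the result names s1, s2, s3 are invented for the illustration.

```python
def cycles_needed(ops):
    """ops maps a result name to the two inputs that are added to produce it.

    An input that is not itself a result is a plain number, ready at once.
    """
    ready_at = {}

    def finish(name):
        if name not in ops:
            return 0                      # a starting number, no waiting
        if name not in ready_at:
            left, right = ops[name]
            # An addition can start only after both of its inputs exist.
            ready_at[name] = 1 + max(finish(left), finish(right))
        return ready_at[name]

    return max(finish(name) for name in ops)


# 5 + 22 + 7 + 18 done one after another, the way a person would.
chained = {"s1": ("5", "22"), "s2": ("s1", "7"), "s3": ("s2", "18")}
# The same sum arranged so the first two additions are independent.
paired = {"s1": ("5", "22"), "s2": ("7", "18"), "s3": ("s1", "s2")}

print(cycles_needed(chained))   # 3
print(cycles_needed(paired))    # 2
```

The answer is 52 either way. Only the length of the longest chain of dependent operations changes.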

The original Pentium chip could execute two instructions at the same time, provided that they were not dependent on each other. It required the programmer or compiler to arrange the instructions in an optimal order. The Pentium II, III, and Pentium 4 CPU chips internally rearrange instructions when they are not dependent on prior results, so optimization doesn't depend as much on how the program is coded.

Registers

All computers designed in the last forty years hold data in "registers". If you are adding up a column of numbers, the register holds the running total. If you are scanning a document for spelling errors, a register keeps track of your location in the document.

The original 16-bit Intel CPU design had a very small number of highly specialized registers known by letters. As it happened, the letters were associated with words that described their use. If you were adding up numbers in the column of a spreadsheet, the A register "accumulated" the total, the B register was the "base" and pointed to the column or cell, and the C register held the "count" of the number of cells remaining to be added.
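
As a rough sketch of that division of labor, here is the column sum written in Python with three variables standing in for the three registers. The names a, b, and c and the function sum_column are illustrative only; real 16-bit code would be written in assembly language.

```python
def sum_column(memory, start, cells):
    a = 0        # A "accumulates" the running total
    b = start    # B is the "base": where the current cell lives
    c = cells    # C is the "count" of cells still to be added
    while c != 0:
        a += memory[b]   # add the cell that B points to
        b += 1           # point B at the next cell
        c -= 1           # one fewer cell remaining
    return a


column = [5, 22, 7, 18]
print(sum_column(column, 0, len(column)))   # 52
```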

In 1985 Intel introduced the 386 CPU with a new set of 32-bit instructions. The original seven highly specialized 16-bit registers became seven largely interchangeable general purpose registers. However, it was not until ten years later that Microsoft released a generally available operating system (Windows 95) that made use of the 386 instructions and registers.

The 32-bit instruction set of the 386 chip has survived for almost 20 years. Meanwhile, Moore's Law tells us that the number of circuits on a chip doubles about every 18 months. Hardware is much easier to change than all the software. A modern CPU chip has a lot more than 7 registers, but they are invisible to the user and even to the operating system.

A program may have a sequence of operations that one after another "accumulate" different totals into the A register. In each step, a different "count" may be loaded into the "C" register. However, each of these operations may be independent of the other. Under the covers, the CPU may recognize this and speed up processing by allowing operations to run in parallel. In doing so, the CPU will assign two real registers to pretend to be the A and C registers for one group of operations, while a different pair of real registers will pretend to be A and C for different operations. Of course this pretending is complicated and only goes so far.
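
A much simplified sketch of this pretending, written in Python, may make the idea concrete. Everything here is an assumption made for illustration: the real register names P0 to P7, the pool size, and the rule that every write gets a fresh real register that is never recycled. Actual hardware is far more elaborate.

```python
def rename(program, physical_count=8):
    """program is a list of (destination, sources) pairs of visible registers."""
    current = {}                                      # visible name -> real register
    free = [f"P{i}" for i in range(physical_count)]   # hypothetical real registers
    renamed = []
    for dest, sources in program:
        # A read uses whichever real register is currently playing that role.
        real_sources = [current.get(src, src) for src in sources]
        # A write is handed a fresh real register from the pool.
        real_dest = free.pop(0)
        current[dest] = real_dest
        renamed.append((real_dest, real_sources))
    return renamed


# Two independent groups: load a count into C, then accumulate into A using C.
program = [("C", []), ("A", ["C"]), ("C", []), ("A", ["C"])]
for step in rename(program):
    print(step)
# ('P0', [])
# ('P1', ['P0'])
# ('P2', [])
# ('P3', ['P2'])
```

After renaming, the first group touches only P0 and P1 and the second only P2 and P3, so nothing stops the two groups from running at the same time.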

In 2003 AMD introduced its Athlon 64 family of processors with a 64-bit instruction set. Initially Intel resisted, but it has finally caved in and cloned the AMD design. Server programs benefit from the ability to use more than 4 Gigabytes of memory, which after all is only about $400 worth of memory. However, for every type of program the more important feature of the new 64-bit instructions may be a new set of 8 registers that compilers can now use to optimize program execution. A few programs run 20 to 30% faster thanks to the extra registers.

Memory Access Delay

Memory is a lot slower than the CPU. If an instruction requires data that is out in the main memory of the computer, it may have to wait for a period of time equal to the processing of hundreds of instructions. Since some of the subsequent instructions will depend on the results of this previous operation, the CPU will halt waiting for memory.

To get around this problem, a CPU has two types of internal high speed memory to hold recently used instructions and data. This high speed memory is called "cache".

The best type of internal memory is the Level 1 (L1) cache. This memory is part of the CPU core along with the units that decode instructions and perform arithmetic. If the instruction and data are in L1 cache then the CPU can execute at full speed. The modern Intel processors have 32K of L1 internal cache. Competing processors from AMD have even more.

When the instruction or data is not found in the L1 cache, modern processors have a larger amount of Level 2 cache associated with each CPU core. The 65nm generation of Intel processors commonly available during 2007 had 2 or 4 megabytes of L2 cache. The next generation of 45 nm processors available during 2008 will have larger L2 caches.

Each L2 cache is tied to a specific processor core. Some CPU chips have an additional Level 3 cache that is shared by all the CPU cores. Eventually an instruction requires data that is not in any of the cache levels, so it has to get the data from memory.

The main memory of the computer is Synchronous Dynamic Random Access Memory (SDRAM). Random Access means that any location in memory can be used after any other location. Dynamic Random Access refers to an architecture that provides lots of memory but at a slower speed than the "Static" memory architecture used in cache. Synchronous means that the memory transfers data at a fixed speed determined by an external clock, like music students in class keeping time to a ticking metronome.

AMD connects the memory directly to the CPU chip, and Intel is expected to do the same thing in the generation of CPU chips that will become available during 2009. For now, an Intel CPU chip connects to a Northbridge chip on the mainboard, and the Northbridge connects to the memory and provides the memory clock. When a Northbridge is used, there is no requirement that the data transfer speed of the CPU exactly match the data transfer speed of the memory. The Northbridge can buffer data and slow down whichever device is faster. However, the bus speed of memory determines the fastest speed at which data can be transferred. There is another number that determines how fast the memory really is.

After the computer generates the address of the desired memory location, there is a delay called the "latency" before the memory begins to respond. Then it transfers data at the rated clock speed. The problem is that the latency is measured in tens of nanoseconds, while a modern CPU can execute 12 to 24 instructions per nanosecond.

Latency is the performance killer. In the time it takes to fetch a new byte of data from a new address, the CPU could have executed hundreds of instructions. By reordering subsequent instructions that do not depend on the results of this memory fetch, a CPU might continue to run for a few dozen instructions, but then it will stop. Even if the L1 and L2 cache handle more than 99.5% of all data requirements inside the CPU itself, the latency delay may mean that a typical CPU with a typical workload spends half its time waiting for the memory to respond while executing no instructions.
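
A back-of-the-envelope model shows how a tiny miss rate can eat half the time. The sketch below is only that, a sketch. It assumes that every instruction makes one memory reference, that an instruction takes one unit of time when it does not wait, and that a trip to main memory costs 200 of those units (a stand-in for "hundreds of instructions"). The function name and all three numbers are assumptions chosen for the illustration.

```python
def fraction_waiting(hit_rate, miss_penalty, accesses_per_instruction=1.0):
    """Share of total time spent stalled on memory, under the assumptions above."""
    # Average stall time added to each instruction by cache misses.
    stall = (1.0 - hit_rate) * miss_penalty * accesses_per_instruction
    # Each instruction costs 1 unit of work plus its share of stall time.
    return stall / (1.0 + stall)


print(round(fraction_waiting(0.995, 200), 2))   # 0.5
print(round(fraction_waiting(0.999, 200), 2))   # 0.17
```

With these numbers, a cache that satisfies 99.5% of requests still leaves the CPU idle about half the time, and improving the hit rate by a fraction of a percent recovers most of it.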

The most visible number on a memory stick is its clock speed. DDR 2 memory may be rated at 533, 667, 800, or 1066 MHz. Latency is then expressed in terms of these clock ticks. There are several latency numbers, but the most important is CAS latency. If you double the speed of the clock (from 533 to 1066 MHz), but then also double the CAS latency from 4 to 8, then the higher clock speed hasn't really done much.
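
The arithmetic is easy to check. The sketch below follows the text in counting the CAS latency in ticks of the rated clock; on real DDR parts the latency is counted against a clock running at half the rated figure, which scales both results equally and leaves the comparison unchanged. The function name is made up for the example.

```python
def cas_delay_ns(clock_mhz, cas_ticks):
    """Time spent waiting out the CAS latency, in nanoseconds."""
    # One tick of an N MHz clock lasts 1000 / N nanoseconds.
    return cas_ticks * 1000.0 / clock_mhz


slow = cas_delay_ns(533, 4)
fast = cas_delay_ns(1066, 8)
print(round(slow, 2), round(fast, 2))   # 7.5 7.5
```

Both sticks make the CPU wait the same length of time for the first piece of data. The faster clock helps only once the data starts flowing.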

RISC Architecture

The first Intel "CPU on a chip" was the 4004 processor. It was more like a pocket calculator than a real computer. It handled ordinary base 10 digits encoded as four bits. Later chips added the ability to handle 8 bit, 16 bit, and 32 bit numbers. So on a modern Intel CPU chip there is no single Add instruction. Instead, there are separate Add operations for digits, bytes, and every other size of number. The resulting set of possible instructions is a mess. This is typical of a "Complex Instruction Set" computer chip.

In your Sunday paper, right next to the CompUSA insert there is probably something from Sears. Look at the last few pages of the ad, where they show the tools. There will almost certainly be a picture of the traditional "190 Piece Socket Wrench Set." If you purchased this item, you would always have the right tool for any job. In reality, it is almost impossible to keep all the pieces organized, and you will spend minutes searching through all the attachments to find one of the right size.

Go to a tire store. They lift your car off the floor, remove the hubcaps, and then pick up a gun shaped device connected to a hose. "Zuuurp" and each bolt comes off the wheel. You could do the same thing with the 190 Piece Socket Wrench Set, but every garage knows that automotive wheel bolts come in only one size. So they don't have to spend time searching for the right size tool, and they can optimize the one size that they really need.

When computer designers realized the same thing, it was called Reduced Instruction Set Computers or RISC. Make all the instructions the same size. Use only one size of data. Simplify the instructions and therefore the operation decode. Then use all the room on the chip to optimize what is left, rather than filling the chip with support for instructions that are seldom executed.

Today the RISC philosophy of CPU design is represented by the processor in most cell phones, the IBM Power line of processors used in the XBox 360, PS/3, and Wii, and big Unix servers. However, the advantage of a Reduced Instruction Set turned out to be most important in the period when chips had 2-3 million transistors (during the period of the late 486 chips and the early Pentium chips). When the PowerPC was first announced, it was billed as having "the power of a Pentium at the price of a 486."

Every 18 months the CPU chip doubles the number of transistors it can hold. Today's CPU has hundreds of millions of transistors. It quickly became unimportant to alter the work to simplify the design of the computer. RISC today has its greatest effect in video game consoles, where the computer program is specifically designed for the hardware and maximum performance is worth the extra investment in design.

Pipeline, Superscalar

Although a tire store may be fast at changing tires, when you really need speed look at how they do things in Indianapolis. A race car pulls into the pit for service. They jack it off the ground, and then four teams of mechanics go to work on all four wheels simultaneously. The car is back in the race in a matter of seconds. In ordinary life, such service would be prohibitively expensive. But in the world of microelectronics, transistors are cheap.

A pipeline is the sequence of processing stations that decode instructions, fetch data, perform the operation, and save the results. Inside the CPU, instructions are processed at a sequence of stations that resemble an assembly line. Fifteen years ago, a CPU would process instructions in five or six steps. Each step is completed in one clock cycle.

In order to speed up the clock, it is necessary to break the processing down into smaller steps that can be accomplished in the shorter clock cycle. A modern Intel CPU may have a pipeline with 40 steps in it. One instruction in the program occupies each step. At each tick of the clock, all of the instructions advance one step forward in the pipeline. An instruction may finish at the end of the line, and a new instruction may enter at the beginning.
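
The payoff of a pipeline is easiest to see by counting clock ticks. The sketch below assumes an ideal pipeline: one instruction per step, every step takes one tick, and nothing ever stalls. The function names are invented for the illustration.

```python
def ticks_one_at_a_time(instructions, steps):
    """Each instruction walks the whole line before the next one may enter."""
    return instructions * steps


def ticks_pipelined(instructions, steps):
    """A new instruction enters the line at every tick."""
    # The first instruction needs `steps` ticks to reach the end of the line.
    # After that, one more instruction finishes on every tick.
    return steps + instructions - 1


print(ticks_one_at_a_time(1000, 40))   # 40000
print(ticks_pipelined(1000, 40))       # 1039
```

Any single instruction still spends 40 ticks in the line. What improves is the throughput, which approaches one instruction per tick.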

Pipelines have a potential problem whenever the program encounters a branch instruction. This is a decision point where the program will continue by executing one of two alternate paths of new instructions. The problem is that the CPU will not really know which of the two paths will be taken until the branch instruction is at or near the end of the pipeline. To keep the pipeline full, the CPU has to guess which of the two alternate instruction paths will be executed and begin processing it through the pipeline. If this "branch prediction" is wrong, then the partially executed path has to be abandoned, and the correct path has to enter the pipeline at the beginning. A mistake in branch prediction can cause the CPU to miss around 30 clock cycles of execution.
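
To get a feel for what a 30 cycle penalty costs, here is a rough estimate in Python. The penalty comes from the paragraph above; the other numbers (one instruction in five is a branch, and the CPU guesses wrong on 5% of them) are assumptions picked for illustration, as is the function name.

```python
def cycles_per_instruction(branch_fraction, mispredict_rate,
                           penalty_cycles=30, base=1.0):
    """Average clock cycles per instruction once wrong guesses are counted."""
    # Only a mispredicted branch pays the penalty.
    extra = branch_fraction * mispredict_rate * penalty_cycles
    return base + extra


print(round(cycles_per_instruction(0.20, 0.05), 2))   # 1.3
print(round(cycles_per_instruction(0.20, 0.01), 2))   # 1.06
```

Under these assumptions, guessing wrong on just 1 branch in 20 slows the whole program by about 30 percent, which is why so much of the chip is devoted to guessing well.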

A computer is superscalar when it can execute more than one instruction per clock cycle. The pipeline discussion talked about one instruction ending and one beginning at every clock tick. A Pentium 4 CPU can actually start or terminate up to three instructions in a clock cycle. Along the pipeline, most of the processing steps are duplicated. The CPU can be adding two or more pairs of numbers at the same time.

However, one of the things that makes a Pentium 4 or AMD CPU so complicated is that this ability to execute more than one instruction at a time has to be completely hidden from the program. The program is written to execute one instruction after the other, and the CPU produces results that exactly duplicate this behavior. So to use the extra processing power, the CPU chip must have a large amount of complex control logic to detect when two instructions that the program is written to execute one after the other are actually independent and can really be executed at the same time.

SIMD

There are two processing units in a typical home computer. The CPU is made by Intel or AMD, and it is the chip you normally hear about. However, in most systems there is actually a second chip that, in raw computational ability, is a much more powerful computer. It is the main chip on the video card, the Graphics Processing Unit or GPU.

The GPU is not the kind of general purpose computer for which you could write an operating system or applications. It does a small number of things over and over, but it is very fast when doing them. It also has some local memory that may be faster than the main memory of your computer.

What makes the GPU so powerful? Data is displayed on the screen as a set of dots. Each dot is represented by three numbers for the three colors. Three dimensional applications (mostly video games) execute mathematical operations to calculate the correct values for each color of each dot in some area of the screen. Video images are compressed into the MPEG 2 streams of a DVD or HDTV by comparing the colors of adjacent dots with trial and error to find a mathematical sequence that can generate the same image pattern while occupying considerably less memory.

This can always be done one instruction at a time, but it is repetitive. More importantly, whatever you do to one dot you also have to do to the next dot and the one after it.

In a square dance, someone stands at the microphone calling out the next step. In unison, all the dancers on the floor do the same thing, then the caller announces another step.

You can design a processing unit the same way. One part of the processor reads the program and determines what the next operation should be. However, unlike a PC CPU, the instruction doesn't apply to one number or a pair of numbers. Instead, a whole line of numbers has been loaded into the unit, and the one instruction applies to all of them at the same time. This is called SIMD, for "Single Instruction, Multiple Data". Thirty years ago on big room sized mainframe computers, it was called "vector processing."
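
The difference in programming style can be sketched in Python. This is only a model of how the work is described, since Python itself still processes one number at a time; in real SIMD hardware the whole line is handled at once. The function names and the brightening example are made up for the illustration.

```python
def brighten_one_at_a_time(dots, amount, limit=255):
    """Conventional style: fetch, add, clamp, store, then move to the next dot."""
    result = []
    for value in dots:
        value = value + amount
        if value > limit:
            value = limit
        result.append(value)
    return result


def brighten_simd(dots, amount, limit=255):
    """SIMD style: one 'add and clamp' step applied to the whole line of dots."""
    return [min(value + amount, limit) for value in dots]


line_of_dots = [10, 200, 250, 90]
print(brighten_simd(line_of_dots, 20))   # [30, 220, 255, 110]
```

Both functions give the same answer. The second one states the work as a single operation on a line of numbers, which is what lets a SIMD unit do it in one step.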

Fifteen years ago, the first SIMD chips began to be used in Personal Computers. They weren't powerful enough to be used for video applications, but they could provide support for the much less complicated processing of audio data. Such chips are called DSPs for "Digital Signal Processor". They could be used for everything from computer modems to removing the sound of scratches from old phonograph records. Today, CPUs and SIMD are much faster and more powerful.

There is a small amount of SIMD capability built into the Intel and AMD CPU chip. It is used to support multimedia and games. In an Intel chip, it is called MMX, SSE, SSE2, and SSE3. AMD SIMD is called "3DNow!", and like Intel it has gone through several generations.

A number of vendors are building specialized SIMD CPU chips that are somewhere between the highly specialized design of the GPU and the general design of the CPU. Sony and IBM are collaborating on the "Cell" processor for the Playstation 3. Smaller vendors offer boards that can plug into a conventional PC to speed up the processing of games or scientific computation.

More Core or Fusion?

Intel and AMD now have mainstream products with two CPU cores and are beginning to introduce chips with four CPU cores. Extra general purpose cores are useful in corporate servers where dozens of different requests can come in each second from hundreds of remote users. However, the typical desktop user only does a few things at once.

Every 18 months the number of transistors on a chip is expected to double. This could be used to create CPU chips with 4, 8, or 16 general purpose cores, but there is another model.

The "Cell" CPU chip in the Playstation 3 is based on a radically different design where there is only one general purpose CPU core, but then there are six specialized processors each with their own internal memory. These specialized processors perform bulk operations on blocks of data under the control of the one general purpose core.

In a desktop computer, the Graphics Processing Unit (GPU) in the video card also has its own memory and the ability to do bulk processing on blocks of data sent to it by the CPU. The GPU handles 3D games and High Definition movies by design. There are a handful of programs that can transfer processing from the CPU to the GPU, and when this is possible the GPU tends to process the work 7 times faster than the general purpose CPU.

AMD proposes a long term strategy called "Fusion". Instead of 8 or 16 general purpose CPU cores, they propose that future CPU chips may contain a maximum of 4 general purpose cores and then some sort of specialized processors similar to those in the GPU or Cell. Intel seems to be working on the same design, but without a codename. Right now the problem is that any of these designs requires specialized programming and, other than compressing a TV program so it displays on your iPhone, there is no "killer app" driving demand.

The CPU Market

A CPU can execute two or three instructions per cycle. Memory continues to have a delay (the "latency") of around 50 nanoseconds between the time the CPU makes a request for data and the time that the memory can respond with the data. If the CPU has to wait for memory, a 3 GHz processor could have executed conservatively 2 (instructions per clock) times 3 (cycles per nanosecond) times 50 nanoseconds or 300 instructions in the time it takes the memory to respond. Making the CPU run faster doesn't help.
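
The same multiplication in Python makes it easy to try other numbers. The function name is made up for the example, and the second call uses a hypothetical faster chip rather than any real product.

```python
def instructions_lost(per_cycle, clock_ghz, latency_ns):
    """Instructions the CPU could have run during one memory wait."""
    # A clock of N GHz delivers N cycles in every nanosecond.
    return per_cycle * clock_ghz * latency_ns


print(instructions_lost(2, 3, 50))   # 300
print(instructions_lost(3, 4, 50))   # 600
```

A faster CPU with the same memory simply loses more instructions on every wait.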

Fortunately, programs use the CPU in only a few common patterns:

- An interactive user running Office or a Web browser uses the CPU only in short bursts. The computer sits around waiting for the next keystroke or for some data to arrive over the network. If the CPU were able to respond in .01 seconds instead of .02 seconds, the human being would not notice the difference.

- A computer processing a stream of video data or running a video game has a lot of processing to do. The CPU is 100% busy. Most of the time, the program fetches the next byte of data from the stream. Memory access can be predicted, so the CPU seldom has to wait for an unexpected reference to a random memory location.

- A computer acting as a Web or application server, however, runs hundreds of small programs on behalf of thousands of remote user requests. No single request uses a lot of CPU. There is no way to anticipate the next request or the data that it will need from memory. This is the kind of usage pattern where the CPU is most likely to have to wait for memory to respond with data, and because the server handles so many remote users, it is also the case where performance is most important. If the CPU supports HyperThreading, then when the instructions for one thread block waiting for memory, there is an entirely different thread sitting in the CPU with an independent set of instructions able to execute until memory responds or, by bad luck, until the second thread also blocks for memory.

Currently Intel and AMD have four families of CPU chip.

  1. Core 2 Duo ("Mainstream"). Plugs into a socket with 775 pins and transfers data four times per tick of a 200 to 333 MHz clock (800 to 1333MHz FSB). Internally it runs at a speed from 2 to 3 GHz. Prices run from $125 to $(silly) per chip. AMD calls this the Athlon.
  2. Celeron ("Value"). An inexpensive version of the mainstream that plugs into all the same boards. It has a single core, slower internal clock speeds, slower FSB speed, and smaller cache. It also uses a lot less power. Celeron prices, however, are $40 to $65. AMD calls this the Sempron.
  3. "Mobile". Various versions of the single core mainstream chip that have been optimized to run at very or ultra low power. Some versions of this chip drop the power use from the conventional desktop 60-90 watts down to as low as 10 or even 5 watts. AMD calls this the Turion.
  4. Xeon ("Server"). This is a version of the mainstream chip that has been modified so a mainboard can have two or more CPU chips. Xeon was also the first family to roll out a quad-core (4 core) chip. Intel has demonstrated servers with four Quad-core Xeon chips, providing a total of 16 CPUs. AMD calls this the Opteron.

These marketing labels have survived many generations of hardware change so they reflect a price and target rather than any particular technology.

Intel has certain technological advantages on the margins, but it is not clear that they translate into an advantage to any of us. The Nehalem generation of processors is faster, but the mainboards are so much more expensive that the total AMD package is often more attractive. Intel and AMD both have Quad core models, but AMD has the only CPU chips with three working cores. A recent test of many applications by Tom's Hardware shows that most applications only benefit from the first three processors and that the fourth core sometimes slows things down. (Computer programs using more than one core have to periodically check in, coordinate action, and communicate. This adds overhead. As the number of cores increases, the overhead goes up but the benefit of the next core goes down. At some point the overhead becomes greater than the benefit, and this article suggests that the crossing point for many applications is around the point where you add the third core.)
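
The shape of that tradeoff can be shown with a toy model in Python. The numbers are invented purely to illustrate the curve and are not taken from the Tom's Hardware test or from any measurement: the model assumes that 80% of a program's work can be divided among cores and that each extra core adds coordination overhead equal to 8% of the original running time.

```python
def speedup(cores, parallel_share=0.80, overhead_per_extra_core=0.08):
    """Toy model: how much faster a program runs on `cores` cores."""
    serial = 1.0 - parallel_share                       # work that cannot be divided
    divided = parallel_share / cores                    # work shared among the cores
    overhead = overhead_per_extra_core * (cores - 1)    # checking in and coordinating
    return 1.0 / (serial + divided + overhead)


for cores in (1, 2, 3, 4):
    print(cores, round(speedup(cores), 2))
```

With these made-up values the second and third cores each help, though by less each time, and the fourth core makes the program slightly slower than it was with three. Different assumptions move the crossing point, but the curve always flattens and then turns down once overhead outgrows the benefit.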