A new direction in CPU design

An Intel research paper suggests a new approach to the circuit design of a CPU chip that might show up in future Intel products. Current systems have no internal error detection circuits, so they depend on wide margins of tolerance to make errors very, very unlikely. Suppose, however, a future chip design adds internal error detection and correction to every CPU unit, and then runs the chip at a higher speed or a lower power level. Errors become much more likely, but because they can now be detected and corrected, they are safe.

The basic ideas are easy to understand if you will allow some simplified analogies. A computer chip consists of circuits that each hold a value of 1 or 0. Physically they are transistors and capacitors that can be regarded as full or empty of electric charge. You can think of this as a billion shot glasses, each either full or empty of liquid.

A bartender may pour one glass at a time, but even then he quickly learns roughly how long it takes to fill a glass, so he doesn't have to watch it carefully. If you have to pour a billion glasses at once, you are definitely going to estimate the time. The computer does this with its internal "clock".

If you pour too long, the glass spills over and makes a mess. Inside the computer, overfilling is not a problem for an electric charge. However, different circuits fill at different speeds because of slight variations in the manufacturing process and materials. If the pour stops too soon, a circuit that was supposed to be full is not, and the computer makes a mistake.

Voltage operates like water pressure. Increase the voltage and you can fill the glass more quickly. The problem is that higher voltage also draws more power and generates more heat.
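To make that tradeoff concrete, here is a rough back-of-the-envelope sketch using the standard CMOS dynamic-power approximation, power ~ capacitance x voltage^2 x frequency; the capacitance and frequency figures below are invented purely for illustration.

```python
# Rough sketch of the CMOS dynamic-power approximation: P ~ C * V^2 * f.
# All numbers are invented for illustration only.

def dynamic_power(c_eff, voltage, freq_hz):
    """Approximate switching power in watts."""
    return c_eff * voltage**2 * freq_hz

base = dynamic_power(c_eff=1e-9, voltage=1.0, freq_hz=3e9)  # 3.0 W
fast = dynamic_power(c_eff=1e-9, voltage=1.2, freq_hz=3e9)  # same clock, more voltage

print(f"power at 1.0 V: {base:.1f} W")
print(f"power at 1.2 V: {fast:.1f} W ({fast / base:.0%} of baseline)")
```

A 20% voltage increase costs about 44% more power, which is why "just turn up the pressure" is an expensive way to fill the glasses faster.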

Modern mass-market CPU chips have no ability to detect internal errors. They are designed and tested to work flawlessly at a particular speed. If they do make a mistake, it goes undetected, and the system may crash.

Because small flaws are always possible, the manufacturer tests each individual chip to make sure that all the circuits are working correctly. The chip may be designed to make an error only once a century, but you cannot test a chip for 100 years before selling it. So the manufacturer runs the chip at a lower-than-standard voltage and a higher-than-standard clock rate; these conditions would cause any marginal circuit to fail. If the chip tests OK under this stress, it can be sold to run at the standard voltage and clock rate with confidence that there will be no problems.
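As a rough illustration of that screening logic, here is a hypothetical sketch; the 10% stress factors and the toy self-test are invented for illustration and are not Intel's actual test procedure.

```python
# Hypothetical sketch of chip screening: stress the part past its rated
# operating point and ship it only if it still computes correctly.
# The 10% stress factors and the toy self-test are invented.

STRESS_VOLTAGE_FACTOR = 0.90  # 10% below nominal voltage (transistors slow down)
STRESS_CLOCK_FACTOR = 1.10    # 10% above nominal clock (less timing slack)

def passes_screening(run_self_test, nominal_voltage, nominal_clock_hz):
    """run_self_test(voltage, clock_hz) -> True if every circuit worked."""
    return run_self_test(
        voltage=nominal_voltage * STRESS_VOLTAGE_FACTOR,
        clock_hz=nominal_clock_hz * STRESS_CLOCK_FACTOR,
    )

# Toy usage: a fake chip with one marginal circuit that fails below 0.95 V.
flaky_chip = lambda voltage, clock_hz: voltage >= 0.95
print(passes_screening(flaky_chip, nominal_voltage=1.0, nominal_clock_hz=3e9))  # False
```

The flaky chip works fine at its nominal 1.0 V, but the stress test rejects it anyway, which is exactly the point: only parts with comfortable margin get sold.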

The margin needed to avoid errors appears to be about one third. That is, a typical CPU chip could run one third faster, or draw one third less power, and it would still work correctly almost all the time, generating only an occasional error.
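Plugging invented round numbers into that claim shows what the margin would buy:

```python
# Back-of-the-envelope arithmetic for the "one third" margin claim.
# The 3 GHz / 90 W baseline is invented for illustration.

nominal_clock_ghz = 3.0
nominal_power_w = 90.0

faster_clock = nominal_clock_ghz * (1 + 1 / 3)  # spend the margin on speed
lower_power = nominal_power_w * (1 - 1 / 3)     # or spend it on power savings

print(f"one third faster: {faster_clock:.1f} GHz")   # 4.0 GHz
print(f"one third less power: {lower_power:.0f} W")  # 60 W
```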

Roughly every two years, computer circuits shrink enough to double the number of circuits that fit on a CPU chip. Up to this point, designers have spent the extra circuits on more processing units, more cores, and more cache memory. The Intel research paper suggests that a future generation might benefit by using some of the extra circuits to check results and detect processing errors instead. When an error occurs, the extra circuits detect it, and the operation is re-executed in another part of the chip, or at a slower speed, to get the correct result.
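Here is a minimal, self-contained sketch of that detect-and-retry loop, using software stand-ins for the hardware units; the error rate and the toy add operation are invented for illustration.

```python
import random

# Minimal sketch of detect-and-retry. The "units" are software stand-ins:
# the fast unit occasionally flags a detected error, the safe unit always
# computes correctly. The one-in-a-million error rate is invented.

def fast_unit(a, b):
    """Aggressively clocked adder: rarely suffers a detected error."""
    if random.random() < 1e-6:
        return None, True            # (result, error_detected)
    return a + b, False

def safe_unit(a, b):
    """Conservatively clocked adder: always correct, but slower."""
    return a + b, False

def checked_add(a, b):
    result, error_detected = fast_unit(a, b)
    if error_detected:               # the checking circuits fired, so...
        result, _ = safe_unit(a, b)  # ...re-execute to recover the answer
    return result

print(checked_add(2, 3))  # always 5, even when the fast unit misfires
```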

A CPU chip executes billions of operations a second, and today it simply assumes they all execute correctly. If a few operations failed each minute, but each failure were detected and the instruction re-executed automatically, the retries would not be noticeable. Such a design would let a desktop machine run at a clock speed one third faster, or a server run at a power draw one third lower, than a conventional CPU without error detection.
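A quick back-of-the-envelope check, with invented figures, shows why the retries would be invisible:

```python
# Why retries are invisible: invented but plausible round numbers.

ops_per_second = 3e9        # a 3 GHz core retiring one operation per cycle
failures_per_minute = 10    # detected errors needing a retry
retry_cost_ops = 100        # invented cost of one re-execution, in operations

total_ops_per_minute = ops_per_second * 60
overhead = failures_per_minute * retry_cost_ops / total_ops_per_minute

print(f"retry overhead: {overhead:.1e}")  # ~5.6e-09 of total work
```

Even with a generous cost per retry, the overhead is a few billionths of the machine's work each minute.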

Today, the only commercially available CPU system with error recovery is the IBM Z-Series mainframe, whose hardware is hundreds of times more expensive than a typical desktop system. The Intel research paper presents an idea that is genuinely "outside the box" compared with previous CPU chip design. Instead of the simple brute-force "more and more of the same" approach of previous generations, it proposes a sensible new direction that may be a bit more complex, but promises tangible improvements.

We should not expect to see this type of design change for at least two to four years (based on the typical Intel design cycle). Keep an eye out for future announcements in this direction.