The GPU (Specialized Processors)

Specialized processors do one type of operation very fast on large blocks of data. The Graphics Processing Unit (GPU) on the video card can do certain tasks 10 times faster than the CPU.

Every CPU that has ever been built has the basic instructions to do arithmetic, move data around in memory, compare and test numbers, and run programs. No CPU is smarter than another in the sense that it can do something the others cannot; processors are simply faster or slower at the same tasks. This means that your desktop, laptop, or cell phone has the instructions needed to do word processing or to solve complex physics problems. Twenty years ago, every Fortune 500 corporation did all of its data processing on a computer that was slower and had less memory than the cheapest system you can buy at Costco today. In fact, your iPod may be more powerful.

However, a small number of problems require so much power that they strain an ordinary computer. Decompressing the HD movie on a Blu-ray disc fast enough to display it on the screen at the proper speed can max out a dual-core processor. Fortunately, there is another type of processor that can handle the problem more efficiently.

If you buy a loaf of bread, you can cut it into slices with a knife. That is fine if you want to create some thin slices and some thick slices. The bakery has a machine with dozens of evenly spaced slicers. You put the bread in the machine and it cuts it into slices in one operation. If you have to cut one slice, you need a knife. If you need to slice hundreds of loaves of bread, you need the machine.

In a restaurant, a “dishwasher” may be a guy who washes dishes or a machine where you stack the dishes and turn it on. The guy washes dishes one at a time, but he can do other jobs like cleaning the floor or taking out the garbage. The machine washes a hundred dishes at once “in parallel”, but it can only perform this one job.

If you go back ten or twenty years, specialized processors were sold on add-in boards to perform specific operations. The sound processing chip on an audio card has always provided better and more complex audio processing than you could get from the CPU alone. When DVD movies first came out, you needed a separate board with a specialized chip to decode them because the Pentium CPU wasn’t fast enough. Subsequent generations of general-purpose processors eventually solved each specific problem that had justified a particular board, but specialized processors will always be faster at specific types of computing.

Displaying 3D video games or decoding HD TV movie streams is done more efficiently with the Graphics Processing Unit (GPU) on a modern video card. Just as Walt Disney assigned different artists to draw the foreground characters, draw the backgrounds, and fill in the colors, video cards of just a few years ago had separate specialized circuits for generating textures and shading. Around the time that Vista came out this approach changed, and a modern video card can have from 64 up to 800 identical “unified” processor circuits that can be assigned to perform any of the graphics tasks.

A video card can perform certain types of repetitive processing 10 or 20 times faster than even a quad-core CPU. However, Intel did not entirely give up on its CPU design. Each new generation of CPU includes an expanded set of instructions to do bulk processing on blocks of data. These SSE, or SIMD (Single Instruction, Multiple Data), circuits speed up multimedia processing for sound and video decoding. While the CPU has a little of this SIMD capability, the GPU on your video card has massively more of it.
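
To make the SIMD idea concrete, here is a rough sketch in C-style code using Intel’s SSE instructions through compiler intrinsics. The function names add_scalar and add_sse are invented for illustration; this is not code from any particular product.

    #include <xmmintrin.h>  /* SSE intrinsics: an __m128 holds four packed floats */

    /* Scalar version: one addition per loop iteration. */
    void add_scalar(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* SIMD version: one SSE instruction adds four floats at once.
       Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
    void add_sse(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&out[i], _mm_add_ps(va, vb)); /* 4 additions in one step */
        }
    }

The second loop makes a quarter as many trips because each instruction operates on a block of data, which is the whole point of SIMD. The GPU takes the same idea and multiplies it by hundreds of processing circuits.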

Quick History Recap

Every computer instruction in every program has to go through a sequence of steps. The bits have to be decoded to determine which instruction is being requested, the computer may have to calculate the memory address of the data, the data has to be fetched, and then the operation can be performed. The first IBM PC CPU chip (the Intel 8088) had just enough circuitry to execute instructions, but it had to reuse some of the same circuits for each phase of an instruction, so it took many clock cycles to execute every instruction.
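
To make those phases concrete, here is a toy sketch of the fetch, decode, and execute cycle written as ordinary C code. The three-instruction machine it simulates is invented for illustration and is far simpler than the real 8088.

    #include <stdio.h>

    /* A made-up machine: each instruction is one byte of opcode
       followed by one byte of operand address. */
    enum { OP_LOAD = 0, OP_ADD = 1, OP_HALT = 2 };

    int main(void)
    {
        unsigned char memory[16] = {
            OP_LOAD, 10,   /* load the value stored at address 10 */
            OP_ADD,  11,   /* add the value stored at address 11  */
            OP_HALT, 0,
            0, 0, 0, 0,
            5, 7           /* data at addresses 10 and 11 */
        };
        int pc = 0;            /* program counter */
        int accumulator = 0;

        for (;;) {
            int opcode  = memory[pc];      /* fetch and decode the opcode */
            int address = memory[pc + 1];  /* calculate the data address  */
            pc += 2;

            if (opcode == OP_LOAD)         /* fetch the data, then operate */
                accumulator = memory[address];
            else if (opcode == OP_ADD)
                accumulator += memory[address];
            else
                break;                     /* OP_HALT */
        }
        printf("result = %d\n", accumulator);  /* prints 12 */
        return 0;
    }

Every trip through the loop repeats the same phases: decode the opcode, work out where the data lives, fetch it, and perform the operation. A CPU with only one set of circuits has to step through those phases one after another, which is why the 8088 needed many clock cycles per instruction.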

Moore’s Law says that the number of transistors doubles every 24 months. Over the first dozen years (the 8088, 286, 386, and 486 CPU chips), roughly six doublings, Intel used the extra transistors to create specialized circuits to perform each phase of the instruction decode. The 486 chip could execute one instruction per clock cycle. Intel then duplicated all the processing units so that the first-generation Pentium chip could execute two instructions per clock cycle.

The Pentium represented the last simple CPU chip design, in which the instructions of a program execute in the order in which they are written. After the Pentium, subsequent generations of CPU chips use 2, 4, or 8 times as many circuits to opportunistically rearrange the order in which instructions execute or to predict whether the program will jump to a new location in memory. Each doubling of circuits makes the CPU faster, but with diminishing returns.

That is why Intel returns to the Pentium design every time it needs to create a specialized device. If you shoot a satellite into orbit, the computer chip that is hardened to survive radiation in space is a version of the original Pentium design rather than one of the later generations. The Intel Atom CPU is a very low-power version of the Pentium design.

Alternative Designs

When Intel was looking for a new type of Graphics Processing Unit, it experimented with a design (Larrabee) that had up to 80 Pentium cores on a single chip. Each core had SIMD instruction capability and could run one of several simple shared programs to perform a specific graphics function. Larrabee was cancelled before it became a product, but it remains a possible future Intel design.

The Sony PS3 is powered by the IBM Cell processor. It has one general-purpose core that runs the operating system and programs, plus six specialized bulk-processing cores that do the calculations for the graphics and game logic.

AMD has been talking for several years about Fusion. The idea is to create a single chip with one or two CPU cores plus a GPU-style array of hundreds of specialized graphics processing circuits. This could reduce the chip count, and therefore the cost, of a mid-range system, or it could reduce the power consumption of a laptop.

The problem is that each of these specialized designs requires customized programming, and application programmers do not have the time to redo this complex work for every design. AMD and Nvidia have each released their own programming interfaces and software development kits that allow applications to use some of the processing power in the video card. The big players (Intel and Microsoft) may produce a single standard interface, which would be more attractive to software developers.
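
As a sketch of what such a kit looks like in practice, the fragment below uses Nvidia’s CUDA interface to hand a simple bulk calculation to the video card. The kernel name scale and the sizes chosen are invented for illustration, and real code would also check each call for errors.

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Each GPU thread scales one element; thousands of threads run at once. */
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;                    /* about a million floats */
        float *host = new float[n];
        for (int i = 0; i < n; i++) host[i] = 1.0f;

        float *device;
        cudaMalloc(&device, n * sizeof(float));   /* allocate memory on the card */
        cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(device, 2.0f, n);   /* run on the GPU */
        cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

        printf("first element = %f\n", host[0]);  /* prints 2.000000 */
        cudaFree(device);
        delete[] host;
        return 0;
    }

The CPU only sets up the data and launches the kernel; the card’s hundreds of unified circuits each run the same tiny program on a different slice of the array, which is exactly the kind of bulk work a GPU is built for.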

Multiple GPUs

Nvidia “SLI” and AMD “CrossFire” allow the user to plug more than one video card into a computer to increase the number of specialized video processing circuits available for 3D computation. If one GPU has 800 unified circuits, then two video cards could bump that to 1600. However, in order to coordinate processing, the two cards have to be able to communicate at high speed. Nvidia uses a “bridge” cable that connects the two cards, while AMD uses the mainboard PCIe bus.

The normal chipset on a mainboard provides enough PCIe lanes to support a single video card. To add more video card slots, the mainboard vendor must add an additional chip. A third-party company (Lucid) produces a specialized chip (HYDRA) that not only provides the additional PCIe slots but also translates between the Nvidia and AMD protocols, so the GPUs do not all have to come from the same vendor.

SLI and CrossFire are used for video games and for certain types of engineering applications that are written to use GPU processing power. The rest of us need only modest GPU processing, which may be provided by a single card or even by integrated video. Normally, when you use the integrated video on the mainboard, you plug your monitor into the DVI/HDMI/DP connectors on the board itself. When you plug in a video card, it replaces the integrated mainboard video and you have to move the cables to the sockets on the back of the card.

A hardware feature called “Hybrid” video can change this. With Hybrid video, the integrated video on the mainboard remains active and the monitors stay plugged into the mainboard sockets even after a video card is added. If you are just browsing the Web or viewing ordinary video files, the video card is turned off to save power and the integrated video handles the load. As soon as you start an application that needs additional power, the card is turned back on and takes over the heavy-duty video processing.

However, newer generations of video cards have adaptive power use of their own: the card turns off sections of its processors that are not needed and reduces power consumption during idle periods.

All of these technologies are in play, and it is difficult to predict which will succeed and which will fail in future generations of computers.