Hyperthreading and Multi-Core | pclt.sites.yale.edu

When you browse books at Amazon, thousands of other Web users are also doing the same thing. Every time you click on a link or push a button, the Server computer has to perform some small amount of work on behalf of your Browser. Dozens of other users happen to have also clicked on something at the same time, and all these requests are in the Server being processed at the same time. The work that the server has to do on behalf of any one Browser request is a “thread”, a sequence of instructions needed to perform a specific unit of work largely independent of other units of work active at the same time. Even in a program on a single desktop machine, there can be independent threads of work. In the browser, the operations needed to display each individual picture or ad on the screen can be a thread. A CPU can do a small amount of work on one thread, set it aside temporarily, and do a small amount of work on the next thread. However, modern CPU chips have lots of transistors, and they have been designed to execute two or more threads at the same time.

Every Web page you view has separate columns. There are some pictures, some ads, maybe a link to a YouTube video. In reality, every ad, picture, and video is a separate file, often from a separate file server somewhere in the Internet. The Browser has to download each of these files separately, compose the bytes of the file into text or pictures, and then arrange them on the page. It can do each of these things one at a time, but there is no particular order required for downloading the components that will make up the final page.

So a modern Browser (and most other modern applications) divides its work up into “threads”. Within a thread there is a specific order to things. When you are downloading the data for a picture, it will start in the upper left corner and then proceed right and down. However, there is no particular order to the page as a whole. You can download the picture in the upper right corner of the page, or even start with a picture that will end up in the middle.
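The idea of one thread per page component can be sketched in a few lines of Python. This is only an illustration, not how any real Browser is written: the component names, the simulated download times, and the download and load_page functions are all made up for the example, and time.sleep stands in for waiting on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical page components with simulated download times in seconds.
COMPONENTS = {"banner_ad": 0.03, "photo": 0.05, "video_thumbnail": 0.02}


def download(name, seconds):
    """Stand-in for a network fetch: wait, then return some fake bytes."""
    time.sleep(seconds)
    return name, b"x" * 10


def load_page(components):
    """Fetch every component on its own thread and collect them as they finish."""
    page = {}
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        futures = [pool.submit(download, n, s) for n, s in components.items()]
        # as_completed yields results in whatever order the downloads finish.
        for future in as_completed(futures):
            name, data = future.result()
            page[name] = data
    return page


page = load_page(COMPONENTS)
print(sorted(page))
```

Within download there is a fixed order of steps, but load_page does not care which component arrives first, which is the point of the paragraph above.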

Some kind of threads existed from the beginning of Windows. Different devices and different network servers operate at different speeds. If your DSL line or cable modem connects you to the Internet at 500K bytes per second, and one of the network servers is delivering you the bytes of a picture at 30K bytes per second, then you have 470K bytes per second of Internet capability free to download other pictures and files. However, in the early days, a computer could only do one thing at a time, so it “multi-tasked”. Like an office worker shuffling paper, responding to questions from other workers, and answering the phone when it rings, the computer did some processing for one thread until that thread ran out of data and had to wait for more to arrive from the network, then the computer switched to process another thread, and so on.
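The bandwidth arithmetic in this example is easy to check. The sketch below uses the two figures from the text and adds one assumption that is not in the original: that every other server also delivers at 30K bytes per second.

```python
# Figures from the text, in K bytes per second.
LINK_SPEED = 500
ONE_SERVER_SPEED = 30

# Capacity left over while a single slow download is running.
spare_capacity = LINK_SPEED - ONE_SERVER_SPEED

# Assumption for illustration: every server delivers at the same 30K rate.
# This is how many such downloads fit on the link at once.
downloads_to_fill_link = LINK_SPEED // ONE_SERVER_SPEED

print(spare_capacity, downloads_to_fill_link)
```

Under that assumption, it takes more than a dozen simultaneous downloads before the line itself becomes the limit, which is why a Browser that handles one file at a time wastes most of the connection.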

Eventually, Intel and AMD ran up against a limit on CPU clock speed. They could not make the next generation of CPUs run as fast as they wanted, but since the size of computer circuits drops in half every two years they were able to produce a “dual-core” CPU that was really two CPUs in a single chip. Now the CPU chip could run two threads at the same time, one on each of the internal processing cores.

Single Core, Hyperthreading, Shared Resources, Dual Core

There are several intermediate steps between a single CPU core that can run one program at a time and a dual Core processor that has two of everything and can run two things at a time at full speed.

The first idea was introduced by Intel and is called Hyperthreading. The CPU is fast, but memory is slow to respond. When the CPU requires instructions or data to be fetched from the memory chips, it has to wait doing nothing at all for a period of time in which it could have executed hundreds of instructions before the first byte of data comes in from memory. While this period of time is long from the point of view of the CPU, it is far too short for the operating system to respond. If the CPU is to be given something else to do to use up its otherwise idle time waiting for memory, that extra work has to already be loaded into the CPU ready to run.

With Hyperthreading, a CPU core tells the operating system that it is two processors able to run two threads at the same time. If the operating system has enough active programs, it assigns two threads to the core and loads their data and current instruction pointer into the core. In reality, the core can only run one program at a time, but now it has two different programs to choose from. If both programs are running normally, it executes some instructions for the first program and then some instructions for the second, switching back and forth. However, if one of the two programs needs data from memory, then while that program is blocked by the missing data, the core can be dedicated to run the other program. When both programs are blocked needing data from memory, then the CPU becomes idle until memory responds. However, with Hyperthreading this happens less frequently and the CPU is more efficient.
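A toy model makes the benefit visible. The sketch below is not how a real core works; it is a deliberately simplified simulation in which each program is a string of one-cycle operations, "C" for compute and "M" for a memory request, and the run_core function and the four-cycle memory delay are invented for the illustration. The core runs one operation per cycle, alternates between the programs loaded into it, and sits idle only when every loaded program is waiting on memory.

```python
def run_core(programs, memory_latency=4):
    """Simulate one core cycle by cycle; return (total_cycles, idle_cycles)."""
    count = len(programs)
    position = [0] * count   # index of the next operation in each program
    ready_at = [0] * count   # first cycle at which each program may run again
    cycle = idle = 0
    last = -1                # program that ran most recently
    while any(position[t] < len(programs[t]) for t in range(count)):
        # Switch back and forth: look first at the program after the last one run.
        order = [(last + 1 + k) % count for k in range(count)]
        runnable = [t for t in order
                    if position[t] < len(programs[t]) and ready_at[t] <= cycle]
        if runnable:
            t = runnable[0]
            op = programs[t][position[t]]
            position[t] += 1
            if op == "M":
                # The program is blocked until memory responds.
                ready_at[t] = cycle + 1 + memory_latency
            last = t
        else:
            idle += 1        # every loaded program is waiting on memory
        cycle += 1
    return cycle, idle


PROGRAM = "CCMCC"            # hypothetical workload
alone = run_core([PROGRAM])
paired = run_core([PROGRAM, PROGRAM])
print("one program loaded: ", alone)
print("two programs loaded:", paired)
```

In this model, running the program twice in a row costs twice the single-program cycle count, while loading both copies at once finishes sooner because one program computes during part of the other's memory wait. Both programs still stall together some of the time, so the idle cycles shrink but do not vanish.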

To implement Hyperthreading, the CPU only has to duplicate the part of its circuitry that holds the data and the current instruction location of the currently running program.

In August 2010 AMD announced its “Bulldozer” architecture. It provides an intermediate step between Hyperthreading and duplicating the entire computer core. It makes sense if you remember something about computer history.

There are two types of numbers in a computer. Integers are whole numbers (1,2,3,…) and currency where there is a fixed decimal point ($12.37 can be regarded as 1237 pennies). Floating point numbers have long fractional parts and are used in scientific calculations (3.14159).
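The difference shows up even in a few lines of Python. This sketch keeps a price as a whole number of pennies, as the text describes, and then does one small floating point sum for contrast. The variable names and the choice of multiplying by three are just for illustration.

```python
# $12.37 held as an integer number of pennies, as described above.
price_in_pennies = 1237
total_pennies = price_in_pennies * 3

# Put the decimal point back only when displaying the result.
dollars, cents = divmod(total_pennies, 100)
exact_total = f"${dollars}.{cents:02d}"
print(exact_total)

# Floating point trades exactness for range and fractional precision.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)
```

The integer arithmetic is exact, which is what business calculations need. The floating point sum is extremely close to 0.3 but not identical to it, which is acceptable in scientific work and a nuisance when counting money.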

In the 1960’s the first generation of IBM mainframes had optional floating point hardware because business calculations were almost exclusively integers. In the 1980’s, the first generation of IBM PCs had an optional floating point coprocessor chip, again because floating point was needed only for scientific applications. However, with the number of transistors on a chip doubling roughly every two years it eventually made sense to make floating point a standard part of every CPU rather than offering it as a separate option.

Floating point hardware, including the SSE and other specialized instructions used by multimedia and games, takes up a lot of space in each CPU core. Integer processing is much simpler and smaller. While this hardware is the most important component of supercomputers that do weather forecasting or analyze data from an atom smasher, it is almost never used in the server computer at Amazon that you use when you are browsing books or CDs.

So AMD will be shipping a new generation of server CPU chips that allow the operating system to assign extra programs to a CPU core, but unlike Hyperthreading provide each such program the ability to execute concurrently and at full speed everything except floating point instructions. A floating point processor will be in each core, but it will be shared by the threads that are running in it and they will have to take turns using it.
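The trade-off can be shown with another toy model. This is a rough sketch of the sharing idea only, not a description of the actual Bulldozer hardware: the run_module function is invented here, each program is a string of one-cycle operations ("I" for integer, "F" for floating point), every thread has its own integer unit, and all threads take turns on a single floating point unit.

```python
def run_module(programs):
    """Return the cycles needed when each thread has its own integer unit
    but all threads share one floating point unit."""
    count = len(programs)
    position = [0] * count   # index of the next operation in each program
    cycles = 0
    first = 0                # thread with first claim on the FPU this cycle
    while any(position[t] < len(programs[t]) for t in range(count)):
        fpu_busy = False
        for k in range(count):
            t = (first + k) % count
            if position[t] >= len(programs[t]):
                continue     # this thread has finished
            if programs[t][position[t]] == "F":
                if fpu_busy:
                    continue # wait a cycle for the shared FPU
                fpu_busy = True
            position[t] += 1
        first = (first + 1) % count   # take turns claiming the FPU
        cycles += 1
    return cycles


integer_work = run_module(["IIIIII", "IIIIII"])
floating_work = run_module(["FFFFFF", "FFFFFF"])
print(integer_work, floating_work)
```

Two integer-only threads finish in the same time as one, because nothing is shared. Two floating-point-only threads take twice as long, because they spend every other cycle waiting for their turn.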

This is a catastrophically bad design for scientific supercomputers, but an excellent idea for Web or database servers. It gets more work done in the same amount of chip space, and reduces the power, heat, and real estate consumed by extra floating point circuits that don’t get used by business applications.

Program Threads

Consider the problem of cooking for a big dinner party. Each dish has its own recipe. You could follow the instructions in one recipe until that one dish is done, then set it aside and start the next dish. Unfortunately, it would take several days to cook the dinner, and everything would come out cold. Fortunately, there are long periods of time when something sits in the oven, and while it is cooking you can prepare one or two other things.

A sequence of instructions to do one thing is called a “recipe” in the kitchen, and a “thread” in computer programming. A computer user intuitively understands the behavior of threads when running several programs on the screen, or when listening to an MP3 file in the background while typing a letter into the word processor. Even a single program can make use of threads. The Browser has separate threads for every file or image you are downloading, and it may assign a separate thread to decode each image or banner ad that appears on the screen when you visit the New York Times web site.

Some short operations have a very high priority. For example, a pot of rice you just started has to be checked every 30 seconds or so to see if it has come to a full boil. At that point the heat can be turned down, the pot can be covered, and now you can forget it for 15 minutes. However, if you don’t check it regularly at first, it will boil over, make a mess on the stove, and you have to start over.

Computer programs also assign a priority to their threads. As with cooking, high priority can only be assigned to trivial tasks that can be accomplished in almost no time at all. Just as a kitchen has to have timers, and a beep when the microwave is done, so the operating system has to have support for program threads and the ability to connect them to timers and to events signaled when data arrives from the network or another device.
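The kitchen timer has a direct counterpart in most thread libraries. As a minimal sketch, Python's standard threading module provides a Timer that runs a function after a delay and an Event that one thread can wait on until another signals it. The names data_arrived and on_timer and the 0.05 second delay are made up for the example, and the sketch covers only timers and events, not priorities.

```python
import threading

# The "beep": an event that a waiting thread can block on.
data_arrived = threading.Event()


def on_timer():
    """Runs on the timer's own thread when the delay expires."""
    data_arrived.set()


# Start the kitchen timer; 0.05 seconds is an arbitrary illustrative delay.
timer = threading.Timer(0.05, on_timer)
timer.start()

# Meanwhile the main thread is free to do other work.
chopped = sum(range(1000))

# Now wait for the beep, giving up after two seconds if it never comes.
signaled = data_arrived.wait(timeout=2.0)
print(signaled)
```

The main thread does not have to keep checking the clock. It does its other work, then sleeps until the event is signaled.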

In the kitchen, each task you perform has its own set of tools. To chop carrots, you need a knife and a cutting board. To take something from the oven, you need oven mittens. It takes some small amount of time to set down what you are doing and change. If you don’t change, you will find it is very difficult to cut carrots while wearing oven mittens.

Each thread in the computer stores its status and data in the CPU chip. To switch threads, the operating system has to take this data out of the CPU, store it away, and load up data for the other thread. Switching from one thread to another takes a few hundred instructions, but this is not a problem when the CPU can execute billions of instructions a second while a hard drive or network performs only about 30 operations per second. The overhead of thread switching for I/O is trivial.
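To see how small this overhead is, plug in some numbers. The text says only "a few hundred" instructions and "billions" per second, so the 300 and 2 billion below are assumed values picked for illustration, as is the idea of one thread switch per I/O operation.

```python
INSTRUCTIONS_PER_SWITCH = 300              # assumed: "a few hundred"
INSTRUCTIONS_PER_SECOND = 2_000_000_000    # assumed: "billions"
IO_OPERATIONS_PER_SECOND = 30              # figure from the text

# Assume one thread switch for every I/O operation.
instructions_spent_switching = INSTRUCTIONS_PER_SWITCH * IO_OPERATIONS_PER_SECOND
overhead_fraction = instructions_spent_switching / INSTRUCTIONS_PER_SECOND

print(instructions_spent_switching, overhead_fraction)
```

With these numbers the CPU spends a few thousand instructions a second on switching out of billions available, well under a thousandth of one percent of its time.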

If it is a big complicated dinner that one person simply cannot get done in time, you need some help. Specific tasks can be assigned to different people. The threads don’t change. The bread is still cooked the same way whether there is one person in the kitchen or two. With two people, however, one can chop carrots while the other peels potatoes.

A single core CPU runs only one thread at a time. However, the CPU runs so fast that it can switch threads hundreds of times a second. From the point of view of a human user, many different programs are running at once. If the human could speed up to match the speed of the computer, he would see that only one thing is running at a time. However, many current computers support more than one CPU core. Each core is an independent CPU and can run any thread in the system. When threads are ready to run, the operating system will assign one thread to every available CPU core. Now two or more threads are actually running concurrently.
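A program can ask the operating system how many processors it has to work with. In Python the standard call is os.cpu_count, which reports logical processors, so a core that presents itself as two processors is counted twice. The fallback to 1 when the count is unavailable is a choice made for this sketch.

```python
import os

# Number of logical processors the operating system reports, or None if unknown.
reported = os.cpu_count()

# Fall back to a single processor when the count cannot be determined.
logical_cpus = reported if reported is not None else 1

print(f"The operating system can run {logical_cpus} thread(s) at the same instant.")
```

This only reads the count. How many of a program's threads actually run at the same instant still depends on the operating system's scheduler and on the language runtime.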

Co(re)ordination

Two programs are running on your computer. While they mostly do different things, they may both store data on the same disk and they both display output on the same screen. Internally, the operating system must coordinate their concurrent access to shared resources. At the hardware level, each CPU core must coordinate access to memory and to the I/O devices.
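The same coordination problem appears inside a single program whenever two threads touch the same data. A common tool is a lock, which lets only one thread at a time through a critical section. The sketch below uses Python's threading.Lock; the SharedCounter class, the worker function, and the thread and iteration counts are invented for the example.

```python
import threading


class SharedCounter:
    """A counter that several threads may update safely."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may read, add, and store the value.
        with self._lock:
            self.value += 1


def worker(counter, times):
    for _ in range(times):
        counter.increment()


counter = SharedCounter()
threads = [threading.Thread(target=worker, args=(counter, 10_000))
           for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter.value)
```

Because every update happens while holding the lock, no increment is lost and the final count equals the number of threads times the number of increments each one made.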

In the old days, the Northbridge chip controlled access to resources. Then AMD came up with a better design. Inside their CPU chips there was an “XBar” component that connected the cores to each other and to the memory controller and I/O connections. This allowed the cores on a single CPU chip to coordinate their activity without the delay of communicating with an external chip. Finally, in mid 2009 Intel adopted a version of the same design for its chips.