Memory and “Burst”

Technology has been applied to increase memory speed only when it can be done without reducing size or increasing cost. Current mass market designs favor Double Data Rate SDRAM. When a CPU instruction requires data from memory, it presents the address and then has to wait several cycles. Once the first block of data has been located by the memory hardware, the 32 bytes immediately surrounding the address can also be transferred in a “burst” of activity. DDR memory transfers the data at twice the ordinary speed of the memory bus by transferring bytes on both the tick and the tock of the clock.

The newspaper ad offers a computer system with a “2.2 GHz CPU” and “512 Megabytes of RAM”. We are interested in the speed of the CPU and the size of the memory.

The CPU keeps active data in registers. The most recently accessed memory is in the Level 1 Cache, while slightly less recently accessed memory is in the Level 2 Cache, both of which are in the CPU chip. Some chips even have a Level 3 Cache. However, eventually the program will need data that is not in the CPU chip, and then it will have to go to the main memory plugged into the mainboard.

The CPU generates no signal when it has to wait for data to come in from the memory. The CPU appears to be busy, but suddenly a simple instruction to add two numbers takes as long as the CPU would normally need to execute hundreds of instructions. The delay to fetch data from memory is long by CPU standards, but it is still far, far less time than it would take the OS to switch to another program or thread.
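To put rough numbers on that claim, here is a back-of-the-envelope sketch, assuming the 2.2 GHz CPU from the ad above and a round 100 nanosecond trip to main memory (an assumed figure, not a measurement):

    cpu_hz = 2.2e9            # the 2.2 GHz CPU from the newspaper ad
    miss_latency = 100e-9     # assume a round 100 ns to fetch from main memory

    cycles_lost = cpu_hz * miss_latency
    print(f"one miss costs about {cycles_lost:.0f} CPU cycles")
    # roughly 220 cycles; with several instructions per cycle, that is
    # hundreds of instructions' worth of lost work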

That is, unless the other thread has already been loaded into the CPU. Starting in 2002, Intel has sporadically offered CPUs with the hyperthreading feature. The CPU appears to have twice as many cores as it actually has. The OS loads the registers and status of waiting programs into each apparent core. When one program has to stop to wait for data from memory, the other thread can take over and use the CPU until the data arrives. When both threads need data from memory, then the CPU stops doing useful work.
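A toy simulation makes the payoff visible. This is not how the hardware is actually built; it just assumes made-up numbers (20 cycles of useful work between misses, 200 cycles per miss) and shows that a second hardware thread roughly doubles the useful work the core gets done:

    def simulate(total_cycles=100_000, run_burst=20, miss_cost=200, threads=2):
        # Each thread computes for run_burst cycles, then stalls miss_cost
        # cycles on a memory miss. The core runs any thread that is ready;
        # if every thread is waiting on memory, the core sits idle.
        ready_at = [0] * threads       # cycle when each thread can run again
        left = [run_burst] * threads   # compute cycles left before next miss
        busy, clock = 0, 0
        while clock < total_cycles:
            runnable = [t for t in range(threads) if ready_at[t] <= clock]
            if runnable:
                t = runnable[0]
                busy += 1
                left[t] -= 1
                if left[t] == 0:                 # this thread just missed
                    ready_at[t] = clock + miss_cost
                    left[t] = run_burst
                clock += 1
            else:
                clock = min(ready_at)            # idle until a thread wakes
        return busy / total_cycles

    print(f"1 thread : core busy {simulate(threads=1):.0%}")
    print(f"2 threads: core busy {simulate(threads=2):.0%}")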

There are many memory performance numbers quoted by the vendors, and almost all of them are misleading. As with all the other subjects discussed in PCLT, the best way to understand the subject is to walk through the process step by step and explain what is going on in each step.

Remember that the CPU internally saves the most recently referenced memory in various types of Cache. Like any storage mechanism, Cache has to have some unit of storage. It could save individual bytes, but that would be horribly inefficient. The interface between an Intel CPU and the mainboard chipset is 64 bits (8 bytes). This means that the CPU cannot request less than 8 bytes of data at a time, so that is the next obvious possible storage unit. However, in modern CPU chips even 8 bytes is too small. Instead, CPUs store in cache a line of 32 or 64 contiguous bytes read from memory.

If the CPU transfers 8 bytes of data per front side bus cycle through its socket on the mainboard, and the line of cache is 64 bytes, then it takes 8 consecutive data transfers across the front side bus to transfer the complete line. This type of transfer is called a burst.
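In code, the arithmetic looks something like this sketch, assuming the 64-byte line and 8-byte bus just described; the CPU rounds an address down to a line boundary and then bursts the whole line:

    LINE_SIZE = 64   # bytes per cache line
    BUS_WIDTH = 8    # bytes per front side bus data transfer

    def line_address(addr):
        # Round an address down to the start of its cache line.
        return addr & ~(LINE_SIZE - 1)

    addr = 0x12345
    print(hex(line_address(addr)))                        # 0x12340: start of the line
    print(LINE_SIZE // BUS_WIDTH, "transfers per burst")  # 8 bus transfers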

Memory is sold in DIMM modules, called a “stick of memory” in computer slang. The DIMM plugs into a memory bus that is also 8 bytes wide. Modern mainboards, however, have two parallel memory buses and therefore can transfer 16 bytes from memory in every data transfer cycle. The new Intel Core i7 processors have three parallel memory buses and transfer 24 bytes, but that is a strange feature, since the 32 or 64 bytes in a line of cache are not evenly divisible by 24.

The Intel CPU socket transfers data four times per clock tick. If the processor bus is 200 MHz, then the effective memory rate is 800 MHz (4x200), a number that is also called the Front Side Bus (FSB) speed. Since 2007, newer generations of Intel processors have clock speeds of 266, 333, or 400 MHz, producing FSB speeds of 1066, 1333, and 1600 MHz. However, all Intel processors before the Core i7 family transfer data through the Northbridge chip, which isolates the CPU FSB from the memory. So while the traditional CPU chip transferred 8 bytes of memory with an FSB of 800 MHz, the corresponding memory might transfer 16 bytes of data with a bus rate of 400 MHz. The two speeds match, in a manner of speaking, because the memory transfers twice as much data half as often.
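A quick sanity check of that claim, using the numbers from this paragraph:

    fsb_rate = 800e6     # effective FSB transfers per second (200 MHz x 4)
    fsb_width = 8        # bytes per FSB transfer
    mem_rate = 400e6     # effective transfers per second of the memory bus
    mem_width = 16       # bytes per transfer across two parallel memory buses

    print(f"FSB:    {fsb_rate * fsb_width / 1e9:.1f} GB/s")    # 6.4 GB/s
    print(f"memory: {mem_rate * mem_width / 1e9:.1f} GB/s")    # 6.4 GB/s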

This only applies to the transfer of consecutive bytes from the same region of memory. When the program tries to access data from a new memory location, there will be a delay called the “memory latency”.

A memory DIMM has at least one bank of memory (some have two or more banks). Each bank operates as an independent device. It has its own buffers and maintains a logical “position” in memory. The bank is typically described in terms of “rows” and “columns”, and I guess we are stuck with that terminology. The “row” is a chunk of contiguous memory (2K, 4K?) that can be held in the buffer at any one time. The memory controller selects a row by sending the high order half of the address, and the bank locates the row with this address and moves all the data from that row to the buffer. This requires a very long “memory latency” delay.
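A simplified sketch of the address split, assuming a hypothetical 2048-byte row (real controllers also fold bank and channel bits into the address):

    ROW_SIZE = 2048    # bytes per row; the "2K" guess above

    def split(addr):
        row = addr // ROW_SIZE    # high-order bits select the row
        col = addr % ROW_SIZE     # low-order bits select the column
        return row, col

    print(split(0x12345))    # (36, 837)
    print(split(0x12346))    # (36, 838): nearby addresses reuse the same row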

Suppose the row holds 2048 bytes of data. The cache line is only 32 or 64 bytes. Once the burst completes, it usually turns out that the next address the program requests is near the last one. If the next memory reference is to an address in the row already loaded into the buffer, then the memory will only have to wait a small number of cycles called the “CAS” delay. In modern 800 MHz memory, the CAS delay on commodity memory is typically 5 memory clock cycles. There is even better news, however, because the CAS delay can overlap some of the transfers that complete the previous burst. So if a program uses data compactly, it can keep the memory bus busy. If it jumps around wildly to random memory locations, then it will wait many, many cycles for each row of memory to load into the buffer, and performance will suffer.
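You can see the cost of jumping around even from a high-level language. This sketch touches the same large block of memory twice, once in order and once in shuffled order; exact results vary by machine and runtime, but the random pattern is usually noticeably slower:

    import array, random, time

    N = 5_000_000
    data = array.array("l", range(N))   # one large contiguous block of memory
    seq = list(range(N))                # indices in order
    rnd = seq[:]
    random.shuffle(rnd)                 # the same indices, in random order

    def timed_sum(indices):
        start = time.perf_counter()
        total = 0
        for i in indices:
            total += data[i]
        return time.perf_counter() - start

    print(f"sequential: {timed_sum(seq):.2f} s")   # cache and row friendly
    print(f"random:     {timed_sum(rnd):.2f} s")   # usually noticeably slower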

Now for a very fine point of mainboard design. There is no signal that the memory can send to ask the memory controller in the Northbridge to slow down or wait a minute. Every individual memory device has its own timings for transfer speed, CAS latency, row access, and so on. At power up, the memory controller obtains all these timings and parameters from a small table stored on each DIMM (the Serial Presence Detect, or SPD, data). After that, the memory controller is responsible for doing its own calculations about row access, CAS latency, pipelining, and all the rest. It must not send data before the memory is ready to receive it, and it must not expect data sooner than the memory is ready to deliver it.
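A sketch of the controller's side of that bargain, with made-up SPD-style numbers: with several DIMMs on a bus, it must choose settings slow enough for the slowest module. (Real controllers work in nanoseconds and convert to clock cycles, but the principle is the same.)

    # Hypothetical SPD-style parameters for two mismatched DIMMs on one bus.
    dimms = [
        {"cas": 5, "row_access": 15, "max_rate": 800},
        {"cas": 4, "row_access": 12, "max_rate": 667},
    ]

    settings = {
        "cas":        max(d["cas"] for d in dimms),        # longest wait wins
        "row_access": max(d["row_access"] for d in dimms),
        "max_rate":   min(d["max_rate"] for d in dimms),   # slowest bus wins
    }
    print(settings)    # {'cas': 5, 'row_access': 15, 'max_rate': 667}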

The good news is that, while you must use the particular type of memory (DDR, DDR-2, DDR-3) that your mainboard requires, you can always substitute faster memory than your system expects. When memory is new, faster memory sells at a premium. However, if you are upgrading a system you have had for a year or two, you will often find no price difference between different speeds of the older technology. Memory rated at the highest speed has tested as more stable and reliable, and it will still run at the slower speeds. A DIMM can tell the memory controller to use shorter timing values (like CAS latency) at a slower bus speed. If it doesn’t, however, the memory controller might pick up the higher latency numbers intended for the faster bus, use them on the slower bus, and degrade performance. You can override them with manual settings in the BIOS setup screens, or you can just buy upgrade memory that is identical to the original memory, even if it costs more.

While database servers can use seemingly unlimited amounts of memory, a modern desktop computer (even one running Vista) may have difficulty using more than 2 gigabytes of main memory. There simply are no desktop applications that require more memory. For years, vendors have used new generations of chip technology to reduce the number of chips on a DIMM and therefore the cost of memory. However, at some point the memory vendors may apply the extra transistors to modest performance gains.

A Few Terms

Synchronous

In the first generation IBM PC, DRAM memory transferred one unit of memory for every CPU request. The CPU presented an address, the memory responded with data. The CPU presented another address, the memory responded with another unit of data. This worked well because the CPU and memory ran at essentially the same speed.

An operation that can only proceed when the sender and receiver both indicate that they are ready is said to be “asynchronous”. It runs at the speed of the slower of the two ends. Baseball is an asynchronous game.  The pitcher can take his time, look at the runner on first, get signals from the catcher. If the batter needs more time he can step back out of the box, stretch, and rub something on his hands. Only when the batter is in the box and the pitcher starts his windup can we really expect a pitch.

There is another mode of operation represented by the pitching machine in a batting cage. The machine delivers balls regularly and mechanically, whether the batter is ready or not. When something happens at a regular rate, driven by a clock, then computer experts call it a “synchronous” operation.

We talk about modern memory as Synchronous DRAM (SDRAM), although strictly speaking this applies only to the transfer of consecutive blocks of 8 bytes during the burst. Remember, the CPU is actually reading a 32 or 64 byte line of cache from some address in memory. Between bursts, the CPU still decides when it needs the next chunk of data from memory and the memory waits patiently until it gets that signal.

Dynamic

Dynamic Random Access Memory (DRAM) stores data in an electronic component called a capacitor. A capacitor holds a certain amount of electric charge. It is commonly compared to a bucket of water. If the bucket is full, this represents a 1. If the bucket is empty, it is 0. The problem is that capacitors leak, like a bucket with a small hole in it. Over time, the full bucket becomes 3/4 full, then half full. So periodically the memory chip has to read all the data in a row and then rewrite it, thus refilling all the buckets that represent a 1. This is called a refresh cycle, and it slows down memory performance.
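A toy model of the leaky bucket, with an invented leak rate but the typical 64 millisecond refresh interval of real DRAM:

    LEAK_PER_MS = 0.005    # fraction of charge lost per millisecond (invented)
    THRESHOLD = 0.5        # below this, a stored 1 would be misread as 0

    charge = 1.0           # a freshly written 1: a full bucket
    for ms in range(1, 201):
        charge *= 1 - LEAK_PER_MS     # the bucket leaks a little every ms
        if ms % 64 == 0:              # refresh every 64 ms, a typical interval
            charge = 1.0              # read the row and rewrite it: refilled
        assert charge > THRESHOLD, f"bit lost at {ms} ms"
    print("every 1 survived, thanks to refresh")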

The alternative is Static Memory, which doesn’t require a refresh cycle. The problem is that Static Memory takes up 4 times the chip space, or, put another way, you get only 1/4 as much memory for the same amount of money. This has never seemed like a good trade-off, and no change appears likely any time soon.

DDR-1, DDR-2, DDR-3

The current standard is DDR-2 memory. DDR-2 memory is DDR-1 memory with a few slight tweaks. Superficially, it sounds like it runs a lot faster: DDR-2 memory can run at 800 MHz while DDR-1 typically stops at 400 MHz. However, if you look carefully you will often see (at affordable prices) that the CAS latency and other timing values for DDR-2 800 MHz are exactly twice the values for DDR-1 400 MHz. So it may be essentially the same memory connected to a faster bus. There are also some changes to the bus structure that make it run more efficiently. Each new generation is faster, but not a whole lot faster.
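Working that out with typical retail timings (CL 2.5 for DDR-1 400, CL 5 for DDR-2 800): twice the cycles at twice the clock is the same wait in nanoseconds.

    def cas_ns(cas_cycles, effective_mhz):
        clock_mhz = effective_mhz / 2      # DDR clock runs at half the data rate
        return cas_cycles / clock_mhz * 1000

    print(f"DDR-1 400, CL 2.5: {cas_ns(2.5, 400):.1f} ns")   # 12.5 ns
    print(f"DDR-2 800, CL 5:   {cas_ns(5, 800):.1f} ns")     # 12.5 ns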

Fully Buffered, Registered, etc.

A desktop mainboard has 4 memory slots representing two memory buses with two slots each. Some boards can accept 4 DIMMs with 2 gigabytes of memory per DIMM for a total of 8 gigabytes, but other boards max out at 4 gigabytes. If you try to run with the very highest memory speed the board supports, the memory controller may only be able to handle one DIMM per bus and max out at 2 gigabytes.

Meanwhile, there is no limit to the amount of memory that a serious Database Server can use. Mainboards designed to be used as servers may have more memory slots. Typically this memory will run one or two speeds slower than the fastest desktop memory and it will have some additional electronics (FB, registered) to allow reliable operation with more than just one or two DIMMs per bus.

What to Buy

If you plan on running Vista, you want a computer with 2 Gigabytes of RAM. You can buy 4 gigs of medium speed memory for $50. Getting the very fastest memory (highest clock speed and lowest latency) can cost noticeably more.

The memory vendors quote the fastest speed at which the memory operates, but they only test it with certain board configurations. The mainboard vendors quote the fastest clock speed at which their board operates, but they only test it with some memory. Sometimes there is fine print: the board supports the fastest memory speed, but only if you populate just two of the four memory slots.

If you insist on trying to get the very fastest memory speed possible, then you have to make sure the memory you want works with the board you want and read all the fine print. That is a lot of work for the privilege of spending a lot more money to get what turns out to be a very small performance benefit.

Or you can buy a lot of decent but cheap memory and use the money and time you save to do something else.

The most important memory feature is one that most desktop users are unable to select. Memory can come with ECC error checking. This will detect a problem in the memory itself, but it will also detect sporadic problems caused by mismatches with the board. If you don’t have ECC memory, then memory problems show up as corrupted data and cause your programs and OS to crash in all sorts of random ways. To use ECC, you not only have to get the feature in the memory stick, but it also has to be supported by the mainboard, and most mainboard vendors only support it on server configurations. ECC costs a few bucks more, but that is a lot cheaper than the hours or days you would spend tracking down a problem that initially appears to be a software problem but is ultimately resolved as a memory problem.
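For the curious, here is a toy Hamming(7,4) code showing the trick behind single-bit correction. Real ECC DIMMs use a wider code (72 stored bits protecting 64 data bits) that also detects double-bit errors, but the principle is the same:

    def encode(d):
        # Place 4 data bits at positions 3,5,6,7; set parity bits at 1,2,4
        # so that XOR-ing the positions of all 1 bits gives zero.
        code = [0] * 8                   # index 0 unused; positions 1..7
        for pos, bit in zip((3, 5, 6, 7), d):
            code[pos] = bit
        s = 0
        for pos in range(1, 8):
            if code[pos]:
                s ^= pos
        code[1], code[2], code[4] = s & 1, (s >> 1) & 1, (s >> 2) & 1
        return code[1:]

    def correct(word):
        # The XOR of set-bit positions is the "syndrome": zero means the
        # word is clean; anything else is the position of the flipped bit.
        code = [0] + list(word)
        s = 0
        for pos in range(1, 8):
            if code[pos]:
                s ^= pos
        if s:
            code[s] ^= 1                 # flip the bad bit back
        return [code[p] for p in (3, 5, 6, 7)]

    data = [1, 0, 1, 1]
    word = encode(data)
    word[4] ^= 1                         # simulate a bit flipped in memory
    print(correct(word) == data)         # True: the error was repaired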