Eight Bit Codes

The ASCII character set was based on a 7 bit code that AT&T used to transmit teletype messages. Modern computers, however, store data in an eight bit byte. Expanding the character set by another bit allows an additional 128 characters. However, because of telecommunications issues that are no longer meaningful, the International Organization for Standardization required that the first 32 new characters be reserved for communications control functions, just as the first 32 characters had been. This left room for 96 new graphic characters.
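The arithmetic above can be checked with a short Python sketch. The function name `classify` and the range boundaries written in hexadecimal are illustrative choices, not part of any standard's text:

```python
def classify(code: int) -> str:
    """Label a byte value as a control or a graphic position."""
    if code < 0x20 or code == 0x7F:
        return "control"   # the original 32 controls, plus DEL
    if 0x80 <= code <= 0x9F:
        return "control"   # the 32 newly reserved controls
    return "graphic"

upper_half = range(0x80, 0x100)   # the 128 values added by the eighth bit
new_controls = [c for c in upper_half if classify(c) == "control"]
new_graphics = [c for c in upper_half if classify(c) == "graphic"]

print(len(upper_half), len(new_controls), len(new_graphics))  # 128 32 96
```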

Western countries use the 26 letters of the Latin alphabet inherited from Rome. Other alphabets include Greek, Cyrillic, Arabic, Hebrew, and Thai. Some languages use the Latin alphabet but add additional accent marks (also called “diacritical” marks) to certain letters.

Within the limit of 96 characters, it is possible to add a second alphabet, or a reasonable number of accented Latin letters. Rather than creating separate character sets for individual languages, each standard groups several languages geographically. These standards form the ISO (International Organization for Standardization) 8859 family. Each standard in the family defines both a character set (such as “Latin 1”) and an assignment of one byte codes to each character. Remember, the first half of each of these standards is identical to ASCII:

  • Latin-1 (ISO 8859-1) contains all the characters needed for English, French, Spanish, Italian, German, Swedish, Icelandic, and basically all the other languages used in Western Europe.
  • Latin-2 (ISO 8859-2) contains the characters needed for Polish, Czech, Hungarian, Romanian, Croatian, Slovak, Slovenian, and other languages of Eastern Europe except for the Baltic states.
  • Cyrillic (ISO 8859-5) covers Russian, Bulgarian, Macedonian, and other Russian influenced languages.
  • Arabic (ISO 8859-6) covers North Africa and the Middle East.
  • Greek (ISO 8859-7)
  • Hebrew (ISO 8859-8)
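To illustrate the list above, the following Python sketch decodes one byte value under each of these standards. It assumes Python's built-in codecs (`iso8859_1`, `iso8859_2`, and so on) are faithful stand-ins for the standards; the byte value 0xF1 and the helper name `decode_everywhere` are arbitrary choices made for the example.

```python
import unicodedata

# Python codec names used here as stand-ins for the ISO 8859 parts.
PARTS = {
    "Latin-1": "iso8859_1",
    "Latin-2": "iso8859_2",
    "Cyrillic": "iso8859_5",
    "Arabic": "iso8859_6",
    "Greek": "iso8859_7",
    "Hebrew": "iso8859_8",
}

def decode_everywhere(byte_value: int) -> dict:
    """Decode a single byte under every standard in PARTS."""
    raw = bytes([byte_value])
    return {name: raw.decode(codec) for name, codec in PARTS.items()}

# The same upper-half byte names a different character in each standard.
upper = decode_everywhere(0xF1)
for name, ch in upper.items():
    print(f"{name:9s} U+{ord(ch):04X} {unicodedata.name(ch)}")

# The lower half (0x00 to 0x7F) decodes exactly as ASCII in all of them.
ascii_bytes = bytes(range(0x80))
lower_half_matches = all(
    ascii_bytes.decode(codec) == ascii_bytes.decode("ascii")
    for codec in PARTS.values()
)
print(lower_half_matches)
```

A byte stored in one of these codes therefore cannot be interpreted without knowing which standard was used to write it.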

Arabic requires a special note. There is no “printed” form of Arabic. It is, instead, a “cursive” text like handwriting. Arabic is written right to left, and most characters have up to four different forms based on their position in a word. For example:

  • ﮎ when the character appears by itself
  • ﮐ at the start of a word (connection to the next character on the left)
  • ﮑ in the middle of a word (connection from the right, continuing to the left)
  • ﮏ at the end of a word (connection from the right, ending decoration to the left)

The eight bit code of ISO 8859-6 assigns a single code to each character, regardless of its form. This is preferred for data stored on disk or in a database. However, for text in this code to be printed correctly, or even written correctly on the screen, the software must provide logic to determine the context of the character and select the correct display form. There are expanded character sets for Arabic that assign different code points to each presentation form of the character. They may have been important when a stream of bytes was transmitted to a dumb device that had to display the information. However, in the modern era of microprocessors, every device should be able to select the proper presentation form from the single eight bit code.
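The context logic described above can be sketched in Python. This is a deliberately simplified illustration, not a real text shaper: it works on Unicode code points rather than ISO 8859-6 bytes, the `FORMS` table covers only three letters, and it ignores ligatures, vowel marks, and non-Arabic characters. The names `FORMS` and `shape` are hypothetical.

```python
# Presentation forms per letter: (isolated, final, initial, medial).
# None means the letter has no such form (it never connects to the next letter).
FORMS = {
    "\u06A9": ("\uFB8E", "\uFB8F", "\uFB90", "\uFB91"),  # KEHEH
    "\u0628": ("\uFE8F", "\uFE90", "\uFE91", "\uFE92"),  # BEH
    "\u0627": ("\uFE8D", "\uFE8E", None, None),          # ALEF
}

def shape(word: str) -> str:
    """Replace each stored letter with the form its neighbors call for."""
    out = []
    for i, ch in enumerate(word):
        forms = FORMS.get(ch)
        if forms is None:
            out.append(ch)  # unknown character: pass through unchanged
            continue
        isolated, final, initial, medial = forms
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # The previous letter connects forward only if it has an initial form.
        joins_prev = prev in FORMS and FORMS[prev][2] is not None
        # This letter connects forward only if it has an initial form itself.
        joins_next = nxt in FORMS and initial is not None
        if joins_prev and joins_next:
            out.append(medial)
        elif joins_prev:
            out.append(final)
        elif joins_next:
            out.append(initial)
        else:
            out.append(isolated)
    return "".join(out)

shaped = shape("\u06A9\u06A9\u06A9")
print(" ".join(f"U+{ord(c):04X}" for c in shaped))  # U+FB90 U+FB91 U+FB8F
```

The stored text keeps one code per letter; only the display step chooses among the presentation forms.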

Each new international standard has to be compatible (where possible) with the previous standards, even when the original reason for those standards no longer applies to current technology. For example, the character whose code value is 127 is reserved as a control character named DEL. Back in the days of Teletype machines, information was punched on paper tape. If an operator hit the wrong key and punched the wrong character, there was no way to un-punch the holes in the tape. However, one could back the tape up one position, hit the DEL key, and punch out all 7 holes. DEL or 127 corresponds to the binary value of 1111111 (seven ones). So this code value was reserved as an additional control character. When information was transmitted or copied from one paper tape to another, the DEL characters on the input tape were skipped.
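The paper tape behavior can be modeled in a few lines of Python. This is only a sketch of the idea: the helper names `overpunch` and `copy_tape` are invented for the example, and a tape is represented as a sequence of 7 bit values.

```python
DEL = 0x7F  # 127, binary 1111111: all seven holes punched

def overpunch(code: int) -> int:
    """Punch all seven holes over an existing code. Holes can only be
    added, so whatever was there before becomes DEL."""
    return code | DEL

def copy_tape(tape: bytes) -> bytes:
    """Copy a tape, skipping every DEL position."""
    return bytes(code for code in tape if code != DEL)

tape = bytearray(b"HELXLO")
tape[3] = overpunch(tape[3])   # the operator rubs out the mistyped 'X'
print(bin(DEL), copy_tape(bytes(tape)))  # 0b1111111 b'HELLO'
```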

DEL and the other 64 control characters take up a lot of room in the limited range of values from 0 to 255 that can be stored in a single byte. In modern computers, where data is entered and corrected with screen editors, where information is transmitted over the Internet, where rich formatting is driven by HTML tags, and where printed pages are formatted by Windows or Macintosh desktop publishing, most of the control characters serve no useful purpose. The ISO standards cannot, however, reclaim the reserved code values.

This becomes a problem for languages that can fit into a one byte data space of 256 characters, but not into the ISO 8859 limit of 191 graphic characters (the 256 byte values minus DEL and the other 64 control characters). Vietnamese, for example, has too many different versions of accented characters and so is frequently stored in a non-standard eight bit code.

The video adapter design for the first IBM PC included strange graphic characters (smiley face, heart, spade, club, diamond, etc.) to use every one of the 256 possible byte values. Subsequent adapters could also be programmed with alternate arrangements of graphic characters. IBM referred to these as “Code Pages”, perhaps to avoid confusing them with standard character sets like 8859-1.

The current design of Windows preserves the idea of a Code Page, although Microsoft implements them as a proper superset of ISO standards. The ability to assign graphic characters as a substitute for the no longer meaningful control characters gives Microsoft the opportunity to support more languages, more quickly, and more completely than the slow moving standards process. Today, Web content from Latvia, Lithuania, and Estonia is as likely to be in the “WinBaltic” code page as in any ISO standard.
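As a rough illustration of this substitution, the Python sketch below lists the graphic characters that a code page places in the range 0x80 to 0x9F, which ISO 8859 reserves for control functions. It relies on Python's built-in `cp1252` and `iso8859_1` codecs as approximations of the Windows code page and the ISO standard; the helper name `graphic_extras` is made up for the example.

```python
import unicodedata

C1_RANGE = range(0x80, 0xA0)  # the 32 values ISO 8859 reserves for controls

def graphic_extras(codepage: str) -> dict:
    """Map each byte in C1_RANGE to the graphic character the codec assigns."""
    extras = {}
    for code in C1_RANGE:
        try:
            ch = bytes([code]).decode(codepage)
        except UnicodeDecodeError:
            continue  # position left unassigned by this code page
        if unicodedata.category(ch) != "Cc":  # skip control characters
            extras[code] = ch
    return extras

extras_1252 = graphic_extras("cp1252")
extras_latin1 = graphic_extras("iso8859_1")

# Above 0x9F the two encodings agree byte for byte.
upper = bytes(range(0xA0, 0x100))
same_upper_range = upper.decode("cp1252") == upper.decode("iso8859_1")

print(len(extras_1252), len(extras_latin1), same_upper_range)
print(f"0x80 -> U+{ord(extras_1252[0x80]):04X}")  # the euro sign in cp1252
```

Calling `graphic_extras("cp1257")` should give the corresponding picture for the Baltic code page, assuming that codec is available in the Python build at hand.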

Fifteen years ago the Latin 1 alphabet and the ISO 8859-1 standard covered most of the computers in use outside Japan. Eastern Europe was behind the Iron Curtain, and computer networks were mostly national in scope. So 8859-1 became the default character set in a wide range of applications and systems, from the PostScript printer system to the HTTP network protocol. In the last five years it has become clear that no eight bit code is broad enough to remain as a default.