Experts have developed a precise vocabulary to explain the issues of international text processing. Unfortunately, the developers of Web standards have not used these terms with precision or consistency. It is too late to go back and fix the errors, but the first step toward understanding the mess is to define the terms.
The term “character” refers to an abstract concept (and not a numeric computer code, a physical mark on paper, or a bit pattern displayed on the screen). A character is more about meaning than shape. For example, the capital letter G can be printed or written in cursive script. It may appear illuminated in a medieval manuscript. It is still the same letter and has the same meaning in words.
The various forms in which the character can be represented physically are given the technical name glyphs.
Uppercase “G” and lowercase “g” may be the same letter, but they are different characters because they have different semantics. Capitalization marks proper names and the start of sentences, and case is significant in Unix and C. The difference between the printed and cursive capital letters, however, is a matter of display and in computer systems is determined by the font and not the character.
Looks can be deceiving. Consider the characters “A”, “Α”, and “А”. The first is our capital letter A. The second is Greek and the third is Cyrillic. They are three distinct characters from different alphabets, although they look exactly the same. Σ is a character in the Greek alphabet, while ∑ is the mathematical sigma used to sum a set of terms in a mathematical equation.
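The distinction between look-alike characters can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only and do not come from the text above.

```python
import unicodedata

# Latin, Greek, and Cyrillic capital letters that render almost identically.
lookalikes = ["A", "\u0391", "\u0410"]

code_points = [ord(ch) for ch in lookalikes]
names = [unicodedata.name(ch) for ch in lookalikes]

for ch, cp, name in zip(lookalikes, code_points, names):
    print(f"{ch}  U+{cp:04X}  {name}")

# The Greek letter sigma and the summation sign are separate characters too.
sigma_names = [unicodedata.name("\u03A3"), unicodedata.name("\u2211")]
print(sigma_names)
```

Each of the three letters reports its own code value and its own official name, even though a font may draw them with the same glyph.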
Sometimes different characters can mean exactly the same thing, but are displayed differently. Consider the simple quotation mark ("). Word processors often replace it based on context with the separate left and right quotation mark characters (“ and ”). In some countries of Europe, however, quotations are delimited by a different form of quotation mark (« and »).
Characters can also be formed to represent typographical versions of combinations of characters. For example, ½ is a character called “the vulgar fraction one half”. It is a single character, as distinct from the three character sequence 1/2 that looks enough like it to pass for all practical purposes. The German character “ß” is a substitute for the two letter sequence “ss”, and the single character “œ” is used in some languages as a substitute for the two characters “oe”.
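As a rough illustration of these composite characters, the sketch below asks Python's `unicodedata` module for their names and compares lengths. The variable names are made up for the example.

```python
import unicodedata

half = "\u00BD"        # the single character ½
spelled_out = "1/2"    # three separate ASCII characters

half_name = unicodedata.name(half)
print(half_name, len(half), len(spelled_out))

# Python's default uppercase mapping expands ß to the two letters it stands for.
sharp_s_upper = "\u00DF".upper()
print(sharp_s_upper)

oe_name = unicodedata.name("\u0153")   # the single character œ
print(oe_name)
```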
However, we have been forced by first typewriters and then computers to make do with a limited number of characters. In some cases a single character has been used for two completely different meanings. Had the limit been less severe, these two meanings might be expressed today by two different characters.
For example, the dash “-” is sometimes used as a hyphen between words and syllables, though it is also used as the minus operator in mathematical expressions. Modern expanded character sets have several characters that are alternate forms of the hyphen, and other characters that are alternate forms of minus. However, the plain ASCII character “-” on every keyboard is ambiguous. Its meaning is determined by the context in which it is used.
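The ambiguity is visible in the official names. This small sketch looks up the ASCII dash alongside one dedicated hyphen and one dedicated minus sign from Unicode; the dictionary keys are labels invented for the example.

```python
import unicodedata

dash_like = {
    "ascii": "-",        # the key on every keyboard
    "hyphen": "\u2010",  # an unambiguous hyphen
    "minus": "\u2212",   # an unambiguous minus sign
}

dash_names = {label: unicodedata.name(ch) for label, ch in dash_like.items()}

for label, name in dash_names.items():
    print(f"{label:7} {name}")
```

The ASCII character carries both meanings in its name, while the two expanded characters each carry one.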
A character set is a collection of characters that can be entered, stored, or displayed by a program. Computers don’t provide support by the individual character. Instead, support for an entire set is installed in one operation.
The character set that everyone uses to design Web pages or program computers is most commonly called “ASCII”. This stands for the “American Standard Code for Information Interchange” and the “American” part shows that ASCII is very much a US standard. It contains 95 “graphic” characters (with the qualification that “blank” is regarded as a graphic character because it takes up space).
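The count of 95 can be reproduced with a short sketch. It assumes that Python's `str.isprintable()` matches the notion of a “graphic” character used here, which holds for the ASCII range because Python treats the blank as printable.

```python
# Collect the code values from 0 to 127 that Python treats as printable.
graphic = [chr(code) for code in range(128) if chr(code).isprintable()]

print(len(graphic))
print("".join(graphic))
```

The remaining 33 code values are control codes with no visible form.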
Thirty years ago, when computer equipment and communications were less powerful, some foreign language support was achieved by replacing some ASCII characters designated as “national use characters” with other characters needed in other languages. The most obvious target is “$” which could plausibly be replaced by the symbol for a different currency. If dollars remained important, then “#” was a popular second choice for the local currency. However, for at least the last 15 years all equipment and networks have been able to support larger sets with more characters. Today the full US ASCII character set is the universal starting point, and foreign character sets are created by expanding ASCII with additional characters rather than by substitution.
When there were tight limits on the number of characters, computer vendors created character sets that were targeted not just to specific languages, but also to specific countries. For example, an IBM character set for France was slightly different from the set for Canadian French. Now that computer networks connect every home user to every country in the world, such intense specialization is inefficient.
The same element typically appears in more than one language. For example, the cedilla mark under a c (ç) is most commonly recognized as a feature of the French language, but it is also used in Albanian, Catalan, Portuguese, and Turkish. It is more efficient to develop character sets that cover a broad range of related languages that share common characters.
There are standard character sets that provide all the characters needed for particular languages or regions. The most popular extended character set is called “Latin 1” and contains characters needed for Western languages:
Latin1 covers most West European languages, such as French (fr), Spanish (es), Catalan (ca), Basque (eu), Portuguese (pt), Italian (it), Albanian (sq), Rhaeto-Romanic (rm), Dutch (nl), German (de), Danish (da), Swedish (sv), Norwegian (no), Finnish (fi), Faroese (fo), Icelandic (is), Irish (ga), Scottish (gd), and English (en), incidentally also Afrikaans (af) and Swahili (sw), thus in effect also the entire American continent, Australia and much of Africa. The most notable exceptions are Zulu (zu) and other Bantu languages using Latin Extended-B letters, and of course Arabic in North Africa, and Guarani (gn) missing GEIUY with ~ tilde. The lack of the ligatures Dutch IJ, French OE and „German“ quotation marks is considered tolerable. The lack of the new C=-resembling Euro currency symbol U+20AC has opened the discussion of a new Latin0. [From “ISO 8859 Alphabet Soup”]
The Latin 1 character set is small enough that each character can be assigned a code and still stay within the limitation of one byte of storage per character. However, every decision to subset involves some controversy. There is not quite enough room in the set to support the major Western languages (French, Spanish, German, etc.) and still squeeze in both Icelandic and Turkish. Any rational empirical decision would note that there are 278 thousand people in Iceland compared to 66.6 million people in Turkey. In one of the most disgraceful examples of geographical bias, the designers of the Latin 1 set, which became the default for most computer applications, excluded Turkish in favor of Icelandic. Of course there is another Latin set (ISO 8859-9, also called Latin 5) that includes the Turkish letters and excludes the Icelandic ones, but it is not a widely used default.
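The trade-off can be observed with Python's built-in codecs. This is a sketch only: the helper `encodable` and the sample strings are invented for the example, and `iso8859_9` is assumed to be the "other Latin set" the paragraph has in mind.

```python
def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = {
    "french": "fran\u00E7ais",          # français
    "icelandic": "\u00FE\u00F0",        # þ ð
    "turkish": "\u011F\u0131\u015F",    # ğ ı ş
}

results = {
    (label, enc): encodable(text, enc)
    for label, text in samples.items()
    for enc in ("latin-1", "iso8859_9")
}

for key, ok in results.items():
    print(key, ok)

# One byte per character for text that fits in the set.
french_bytes = len(samples["french"].encode("latin-1"))
```

French fits in both sets, while the Icelandic and Turkish letters each fit in only one of them.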
Character sets provide a workable subset of characters for a particular set of countries. Some sets include one additional alphabet (Greek, Cyrillic, Arabic, Hebrew). Others include accented characters for a particular region.
Each character in a character set has to be associated with a number that can be stored in computer memory. There are a number of standards that both define the characters in the set and assign each character a number. However, strictly speaking, the selection of the characters in the set is one step and the assignment of codes is a separate step.
This distinction was more important 15 years ago when mainframes were a larger influence in computing. Due to historical accident, the IBM mainframes had developed different code assignments for characters than the standard used on personal computers and the Internet. IBM supported the Latin 1 character set, but inside the mainframe each character was stored with a different code value. The internal mainframe codes could be easily translated to the external standard when data was transmitted over the Internet, and Internet data could be translated back when it was stored on the mainframe.
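The separation between choosing characters and assigning codes can be sketched with two Python codecs that cover the same Latin 1 characters. The text above does not name a specific mainframe code page; `cp500`, one of the EBCDIC code pages shipped with Python, is used here purely as an assumption for illustration.

```python
text = "Caf\u00E9"  # Café

internet_bytes = text.encode("latin-1")   # codes used on PCs and the Internet
mainframe_bytes = text.encode("cp500")    # an EBCDIC code page, chosen for illustration

print(internet_bytes.hex(), mainframe_bytes.hex())

# Translating between the two is a lossless table lookup in each direction.
round_trip = mainframe_bytes.decode("cp500").encode("latin-1")
```

The same four characters produce two different byte sequences, and decoding one code page before encoding in the other recovers the original bytes.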
The last time that anyone screwed something important up in this area was during the design of the IBM PC. The Latin 1 character set had been clearly established, so the engineers knew what characters had to be displayed. The international standard code values were also available, but the engineers either did not know or ignored them. A story is told that with a deadline looming, the characters were more or less randomly assigned to code values during the airplane trip to the meeting when a final character table had to be presented.
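The mismatch can still be seen in Python, which ships a codec for the original PC code page. The sketch assumes that `cp437` is the code table the story refers to; the text itself does not name it.

```python
e_acute = "\u00E9"  # é

dos_byte = e_acute.encode("cp437")       # original IBM PC code page
latin1_byte = e_acute.encode("latin-1")  # international standard code

print(dos_byte.hex(), latin1_byte.hex())

# Reading a DOS byte as if it were Latin 1 yields a different character.
misread = dos_byte.decode("latin-1")
```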
This problem disappeared when DOS was replaced by Windows. In Windows, everything displayed on the screen is generated by software, so any or all standard character code systems can be supported natively. Portable software technologies, like Netscape and Java, are also based on international standards. So modern software now supports the code values assigned by formal standards.
Character data is stored on disk or transmitted over the network as a stream of bytes. When the characters are all ASCII, then the byte values stored or transmitted correspond exactly to the character code values. The letter “A” will be represented as a byte with a numeric value of 65, the code value assigned to that letter. This approach works as long as the character set is small enough that the highest code value is less than 256.
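A minimal sketch of this correspondence in Python, using an arbitrary sample word:

```python
word = "Web"

code_values = [ord(ch) for ch in word]      # character code values
stored_bytes = list(word.encode("ascii"))   # byte values as stored or transmitted

print(code_values, stored_bytes)
print(ord("A"))
```

For ASCII text the two lists are identical, which is exactly the property described above.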
However, the World Wide Web, in order to actually be “world wide”, requires a standard character set that includes all the world’s languages. Therefore, the HTML 4.0 standard is defined over the “Unicode” character set. Unicode supports every alphabet, every accented character, and even the ideographs of Chinese and Japanese. To do this, however, most characters are assigned a two byte code value, with some characters in less common sets requiring a four byte code.
The data of the page is Unicode, which means that nearly every character is represented by a two byte value. However, most of the characters are ASCII with a value less than 128. So to save space, Unicode is often stored on disk or transmitted over the Internet using the UTF-8 encoding. In a UTF-8 file:
- The ASCII characters less than 128 are transmitted in one byte. When you read the file, if the next byte is less than 128, it is an ASCII character. However, to make the ASCII characters shorter (one byte instead of two) some characters have to be made larger.
- The accented characters of European languages and the alphabets of Russian, Greek, Hebrew, and Arabic are transmitted in two bytes. If the first byte has a value of 128 or more, then it is the start of a multibyte sequence.
- Chinese, Japanese, and Korean characters that are Unicode two byte characters have to be transmitted as three bytes per character.
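The three cases in the list above can be checked with a short sketch. The sample characters and dictionary labels are chosen for the example only.

```python
samples = {
    "ascii": "A",
    "accented": "\u00E9",   # é
    "cyrillic": "\u0416",   # Ж
    "greek": "\u03A6",      # Φ
    "chinese": "\u4E2D",    # 中
}

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

# Value of the first byte: below 128 means a one-byte ASCII character.
first_bytes = {label: ch.encode("utf-8")[0] for label, ch in samples.items()}

for label in samples:
    print(f"{label:9} {utf8_lengths[label]} byte(s), first byte {first_bytes[label]}")
```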
UTF-8 has become the modern default, but you are under no obligation to use it. The alternative is simply to store each character as two bytes, an encoding called UTF-16 (its older fixed-width form is called UCS-2). If a page or file has more Chinese characters than ASCII, UTF-16 may make it smaller. If it has more ASCII than Chinese characters, UTF-8 will be more efficient. German newspapers may transmit their pages in ISO 8859-1 because it guarantees one byte per character for all German language text.
Even sites written entirely in Chinese have groups of “graphic designers” and “user interface specialists” who agonize about the size and placement of text boxes, background colors, and margins. They may use a Web editor or site designer where they never see anything but the finished page, but under the covers these tools generate tons of HTML tags and CSS styles, all expressed in ASCII. That is why UTF-8 is empirically a sensible default.
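The size trade-off described above can be measured. The sketch below compares byte counts for a few made-up strings; the helper `sizes` and the sample markup are hypothetical, and the `utf-16-le` codec is used so that the two byte order mark Python would otherwise prepend does not distort the counts.

```python
def sizes(text):
    """Return the encoded length of text in UTF-8 and in two-byte UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

chinese_only = "\u4E2D\u6587\u7F51\u9875"                      # 中文网页
mixed_page = '<p class="title">\u4E2D\u6587\u7F51\u9875</p>'   # same text inside markup
german = "Gr\u00FC\u00DFe"                                     # Grüße

chinese_sizes = sizes(chinese_only)
mixed_sizes = sizes(mixed_page)
german_latin1 = len(german.encode("latin-1"))
german_utf8 = len(german.encode("utf-8"))

print(chinese_sizes, mixed_sizes, german_latin1, german_utf8)
```

Pure Chinese text is smaller in the two-byte form, but once the same text is wrapped in ASCII markup the balance tips toward UTF-8.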
HTML and XML define “numeric entities” that allow a programmer to represent any character in the Unicode set without leaving ASCII. For example, to put ΦΒΚ (Phi Beta Kappa) into a Web page, the three Greek letters can be expressed in a pure ASCII HTML file as “&#” followed by the decimal numeric code values (or “x” followed by the hex numeric code) assigned to the letters in Unicode, with an ending semicolon. Thus Phi Beta Kappa is represented by “&#934;&#914;&#922;”. This example required six ASCII characters to represent each special character. This is reasonable when most of the text is English and foreign characters are unusual.
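Numeric entities can be generated and decoded mechanically. The following sketch uses Python's standard `html` module and the built-in `xmlcharrefreplace` error handler; the variable names are illustrative.

```python
import html

greek = "\u03A6\u0392\u039A"  # ΦΒΚ

# Build the decimal and hexadecimal forms by hand from the code values.
decimal_entities = "".join(f"&#{ord(ch)};" for ch in greek)
hex_entities = "".join(f"&#x{ord(ch):X};" for ch in greek)

# Python can also produce the decimal form while encoding to ASCII.
via_codec = greek.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Decoding either form recovers the original letters.
decoded_decimal = html.unescape(decimal_entities)
decoded_hex = html.unescape(hex_entities)

print(decimal_entities, hex_entities)
```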