Localization and Internationalization

Computers are configured to operate in the local language of the country in which they are installed. Keyboards not only support national characters and accents; the layout of the keys also varies from country to country (“QWERTY” in the US, “AZERTY” in France, “QWERTZ” in Germany). Utilities expect the local character set, and files are probably written to disk in the same national code that was in use ten years ago.

There may still be some equipment that substitutes local characters for the “national use” characters of ASCII. However, the rapid replacement of computers and printers has probably migrated most use outside the Far East to one of the eight-bit character sets. In the Far East, the local character set will be one of the traditional multi-byte character sets (in Japan, for example, either EUC or Shift-JIS).
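To make the difference concrete, here is a minimal Python sketch (the sample string is our own) showing that the same Japanese text produces entirely different byte sequences under EUC-JP and Shift-JIS, which is why producer and consumer must agree on the encoding:

```python
# The same three characters ("Japanese language") encoded two ways.
text = "日本語"
euc_bytes = text.encode("euc-jp")
sjis_bytes = text.encode("shift_jis")

# The byte sequences differ, but each round-trips back to the same text
# as long as the reader knows which encoding the writer used.
assert euc_bytes != sjis_bytes
assert euc_bytes.decode("euc-jp") == text
assert sjis_bytes.decode("shift_jis") == text
```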

Most foreign-language files are exchanged with other people in the same country or region. So if both the producer and the consumer of the text default to the same character set and encoding, the question of international standards doesn’t arise.

It would not have been necessary for the Web standards to move to Unicode just to display text from different countries on the same page. A variety of HTML constructs (IMG, IFRAME, OBJECT) allow a “Web page” to be composed from different sources, each with its own format. The IMG, for example, references an external image file in GIF or JPEG format. If it were necessary for the Browser to combine data from incompatible character sets, each block of text could be transmitted from the Web Server with its own Content-Type header carrying its own “charset” designation.
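The per-block scheme described above could be sketched as follows. The Content-Type header and its charset parameter are real HTTP; the `decode_block` helper and the sample bodies are hypothetical, for illustration only:

```python
# Sketch: decode one fetched text block according to the charset
# parameter of its own Content-Type header.
def decode_block(body: bytes, content_type: str) -> str:
    # e.g. content_type = "text/html; charset=iso-8859-1"
    charset = "iso-8859-1"  # the historical HTTP/1.1 default for text
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            charset = value.strip('"')
    return body.decode(charset)

# Two blocks from different servers, each labeled with its own charset.
french = decode_block(b"na\xefve", "text/plain; charset=iso-8859-1")
japanese = decode_block("こんにちは".encode("shift_jis"),
                        "text/plain; charset=shift_jis")
assert french == "naïve"
assert japanese == "こんにちは"
```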

However, in the HTML 4 world of dynamic content manipulated through CSS and the DOM, it would be almost impossible for a Browser to manage content unless it had all been reduced internally to a common character set. Given the state of modern technology, the Browser has to be programmed to use Unicode.
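A minimal sketch of what “reduced internally to a common character set” buys (the sample strings and charsets are our own): once every block has been decoded to Unicode, the Browser can concatenate, measure, and search mixed-script text with one string type, without tracking where each piece came from.

```python
# Blocks arriving in different legacy encodings...
greek = "Ελλάδα".encode("iso-8859-7").decode("iso-8859-7")
japanese = "日本".encode("euc-jp").decode("euc-jp")

# ...become ordinary Unicode strings that mix freely.
combined = greek + " / " + japanese
assert "日本" in combined
assert len(combined) == len(greek) + 3 + len(japanese)
```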

If the Browser had to use Unicode internally, then the HTML and XML standards might just as well accept Unicode as the reference character set in which each markup language standard is defined. A single Web page can then contain text in as many different languages, drawn from as many different character sets, as the author might choose, without the requirement that each language segment be isolated in its own file.
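The single-file multilingual page can be sketched in a few lines (the sample markup is our own): one UTF-8 byte stream carries every script at once, where no single legacy character set could hold them all.

```python
# One page, four scripts, one encoding.
page = "<p>English, Français, 日本語, Русский</p>"
data = page.encode("utf-8")
assert data.decode("utf-8") == page

# A legacy eight-bit charset cannot represent the whole page.
try:
    page.encode("iso-8859-1")
    raise AssertionError("should not be reachable")
except UnicodeEncodeError:
    pass  # expected: Latin-1 has no Japanese or Cyrillic characters
```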