Unicode and ISO 10646 | pclt.sites.yale.edu

During the 1980s a group of computer companies began working on a single universal character code, called Unicode, that could combine all the characters of all the languages in the world. They figured out a trick to keep the code down to a two byte value.

If you combine one existing character set for each of the three Far East languages (Chinese, Japanese, and Korean), you end up with a set much larger than the 65,536 possible values in a two byte code. However, a very large number of the ideographs in the three languages can be made to share the same code values if you adopt a certain historical perspective.

The ideographs used to write all three languages descend from the written script of ancient China; they are known as Han characters, after the Han dynasty. Down through the centuries the form and some of the meanings of the ideographs drifted apart in the three languages, and today the relationship cannot always be easily identified. The idea was to assign a single code value to the ideographs, one in each language, that appeared to descend from a common ancestor. This still left a few thousand more “modern” characters to fill in, but it kept the total number of code points within the two byte limit.

Requests subsequently appeared for a number of less important characters that were missed in the first pass. This pushed the character set beyond the two byte boundary. However, almost any important modern text can be expressed by staying inside the two byte limit.

Around the same time, an ISO committee was also looking for a single unified code. They were aware of the Unicode idea, but ISO's members are countries rather than companies. Initially each Far East country wanted to directly embed one of its traditional multi-byte codes in the final standard. The initial recommendation was monstrous.

The complete ISO group rejected this proposal and sent instructions back to the committee to look more carefully at Unicode. The combined efforts of the two groups improved the standard, so that today “Unicode” and “ISO 10646” are two different names for the same thing.

Complete information on Unicode can be found on the www.unicode.org site.

The simplest way to process a stream of Unicode characters is to store each character in a two byte field. If the data is predominantly ASCII, either because it contains mostly English language text or because of the HTML tags and JavaScript logic, then the data can be represented more compactly in a format called “UTF-8”. In this form, each of the ASCII characters 0-127 is represented as a single byte. Any other character, even one of the other Latin 1 characters in the range 128-255, is converted to a multi-byte sequence: code values up to 2047 take two bytes, and the remaining two byte code values expand to three bytes.
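The size differences described above can be checked with a short Python sketch. The sample characters and the helper names `utf8_length` and `utf16_length` are illustrative choices, not part of the original text; the two byte field is modeled here with Python's `utf-16-le` codec, which is an assumption about what a "two byte field" means in practice.

```python
# Sample characters, written as escapes so the source stays pure ASCII.
# The choice of samples is an assumption made for illustration.
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin 1 letter (code 233)",
    "\u20ac": "euro sign",
    "\u6f22": "Han ideograph",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


def utf16_length(ch: str) -> int:
    """Return the number of bytes used by a two byte field (UTF-16, no byte order mark)."""
    return len(ch.encode("utf-16-le"))


for ch, label in SAMPLES.items():
    print(
        f"U+{ord(ch):04X} {label}: "
        f"UTF-8 {utf8_length(ch)} bytes, two byte form {utf16_length(ch)} bytes"
    )
```

Every sample fits in one two byte field, while the UTF-8 length grows from one byte for ASCII to three bytes for the ideograph.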

Unless a text is predominantly composed of Far East languages, UTF-8 is usually the most compact external form for Unicode. However, if a file consists exclusively of characters from one particular national language, then it is probably more efficient to use a local eight bit or legacy Far East code.
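To make the trade-off concrete, the following sketch compares encoded sizes for a few short strings. The sample strings and the helper `encoded_sizes` are assumptions for illustration; `latin-1` and `shift_jis` are standard Python codec names standing in for "a local eight bit code" and "a legacy Far East code".

```python
def encoded_sizes(text: str, legacy_codec: str) -> dict:
    """Return the byte count of text in UTF-8, UTF-16, and one legacy codec."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # two bytes per character here
        legacy_codec: len(text.encode(legacy_codec)),
    }


# Illustrative samples (assumptions, not taken from the original text).
english = "Hello, world"
french = "\u00e9t\u00e9 \u00e0 No\u00ebl"   # four accented Latin 1 letters
japanese = "\u65e5\u672c\u8a9e"             # three ideographs

print(encoded_sizes(english, "latin-1"))
print(encoded_sizes(french, "latin-1"))
print(encoded_sizes(japanese, "shift_jis"))
```

With these samples the English string is smallest in UTF-8 (tied with Latin 1), the French string is smallest in Latin 1, and the Japanese string takes 9 bytes in UTF-8 but 6 bytes in either the two byte form or Shift JIS.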

Because Unicode assigns a single code value to Chinese, Japanese, and Korean ideographs that may have quite different forms, a Web page that contains Far Eastern Unicode data should specify which of the three languages is associated with any block of text. HTML 4 adds the “lang” attribute to almost all HTML tags for this purpose.

If a paragraph begins with

<p lang="ja">

then the browser knows to display its Unicode contents in Japanese ideographs. Had the paragraph begun

<p lang="zh">

then the browser would have rendered the same codes as Chinese ideographs.
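As a rough illustration of the point, the sketch below wraps the same code value in two paragraphs that differ only in their “lang” attribute. The helper `tagged_paragraph` is hypothetical, the language codes `ja` and `zh` are the usual codes for Japanese and Chinese, and the ideograph chosen is one commonly cited as being drawn differently in Japanese and Chinese fonts; whether a given browser actually shows a difference depends on the fonts installed.

```python
from html import escape


def tagged_paragraph(text: str, lang: str) -> str:
    """Wrap text in a <p> element that carries an HTML lang attribute."""
    return f'<p lang="{escape(lang, quote=True)}">{escape(text)}</p>'


# One unified ideograph, written as an escape; the choice is illustrative.
ideograph = "\u76f4"

japanese = tagged_paragraph(ideograph, "ja")
chinese = tagged_paragraph(ideograph, "zh")

# The character data is identical; only the language hint differs.
print(japanese)
print(chinese)
```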

XML has a similar attribute. However, since XML syntax is defined by the user or application and some preexisting object might already use an attribute called “lang” for some other purpose, the XML language selection attribute is qualified by a namespace as “xml:lang”. An object might include a tag:

...

Unicode characters that are part of the content of this tag and have code points in the ideographic character range would then be identified as Chinese. XML can only apply an attribute to the entire contents of a tag, and there is no provision to distinguish one string of characters in a tag from another. Therefore, XML itself doesn’t allow a tag to directly contain a text phrase in Chinese and another text phrase in Japanese. HTML doesn’t have this problem because any HTML structure can contain a <span>…</span> structure. If someone is designing an XML schema and requires the ability for a tag to mix different types of ideographs, then the schema must permit some subsidiary tag that can delimit spans of national characters.
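The subsidiary-tag approach can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `title` and `phrase` are invented for this example and belong to no real schema; the helper `language_runs` is likewise hypothetical. The sketch assumes that a tag without its own “xml:lang” takes the language of its parent.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: a Chinese title containing one Japanese phrase.
DOC = (
    '<title xml:lang="zh">\u76f4 '
    '<phrase xml:lang="ja">\u76f4</phrase>'
    ' \u9aa8</title>'
)


def language_runs(element, inherited=None):
    """Yield (language, text) pairs, letting xml:lang pass down to children."""
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from language_runs(child, lang)
        # Text after a child's closing tag belongs to the parent's language.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


runs = list(language_runs(ET.fromstring(DOC)))
for lang, text in runs:
    print(lang, text)
```

The same code value appears twice in the output, once tagged `zh` and once tagged `ja`, which is only possible because the hypothetical schema allows the nested `phrase` tag.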