The Multi-National Page

The Content-Type header and its character encoding let the browser do the right thing when one page is in French and another page is in Chinese. But suppose a single page has sections or quotes in several different languages. How do you aggregate or edit text from many different sources and deliver it all through the same page and byte stream?

In some sense, the modern Web standards have evolved step by step in this direction. While the original HTTP standard for the first version of the Web defaulted to ISO 8859-1 (Americas, Western Europe), most current Web standards assume the Unicode character set and the UTF-8 encoding when nothing is specified. That is also the default for modern editors and tools.
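As a rough illustration of why UTF-8 suits a mixed page, the sketch below joins three short sample strings (chosen here only for illustration) and encodes them as one byte stream. It then tries the same thing with ISO 8859-1, which has room for the French line but not for the others.

```python
# Sample snippets in three scripts; the strings are illustrative assumptions.
french = "Où est la bibliothèque ?"
chinese = "你好"
hebrew = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word "shalom", written with escapes

page = "\n".join([french, chinese, hebrew])

# UTF-8 carries all three in a single byte stream and round-trips cleanly.
utf8_bytes = page.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == page

# ISO 8859-1 can hold the French line on its own...
french_latin1 = french.encode("iso-8859-1")

# ...but it has no values for the Chinese or Hebrew characters.
try:
    page.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print("UTF-8 round trip ok:", round_trip_ok)
print("whole page fits in ISO 8859-1:", fits_in_latin1)
```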

That leaves two imponderable questions in XML and Unicode.

Which Far Eastern Language?

To keep the number of characters manageable, Unicode maps the Han characters shared by the three Far Eastern languages (Chinese, Japanese, Korean) onto a common set of character values. If you have a block of characters in this range of values, there is nothing in the bytes themselves to tell you which of the three languages you should use to display the text.
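A small sketch of the point, using the character U+76F4 as an assumed example: it exists in both a Japanese and a Chinese national character set, but once either source is decoded to Unicode the result is the same code point and the same UTF-8 bytes, with no trace of where it came from.

```python
import unicodedata

ch = "\u76f4"  # a Han character used in both Japanese and Chinese text

# Each national character set has its own byte sequence for the character.
sjis_bytes = ch.encode("shift_jis")  # Japanese legacy encoding
gb_bytes = ch.encode("gb2312")       # Chinese legacy encoding

# After decoding, both sources yield one and the same Unicode character.
from_japan = sjis_bytes.decode("shift_jis")
from_china = gb_bytes.decode("gb2312")

print(unicodedata.name(from_japan))
print("same code point:", from_japan == from_china)
print("same UTF-8 bytes:", from_japan.encode("utf-8") == from_china.encode("utf-8"))
```

The language has to be carried somewhere outside the character data, which is what the rest of this section is about.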

Of course, if you know the geographic origin of a file (Tokyo, for example) or if the data was originally imported from a national character set (Shift JIS instead of Unicode), then you can guess the language (Japanese). However, once the text stops being a file or URL all by itself and has been merged into a larger multi-national document, that context is lost, and it becomes necessary for the paragraph or section to be marked with the language choice.

XML and HTML provide the lang attribute for this purpose, written as <x lang="..."> (where x is p, div, span, etc.); in XML the attribute is spelled xml:lang. Although it can be used for languages like French or English, it really matters for blocks of text in any of the Far Eastern languages.
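One way to picture the marking step is a small helper that decodes imported text and records the language guess in a lang attribute before the text is merged into a larger page. This is only a sketch: the function name tagged_paragraph and the CHARSET_TO_LANG table are hypothetical, and guessing the language from the source encoding is the heuristic described above, not a rule.

```python
from html import escape

# Hypothetical table: legacy encoding of the source -> guessed language tag.
CHARSET_TO_LANG = {
    "shift_jis": "ja",
    "euc_jp": "ja",
    "euc_kr": "ko",
    "gb2312": "zh",
    "big5": "zh",
}

def tagged_paragraph(raw: bytes, charset: str) -> str:
    """Decode legacy bytes and wrap them in a <p> that records the language."""
    text = raw.decode(charset)
    lang = CHARSET_TO_LANG.get(charset)
    # Emit no lang attribute when the encoding gives no hint.
    attr = f' lang="{lang}"' if lang else ""
    return f"<p{attr}>{escape(text)}</p>"

# Simulate text imported from a Japanese source in Shift JIS.
japanese_bytes = "\u76f4\u63a5".encode("shift_jis")
print(tagged_paragraph(japanese_bytes, "shift_jis"))
```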

Direction

Hebrew and Arabic are right-to-left languages. Not only are the characters written right to left, but paragraphs also align against the right margin.

Suppose a paragraph is a mixture of English and Hebrew. If you regard it as English text with a few Hebrew words, then it should align to the left margin. If you regard it as Hebrew with some English words, then it should align to the right margin.
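The ambiguity can be seen in the character data itself. In the sketch below (the sample string is an assumption), Python's unicodedata module reports the direction class Unicode assigns to each character. Every letter knows its own direction, but nothing in the string says which language the paragraph as a whole belongs to.

```python
import unicodedata

# Hebrew word "shalom" (written with escapes) followed by an English word.
mixed = "\u05e9\u05dc\u05d5\u05dd world"

# Keep only characters with a strong direction: L (left to right),
# R (right to left, e.g. Hebrew) or AL (right to left, Arabic letters).
strong = [
    unicodedata.bidirectional(c)
    for c in mixed
    if unicodedata.bidirectional(c) in ("L", "R", "AL")
]

print("right-to-left letters:", strong.count("R") + strong.count("AL"))
print("left-to-right letters:", strong.count("L"))
```

Counting letters like this describes the mixture but does not settle which language is primary; that remains a decision about the meaning of the paragraph.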

You can try to solve this problem with some version of an "align" keyword or with style and CSS. That is a mistake. This is a matter of the semantic content of the paragraph and not a question of graphic layout.

You can solve the problem by marking the paragraph with the lang="he" attribute if the primary language is Hebrew. Of course, the paragraph is really a mixture of two languages, so the attribute only tells you which one is primary.

The alternative is to put the dir="rtl" attribute on paragraphs or DIVs where the primary language is one of the right-to-left languages. The best approach may be to use both lang and dir attributes.
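A minimal sketch of marking a paragraph with both attributes, assuming the author has already decided which language is primary. The helper name paragraph and the RTL_LANGS set are hypothetical, and the set is deliberately incomplete.

```python
from html import escape

# Assumed, incomplete set of language codes written right to left.
RTL_LANGS = {"ar", "he", "fa", "ur"}

def paragraph(text: str, primary_lang: str) -> str:
    """Wrap text in a <p> carrying both the primary language and its direction."""
    direction = "rtl" if primary_lang in RTL_LANGS else "ltr"
    return f'<p lang="{escape(primary_lang)}" dir="{direction}">{escape(text)}</p>'

# The same mixed text, marked two ways depending on the author's decision.
mixed = "\u05e9\u05dc\u05d5\u05dd world"
print(paragraph(mixed, "he"))  # treated as Hebrew with an English word
print(paragraph(mixed, "en"))  # treated as English with a Hebrew word
```

The text is identical in both calls; only the declared primary language, and with it the direction, changes.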