Bidirectional Character Sets

In addition to geopolitical problems, the Middle East provides “right to left” languages that pose special presentation problems.

Hebrew and Arabic characters are read and written starting at the right margin and moving left. However, the ISO 8859-6 and -8 standards also include the Roman alphabet to represent HTML tags and programming language source. Computer source typically has a mixture of both types of characters, and tools that support such a mixture are said to be “bidirectional”.

Programs don’t have a problem. If, instead of reading text from the screen, you open the file and read the data from disk or from the network one character at a time, everything arrives in exactly the correct order. As data, the first character arrives first, and the first word has arrived when a blank is encountered. Latin characters arrive in the order they would be read. Hebrew and Arabic characters arrive in the order they would be read. The period at the end of the sentence arrives at the end of the sentence. That may not seem remarkable, but now consider how this perfectly reasonable stream of data gets messed up when it has to be presented to the human eye.
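As a small illustration of that point, the following Python sketch walks a mixed Hebrew and Latin string one character at a time. The sample string is an assumption made for this example (a Hebrew word followed by "XML."), and it is written with escapes so that no editor can reorder it on screen.

```python
import unicodedata

# Assumed sample: a Hebrew word, a blank, a Latin word, a period.
# Escapes keep the source unambiguous regardless of how an editor draws it.
text = "\u05d9\u05d5\u05e0\u05d9\u05e7\u05d5\u05d3 XML."

# Characters come out of the string in reading order, whatever the script.
for index, ch in enumerate(text):
    print(index, f"U+{ord(ch):04X}", unicodedata.name(ch))

# The first word is complete as soon as the first blank is seen.
words = text.split(" ")
first_word = words[0]
last_char = text[-1]
```

Here the first word found is the Hebrew one and the last character is the period, exactly as a reader would expect, even though a screen would draw the Hebrew word at the right-hand end of the line.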

Consider the following sentence borrowed from the description of Unicode at the www.unicode.org site. It appears first in English and then in Hebrew. Hebrew sections are color matched to the corresponding English translation:

Unicode is required by modern standards such as XML, Java, ECMAScript(JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646.

יוניקוד נדרש על-ידי תקנים מודרניים כמו XML, Java, ECMAScript(JavaScript), LDAP, CORBA 3.0, WML וכדומה, ומהווה למעשה את היישום הרשמי של תקן ISO/IEC 10646.

The second paragraph is justified to the right margin, and the contents of the paragraph are laid out starting at the right and moving to the left. To achieve this display, the paragraph tag carries the dir="rtl" attribute (the direction is right-to-left).
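A program that generates such a page only has to emit the attribute; it does not reorder any text. The sketch below shows one way this might look in Python. The helper name rtl_paragraph is hypothetical and exists only for this example.

```python
import html

def rtl_paragraph(text):
    """Wrap text (kept in its stored order) in a right-to-left paragraph.

    Hypothetical helper for illustration: the text is only escaped,
    never reversed or rearranged. Laying it out is the browser's job.
    """
    return f'<p dir="rtl">{html.escape(text)}</p>'

# Assumed sample: three Hebrew letters followed by a Latin word.
markup = rtl_paragraph("\u05db\u05de\u05d5 XML")
print(markup)
```

The characters inside the generated paragraph are in the same order as in the input string; only the attribute tells the browser which margin to start from.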

To read an RTL block of mixed text, read each line right to left, top to bottom. Start at the right margin and look for the first (rightmost) block of contiguous Hebrew or Latin characters. If the block is Hebrew, read the characters right to left. If the block is Latin, read the characters left to right. When all the characters in the block are read, check whether there are more characters to the left of the ones you just read. If there are, find the next block. If not, start at the right margin of the next line and find the next block.
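The "blocks" in that procedure can be approximated in code by grouping consecutive characters that share a direction. The sketch below is a deliberate simplification, not the Unicode Bidirectional Algorithm that browsers implement: it only labels each character as left-to-right, right-to-left, or "other" (blanks, punctuation, digits) using Python's unicodedata module. The names direction_of and split_runs are hypothetical.

```python
import unicodedata
from itertools import groupby

def direction_of(ch):
    """Label one character by its Unicode bidirectional class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":              # strong left-to-right, e.g. Latin letters
        return "ltr"
    if bidi in ("R", "AL"):      # strong right-to-left: Hebrew, Arabic letters
        return "rtl"
    return "other"               # blanks, punctuation, digits, and so on

def split_runs(text):
    """Split text, in stored order, into runs of same-direction characters."""
    return [(label, "".join(chars))
            for label, chars in groupby(text, key=direction_of)]

# Assumed sample: a fragment of a mixed Hebrew and Latin sentence.
sample = "\u05db\u05de\u05d5 XML, Java"
runs = split_runs(sample)
for label, run in runs:
    print(label, repr(run))
```

The runs come out in the order a reader visits them: the Hebrew block first, then "XML", then "Java". A reader scans the first block right to left and the other two left to right.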

To see what this really means, make the browser window wider or narrower and see how the characters flow. There is some strange behavior. For example, if the leftmost (last) characters on one line are Latin characters and the rightmost (first) characters on the next line are also Latin, then as you squeeze the window to make it narrower, text “flows” from the middle of the top line to the middle of the next line.

Numbers, however, are a problem. The characters we use for 0 to 9 are called “Arabic numerals”, but they are still read left to right and are the same in any language. Algorithmically, numbers are not regarded as part of the Latin or any other alphabet. This can cause a problem with some browsers when the “3.0” after “CORBA” or the “10646” wraps to the next line. Note also that the period that ends the sentence comes at the end (leftmost) even though the text that immediately precedes it is Latin and therefore its characters were scanned left to right.
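The separate status of digits shows up in the Unicode character database, which Python exposes through unicodedata.bidirectional. The short sketch below looks up the class of a few sample characters; the choice of samples is an assumption made for this example.

```python
import unicodedata

# Sample characters chosen for illustration.
samples = {
    "Latin letter A": "A",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter alef": "\u0627",
    "digit 3": "3",
    "period": ".",
    "blank": " ",
}

# Map each description to the character's bidirectional class.
classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}
for name, bidi in classes.items():
    print(f"{name}: {bidi}")
```

The letters report strong classes (L for Latin, R for Hebrew, AL for Arabic), while the digit reports EN ("European number") and the period CS ("common separator"). Neither of those belongs to an alphabet's direction, which is why display software has to decide from context where "3.0" and the final period should go.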

All of this is presentation. The algorithm to display the characters on the screen is complex. The reader has to follow a complex eyeball algorithm to scan the characters. On disk, in memory, or on the network, the bytes and characters all arrive in exact logical (reading) order. So programs don’t need to know anything about this complex presentational mess (except to add dir="rtl" to the tag).