Character Encoding and Web Standards

History is a sequence of logical steps. Each makes sense at the time, but you can end up with a mess. The use of various character sets in various languages has been a problem in technology that dates back long before computers.

The Web is International

The Web browser is able to display text in any language and character set. The Web standards support this. You can create some international content simply by using the right editor, but at some point it helps to know how things work.

Terms

A character is a symbol with meaning. A character set is a collection of characters for some group of languages or applications. Each character can be assigned a numeric code so that it can be stored as data. Various encoding systems then store commonly used characters compactly, at the cost of letting less frequently encountered characters take up more space.
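As a rough illustration of the difference between a character's numeric code and its encoded size, the sketch below prints both for a few sample characters. UTF-8 is used only as one example of a variable-length encoding, and the helper name `describe` is made up for this example.

```python
# Sample characters, written as escapes so the source file's own encoding does not matter:
# "A", e-acute, the euro sign, and the Chinese character for "middle".
samples = ["A", "\u00e9", "\u20ac", "\u4e2d"]

def describe(ch: str) -> tuple[int, int]:
    """Return (numeric code, number of bytes when encoded as UTF-8) for one character."""
    return ord(ch), len(ch.encode("utf-8"))

for ch in samples:
    code, size = describe(ch)
    print(f"{ch!r}: code U+{code:04X}, {size} byte(s) in UTF-8")
```

The common Latin letter takes one byte, while the less frequent characters take two or three.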

We Had This Problem Before Computers

Characters had to be encoded in the early 20th Century for telegraph applications (AT&T) and punch card tabulation (IBM).

Eight Bit Codes

Since computers store data in 8-bit bytes, we spent most of the 1970s and ’80s trying to squeeze each language (except those of the Far East) into eight-bit codes.

Bidirectional Character Sets

In addition to geopolitical problems, the Middle East provides “right to left” languages that pose special presentation problems.
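Unicode records a directionality class for every character, and presentation software uses it when laying out mixed text. A minimal sketch, assuming Python's standard `unicodedata` module and a hypothetical helper name `direction_class`:

```python
import unicodedata

def direction_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

examples = [
    ("A", "Latin capital A"),
    ("\u05d0", "Hebrew alef"),
    ("\u0627", "Arabic alef"),
    ("1", "digit one"),
]

for ch, name in examples:
    # "L" is left to right, "R" and "AL" are right to left, "EN" is a European number.
    print(f"{name}: {direction_class(ch)}")
```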

Far East Local Double-Byte Codes

In the early days of computers, China, Japan, and Korea developed specific local two-byte codes for their enormous character sets.

Unicode and ISO 10646

Unicode is a universal character set that supports all languages in current use. It has become the standard for the Web and for modern programming languages.

Localization and Internationalization

A computer located in one country that vends pages in one language can adopt whatever legacy character encoding system is most familiar. However, other pages need to be able to quote short passages in any foreign character set.

Programming Languages

Early programming languages supported only 8-bit characters. Most that are still in use have been extended to also support Unicode. More modern programming languages support only Unicode strings, although they can encode them to 8-bit files and network streams.
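Python 3 is one language of the second kind: its strings are Unicode, and bytes appear only when a string is explicitly encoded. The sketch below is an illustration under that assumption; the helper name `round_trip` and the sample text are invented for the example.

```python
text = "na\u00efve caf\u00e9"  # a Unicode string of 10 characters

def round_trip(s: str, encoding: str) -> bytes:
    """Encode a Unicode string to bytes and check that decoding restores it."""
    data = s.encode(encoding)
    assert data.decode(encoding) == s
    return data

utf8 = round_trip(text, "utf-8")      # the two accented letters take 2 bytes each
latin1 = round_trip(text, "latin-1")  # every character fits in a single byte

print(len(text), len(utf8), len(latin1))
```

The same string has a different byte length in each encoding, which is why the encoding has to be named whenever strings cross into files or network streams.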

Newline and XML 1.1

The end of the line is not a problem for printed paper, but computer systems have developed incompatible conventions over time. This is a small example of how computer professionals can develop narrow minds and misinformation.
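One way to cope with the incompatible conventions is to convert every line ending to a single form on input. The sketch below is an assumption-laden example, not a reference implementation: the function name `normalize_newlines` is hypothetical, and the set of sequences it recognizes is modeled on the line breaks that XML 1.1 translates to a single line feed.

```python
import re

# Longer sequences come first so that "\r\n" and "\r\x85" become one newline, not two.
_LINE_END = re.compile("\r\n|\r\x85|\r|\x85|\u2028")

def normalize_newlines(text: str) -> str:
    """Replace CR LF, CR NEL, lone CR, NEL (U+0085), and LS (U+2028) with LF."""
    return _LINE_END.sub("\n", text)

mixed = "unix\nwindows\r\nold mac\rmainframe\x85unicode\u2028end"
print(normalize_newlines(mixed).split("\n"))
```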

Canonical Forms and Normalization

When there is more than one way to write something, one needs a standard form to which everything can be converted in order to answer questions like “are these two book titles the same?”
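The book-title question can be sketched with Unicode normalization. This example assumes Python's standard `unicodedata` module and picks NFC as the common form; the function name `same_title` and the sample titles are invented for illustration.

```python
import unicodedata

def same_title(a: str, b: str) -> bool:
    """Compare two strings after converting both to the NFC normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "Caf\u00e9"      # e-acute stored as one precomposed character
decomposed = "Cafe\u0301"   # plain "e" followed by a combining acute accent

print(composed == decomposed)            # raw comparison sees different code sequences
print(same_title(composed, decomposed))  # comparison after normalization
```

Both strings look identical on screen, yet only the normalized comparison treats them as equal.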

Naked Bytes

The standards for character sets, communication, and the Web establish a proper place to specify character sets and encoding. Unfortunately, these rules are often not followed. So in practice a Web browser may read through a little bit of the page and make a best guess about what characters are being used and how to present the data. If it guesses wrong, you can change the encoding manually from the “View” menu.
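The guessing step might look something like the sketch below. This is a deliberately simplified assumption, not the algorithm any particular browser uses: it skips a UTF-8 byte order mark if one is present, tries strict UTF-8, and otherwise falls back to a single legacy encoding. The function name `guess_decode` and the choice of `cp1252` as the fallback are illustrative only.

```python
import codecs

def guess_decode(raw: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Return (decoded text, name of the encoding that was guessed)."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # drop the byte order mark
    try:
        # Strict UTF-8 rejects most byte sequences produced by legacy 8-bit encodings.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Hypothetical fallback: treat the bytes as a Western European 8-bit code.
        return raw.decode(fallback, errors="replace"), fallback

print(guess_decode("caf\u00e9".encode("utf-8")))
print(guess_decode(b"caf\xe9"))
```

A guess of this kind can still be wrong, since many 8-bit encodings accept the same bytes with different meanings, which is why the manual override exists.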

The Multinational Page

Encoding may determine which language and character set to use for an entire file, but when you start to mix character sets on the same page you need to know which attributes to use to distinguish paragraphs and quotations.