Naked Bytes

A file on disk or an internet connection is a stream of bytes. To make sense of these bytes, you have to be able to guess what type of data they contain. It could be a picture of a dog, an MP3 song, or a copy of Pride and Prejudice. Even if you know it is text, it could be a Web page, a Word document, or a Kindle book.

You can usually guess what type of data is in a file from the extension at the end of its name. A file that ends in *.jpg is a picture in JPEG format, while a *.mpg file is a recorded TV program. However, even when you know something is a Web page because it ends in *.html, you still don’t know quite enough.

What is the character encoding? Nothing in the file name distinguishes a page encoded in the ISO 8859-1 set and written in German from a page encoded in UTF-8 and written in Arabic. Fortunately, if the page is written in English and uses only the ASCII characters, then all of the ISO 8859-? sets produce the same stream of bytes as UTF-8 does, so it doesn’t matter. It is also true that on any single computer all of the files may use the same encoding, so on that system (or on all the machines in that organization) everyone knows which encoding to use.
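
A short Python sketch makes the coincidence concrete: ASCII-only text produces identical bytes in ISO 8859-1 (Latin-1) and in UTF-8, while a single accented word does not. The sample strings are just examples.

    # ASCII-only text: byte-for-byte identical in Latin-1 and UTF-8.
    ascii_text = "Pride and Prejudice"
    print(ascii_text.encode("latin-1") == ascii_text.encode("utf-8"))   # True

    # One German word with non-ASCII characters: the encodings diverge.
    german_text = "Grüße"
    print(german_text.encode("latin-1"))   # b'Gr\xfc\xdfe'        (one byte each)
    print(german_text.encode("utf-8"))     # b'Gr\xc3\xbc\xc3\x9fe' (two bytes each)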

The problem didn’t become obvious until it was too late. When the operating systems were being developed, we could have added an attribute for the text encoding to the file system. Then we could have written every editor and program to understand and honor encodings. It didn’t happen at the time, but it is something to work toward in the future.
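
Modern Linux file systems offer a glimpse of what that could look like through extended attributes, although nothing standardizes their use for this purpose. The sketch below invents a "user.charset" attribute purely for illustration; it runs only on file systems with xattr support, and no existing editor would honor it.

    import os

    path = "example.txt"
    with open(path, "w", encoding="utf-8") as f:
        f.write("Grüße aus Berlin\n")

    # Record the encoding next to the file as an extended attribute...
    os.setxattr(path, "user.charset", b"utf-8")

    # ...so that a later reader could look it up instead of guessing.
    charset = os.getxattr(path, "user.charset").decode("ascii")
    with open(path, encoding=charset) as f:
        print(f.read())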

Just Before the Web

The systems, editors, and a lot of the data were already in place before the widespread use of the Web began in 1994-95. America was the biggest user of computers, and ASCII was good enough for most English text. That produced a bunch of editors and utilities, many still in use today, that have absolutely no idea of character encoding.

The introduction of the PC in 1981 was probably the last moment when a major product could be introduced that understood nothing about any foreign-language character standards (although even then IBM was being particularly dumb). By the time the Macintosh and Windows were being developed in the mid ’80s, it was understood that systems, programs, and devices had to at least support the ISO 8859-1 Western character set.

Other alphabets were regarded as a national problem requiring national solutions. The Chinese (Taiwan in those days) and the Japanese had their own home-grown double-byte national character sets, along with the devices and programs to edit, display, and print them.

Then along came the one-two punch of the Internet and the World Wide Web. Suddenly in the middle of the ’90s every computer was connected to every other computer, and a Web page from France could contain a hyperlink to a Web page in Japan.

Content Type

When Tim Berners-Lee was inventing the World Wide Web, he had the advantage of working at the multinational CERN project in Europe and of calling the thing “World Wide” in the first place. Since there was nothing in the stream of bytes to reliably tell which code set was being used, he added the Content-Type header to the HTTP protocol he was designing. HTTP headers are generated by the Web server, and they come in front of the page data. In the Content-Type header, the server could tell the browser that the page data was encoded in ISO 8859-4 (Baltic states and Greenland) or Shift-JIS (Japanese).
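
You can watch this header arrive with a few lines of Python; the URL here is just a stand-in for any page you care to check, and the exact charset reported depends on the server.

    import urllib.request

    with urllib.request.urlopen("https://www.example.org/") as response:
        # The server's claim about the page, sent before the page itself.
        print(response.headers.get("Content-Type"))     # e.g. text/html; charset=UTF-8
        print(response.headers.get_content_charset())   # e.g. utf-8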

This did not magically solve the problem for all files; it simply transferred the problem to the Web server, which had to manufacture the Content-Type header. In the US this took no work, because files were ASCII and the defaults all worked. In France or Japan, a Web server could be configured to tag every file with a Content-Type of 8859-1 or Shift-JIS. So the problem was really only serious when a server held files in several different encodings. Then, for the most part, the problem was unsolvable.
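
The server side of that configuration can be sketched in a few lines of Python using the standard library’s http.server. This is only an illustration of the idea; the single site-wide charset is the configuration choice being described, not a recommendation.

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    SITE_CHARSET = "ISO-8859-1"   # one default for every page this server hands out

    class CharsetHandler(SimpleHTTPRequestHandler):
        def guess_type(self, path):
            # Attach the configured charset to every HTML file we serve.
            content_type = super().guess_type(path)
            if content_type == "text/html":
                content_type += "; charset=" + SITE_CHARSET
            return content_type

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), CharsetHandler).serve_forever()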

Metadata

“Metadata” is a fancy word for information contained in a file about the file itself and its creation that is not supposed to be displayed to the end user unless he asks for it. Metadata may tell you who created the file, what program was used to edit it, and so on.

HTML is the language in which Web pages are written. It provides for a header section that contains control information, JavaScript, and metadata. One of the metadata tags can provide a suggestion to the Web server about what to put in the Content-Type header. If you use the Microsoft Office Web Designer to create an empty new Web page, its header will contain something like this:

    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled 1</title>
    </head>

If you only ever edit this page with the same editor, and you publish the data directly to the directory used by the Web server, then the tag in the second line provides a hint to the Web server that the file was written to disk using the UTF-8 Unicode encoding. If the file contains only English text with no special characters, this hint will be correct, but only because it is also meaningless: every file containing only English text with no special characters is the same in every ISO encoding and in UTF-8.

However, suppose someone edits the file with a different editor and inserts a slightly unusual character like “®” (the registered-trademark symbol). When he saves the file, it will be written to disk in whatever character encoding that particular editor likes to use. A plain text editor does not know about the special significance of the Content-Type HTML tag (for that matter, it does not know about HTML). If the editor saves files in ISO 8859-1, then the special character will be written to disk as a different byte value than would have been used if the file were actually encoded in UTF-8.
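
The damage is easy to see in Python: “®” is a single byte in ISO 8859-1 but a two-byte sequence in UTF-8, so a browser that trusts the wrong label decodes garbage.

    symbol = "®"
    print(symbol.encode("iso-8859-1"))   # b'\xae'       -- one byte
    print(symbol.encode("utf-8"))        # b'\xc2\xae'   -- two bytes

    # A file saved as ISO 8859-1 but declared as utf-8 no longer decodes cleanly:
    print(b"\xae".decode("utf-8", errors="replace"))   # '�'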

So now the metadata tag that was supposed to solve the problem of character encoding by providing the Web Server with a hint about what Content-Type header to send to the browser has actually made things worse. Because someone who did not understand character sets and HTML used the wrong editor, the header is now telling the server and browser that the file is in a different encoding than was actually used.

Sniffing

Browsers realize that most files do not contain metadata, that not all Web servers send an encoding value in the Content-Type header, and that sometimes the information they get is wrong.

The solution is to read the start of the file into memory (anything from 512 bytes to many thousands of bytes) and then poke around a bit to develop an intelligent guess about the right encoding to use.

Originally “sniffing” was just some home-grown code that Mozilla or Microsoft added to their browsers to increase the chance of doing the right thing most of the time. Today, however, sniffing is so universal that it has become part of the HTML5 standard proposed for all browsers. If you are going to make a wild-ass guess about something, it is comforting if it is an industry-standard wild-ass guess.

The details of sniffing are beyond the technical scope of this paper. One point is that the first few hundred bytes of every Web page are typically plain ASCII, so if you look at the first few bytes and every other byte is 0, you can guess that the file uses a 16-bit character set. Then you look ahead for metadata and some text. If a proposed encoding produces invalid characters, try a different encoding. In the end you are not guaranteed to get the right result, but most of the page will be readable, and if a few strings are garbled the reader can manually change the encoding from the browser menus to get a better result.
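
A deliberately simplified sketch of the idea, in Python, might look like this. The real HTML5 algorithm is far more involved; the function name, the thresholds, and the candidate encodings here are all invented for illustration.

    def sniff_encoding(head: bytes) -> str:
        # Byte-order marks are the easy cases.
        if head.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
            return "utf-16"

        # Many zero bytes in what should be ASCII also suggests a 16-bit encoding.
        if head[:64].count(0) > 16:
            return "utf-16"

        # Look ahead for a charset declaration in the HTML metadata.
        marker = b"charset="
        pos = head.lower().find(marker)
        if pos != -1:
            value = head.lower()[pos + len(marker):pos + len(marker) + 32]
            value = value.lstrip(b"\"' ")                  # skip an opening quote
            for delimiter in (b'"', b"'", b">", b";", b" "):
                value = value.split(delimiter)[0]
            return value.decode("ascii", "ignore")

        # Otherwise try a candidate and fall back if it produces invalid characters.
        try:
            head.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "iso-8859-1"   # every byte value is "valid" Latin-1

    print(sniff_encoding(b'<html><head><meta charset="utf-8"><title>A page</title>'))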

Long Term

However, a wild-ass guess by the browser will not solve the larger problem across the entire computing infrastructure. Character encoding should be a standard feature of every file, known at the operating-system level and enforced across all programs. This objective is incompatible with the Unix and Windows disk file systems as they now exist, and obviously it would take a long time for every program to be modified to accommodate a new standard. The problem is so big that the industry has never been willing to take the first step.

So people and programs will continue to make mistakes. Eventually we will migrate to new systems, new standards, new tools, and new formats that make the problem less pressing. At least now you understand that there is a problem and recognize some of the incomplete attempts to solve it.