Programming Languages | pclt.sites.yale.edu

Early programming languages supported only 8-bit characters. Most that are still in use have been extended to also support Unicode. More modern programming languages support only Unicode strings, although they can encode them as 8-bit data for files and network streams.

Standards for programming languages and the Web are defined in terms of abstract characters instead of codes. For example, the languages based on C (including C++, C#, Perl, Java, and JavaScript) all delimit a block of statements between the brace characters “{” and “}”. In ASCII and all the other international standards, these characters are assigned the code values of 123 and 125. However, a programmer can create a C program on an IBM mainframe where the EBCDIC character set represents these two characters as 192 and 208.
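The difference between a character and its code can be checked with a short Python sketch. The helper name code_values is illustrative, and cp037 (EBCDIC US/Canada) is an assumed choice of mainframe code page; other EBCDIC pages may place some characters differently.

```python
# Code values of the brace characters in ASCII and in one EBCDIC code page.
# cp037 is an assumption made for this example, not the only EBCDIC variant.
def code_values(text, encoding):
    """Return the byte value that each character of text encodes to."""
    return list(text.encode(encoding))

ascii_codes = code_values("{}", "ascii")
ebcdic_codes = code_values("{}", "cp037")

print(ascii_codes)   # [123, 125]
print(ebcdic_codes)  # [192, 208]
```

The source text is the same two characters in both cases; only the stored byte values differ.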

The syntactic elements of every programming language or Web standard attach significance only to the characters in the ASCII subset (except for mainframe languages like PL/I that used a few IBM characters like “¬”, now mapped to the second half of the Latin 1 set). In an HTML or XML file, the language elements (tags, comments, directives) begin when the “<” character is encountered in the stream and end with the “>” character. A browser, editor, or utility must process the stream character by character to determine its syntax and structure.
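A minimal sketch of that character-by-character scan follows. The function name split_markup is hypothetical, and the logic is deliberately simplified: it assumes no “>” appears inside a quoted attribute value, comment, or CDATA section, all of which a real parser has to handle.

```python
def split_markup(text):
    """Split text into ("tag", ...) and ("text", ...) pieces, one character at a time."""
    pieces = []
    buffer = []
    in_tag = False
    for ch in text:
        if ch == "<" and not in_tag:
            # A "<" outside a tag ends any pending text and starts a tag.
            if buffer:
                pieces.append(("text", "".join(buffer)))
            buffer = [ch]
            in_tag = True
        elif ch == ">" and in_tag:
            # A ">" inside a tag closes it.
            buffer.append(ch)
            pieces.append(("tag", "".join(buffer)))
            buffer = []
            in_tag = False
        else:
            buffer.append(ch)
    if buffer:
        # Whatever is left over is either trailing text or an unclosed tag.
        pieces.append(("tag" if in_tag else "text", "".join(buffer)))
    return pieces

pieces = split_markup("<p>Grüße</p>")
print(pieces)  # [('tag', '<p>'), ('text', 'Grüße'), ('tag', '</p>')]
```

Only the ASCII characters “<” and “>” drive the structure; the non-ASCII letters in the content pass through untouched.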

Text literals in the program and HTML or XML tag content can contain a much larger range of characters. Older languages supported “characters” only in the sense of an array of bytes. The program could store and process the bytes while remaining indifferent to whichever single-byte character encoding might be associated with the data.
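To illustrate that indifference, the sketch below treats four bytes as plain data and only then interprets them under two different single-byte encodings. The byte values and the two ISO 8859 code pages are chosen just for this example.

```python
# Four bytes, stored and copied without any notion of what they "mean".
raw = bytes([0x63, 0x61, 0x66, 0xE9])

# The meaning of the last byte depends entirely on the encoding applied later.
as_latin1 = raw.decode("iso-8859-1")  # 0xE9 is "é" in Latin 1
as_greek = raw.decode("iso-8859-7")   # 0xE9 is "ι" (iota) in the Greek set

print(len(raw))    # 4
print(as_latin1)   # café
print(as_greek)    # cafι
```

A byte-oriented program can store, compare, and copy raw correctly without ever knowing which of the two readings was intended.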

Modern languages (Visual Basic, Java, C# and the other .NET languages) store characters as an array of two-byte units. They support character strings and literals in the full Unicode set. Similarly, the HTML 4 and XML standards are defined over Unicode as the base character set, so browsers and utilities that support these standards must internally process all text as Unicode.
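The two-byte units can be made visible with a small sketch. It uses Python's utf-16-le codec as a stand-in for the in-memory layout of those languages, which is an assumption of this example, and the helper name utf16_units is illustrative. Characters beyond U+FFFF occupy two such units (a surrogate pair), so a unit and a character are not always the same thing.

```python
def utf16_units(text):
    """Return the 16-bit code units that represent text in UTF-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

basic = utf16_units("A\u20ac")            # "A" and the euro sign: one unit each
supplementary = utf16_units("\U0001F600")  # U+1F600 needs a surrogate pair

print([hex(u) for u in basic])          # ['0x41', '0x20ac']
print([hex(u) for u in supplementary])  # ['0xd83d', '0xde00']
```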

What does it mean to say that HTML and XML use Unicode as their base character set? Tags still begin with the “<” character, and that character remains the same whether the file is coded in ASCII, 8859-1, Shift-JIS, or UTF-8. The same can be said for the other base semantic characters like “>”, “=”, “&”, and the quote marks. HTML goes further to insist that the tag names and attributes be the familiar names in the Latin alphabet. Again, the element names are the same in ASCII, 8859-6, or UTF-16, and English speakers will be happy to know that the “body” tag name remains the same even if the text is all French (take that, you Québécois nationalists).
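A quick way to see this is to push the “<” character through several encodings and back. The list of encodings below is an assumption of the example, written with the codec names Python uses.

```python
encodings = ["ascii", "iso-8859-1", "shift_jis", "utf-8", "utf-16-le"]

# Encode the same abstract character under each encoding...
encoded = {name: "<".encode(name) for name in encodings}
# ...and decode each result with the matching encoding.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name in encodings:
    print(name, encoded[name], decoded[name])
```

The stored bytes match for the four 8-bit-compatible encodings and differ for UTF-16, which adds a zero byte, yet every one decodes to the same character that the parser looks for.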

However, attribute values, comments, and the content of tags that generate text on the screen can be formed from any Unicode character. XML goes further by allowing tag and attribute names to be formed from national language characters.
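As a small illustration, Python's standard xml.etree.ElementTree parser accepts element and attribute names that contain non-ASCII letters. The document below is invented for this example.

```python
import xml.etree.ElementTree as ET

# An element name, an attribute name, and content that all use non-ASCII letters.
document = "<café qualité='bonne'>crème</café>"
root = ET.fromstring(document)

print(root.tag)     # café
print(root.attrib)  # {'qualité': 'bonne'}
print(root.text)    # crème
```

The markup characters themselves are still the ASCII “<”, “>”, “=”, and quote marks.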

There are a few cases where Unicode characters are stored on disk as an array of 16-bit characters. The Windows NTFS file system, for example, stores file names and attributes as two-byte Unicode characters. However, most file content is encoded as UTF-8 or converted to one of the eight-bit codes before it is stored in a file or database.
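The storage cost of each choice can be compared with a short sketch. The sample string is arbitrary, and ISO 8859-1 stands in here for "one of the eight-bit codes".

```python
# Ten characters, two of them outside ASCII (written as escapes for clarity).
text = "na\u00efve caf\u00e9"

utf16 = text.encode("utf-16-le")    # two bytes for every character here
utf8 = text.encode("utf-8")         # one byte for ASCII, two for each accented letter
latin1 = text.encode("iso-8859-1")  # one byte per character, Latin 1 repertoire only

print(len(text), len(utf16), len(utf8), len(latin1))  # 10 20 12 10
```

For mostly ASCII text, UTF-8 stays close to the size of an eight-bit code while still being able to represent any Unicode character.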