Canonical Forms and Normalization

The use of diacritical (“accent”) marks varies from language to language. Sometimes the mark changes the way that the character is pronounced. Sometimes it represents a tone shift. In some languages the “letters” are consonants and the “marks” are vowels. In some languages the diacritical marks are components that must be assembled together in order to form a meaningful character.

If the mark simply changes the pronunciation of the letter, as in most European languages, then the accented letter is plausibly a single character. Most modern character codes have a unique code value for characters like è (e with grave accent) and ç (c with cedilla). However, when a mark acts as a vowel then the mark really is a separate character even though it is displayed above or below some other character.

In the days before computers, European typewriters had a method of typing characters without advancing the print element to the next position. The previously typed character could then be overstruck with a second character, producing a compound result. Thus the grave accent “ ̀” and the letter “e” would be two separate keystrokes that, struck one after the other, generated “è”.

When this technique was carried over to computers, it became the concept of “non-spacing” characters. Accented characters would be stored in files as two-character sequences (accent mark first) that combine to represent the accented character. However, as computers and printers became more powerful, it became more common to represent the accented character as a single code point.

Through the national politics (or technical opinion) that went into the Unicode standard, it was decided to retain the non-spacing accent characters (Unicode calls them “combining” characters) as well as providing single character codes for all the accented letters in common use. Thus there are multiple ways in Unicode to represent an accented character: as a single precomposed code point, or as a two-character sequence of a base letter and a non-spacing mark. Note that Unicode reverses the typewriter order: the base letter comes first and the combining mark follows it.
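To make the two representations concrete, here is a small Python sketch using only the standard `unicodedata` module. The variable names and the `describe` helper are illustrative, not part of any standard. It shows that “è” written as one precomposed code point and “è” written as a base letter plus a combining mark are different code point sequences, even though they normally display identically.

```python
import unicodedata

# Two spellings of the same visible character.
precomposed = "\u00e8"   # one code point: e with grave accent
combining = "e\u0300"    # two code points: "e", then the combining grave accent

def describe(text):
    """List each code point in text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(precomposed))
print(describe(combining))

# A plain comparison looks at code points, so the two spellings differ.
print(precomposed == combining)
```

A naive equality test therefore reports the two strings as different, which is the problem the rest of this section is concerned with.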

Then there is the problem of “ligatures”, where certain languages have a single character that combines two letters. Typically this is a typographic convention, so even when the “œ” ligature exists, many documents will write the same words using the two characters “o” and “e” typed separately.

German also has the special character “ß”, which substitutes for “ss”. Its use is a regional and stylistic matter, so some documents will write “weiß” and some will write “weiss”.
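As a rough illustration of why these spellings are harder to reconcile than accents, the Python sketch below compares “weiß” and “weiss”. It assumes Python's built-in `str.casefold()`, which applies Unicode case folding; this is a looser operation than normalization (it also erases upper/lower case distinctions), and the `caseless_equal` helper is a hypothetical name chosen for this example.

```python
import unicodedata

def caseless_equal(a, b):
    """Compare two strings after Unicode case folding (maps ß to ss)."""
    return a.casefold() == b.casefold()

# Normalization alone leaves ß untouched, so the two spellings stay different.
print(unicodedata.normalize("NFKC", "weiß") == "weiss")

# Case folding expands ß to "ss", so the spellings compare equal.
print(caseless_equal("weiß", "weiss"))
```

Whether treating the two spellings as equal is appropriate depends on the application; the sketch only shows that it takes a step beyond normalization to do so.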

The Unicode standards organization decided to tackle the hard problem of comparing strings and develop some systematic standards. In some sense, once they had decided to include both single and double code versions of the same accented character, they had a moral obligation to provide guidance about the mess they left behind. The rules explain how to handle different representations of the same thing in different character codes and then compare two strings to see if they are equal.
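A minimal sketch of that idea in Python, assuming the standard library's `unicodedata.normalize` and the NFC and NFD normalization forms it supports. The `canonically_equal` helper is a hypothetical name for this example: it converts both strings to the same form before comparing them.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after converting both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e8"   # è as a single code point
combining = "e\u0300"    # è as "e" plus the combining grave accent

# The raw code point sequences differ, but their normalized forms match.
print(precomposed == combining)
print(canonically_equal(precomposed, combining))

# NFD goes the other way and splits the precomposed character apart.
print(unicodedata.normalize("NFD", precomposed) == combining)
```

Either form works for comparison as long as both strings are converted to the same one. NFC is used here only because composed text tends to be shorter.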