Newline and XML 1.1 | pclt.sites.yale.edu

On old IBM mainframe computers, each line of text was a separate “record” in the file. However, modern computers treat each text file as a stream of characters, one after another. To mark the end of one line and the beginning of the next, you need some control character typically called a “newline”.

Control characters were developed in the Teletype days before computers. Instead of printing a character on the page, they performed some function on the device. For example, the BEL character rang a bell to draw the attention of an operator to an important message. Other control characters started and stopped the paper tape punch. The first character (NUL with a value of 0) and the last character (DEL with a value of 127) did nothing and were useful when you wanted the machine to go idle for the amount of time it took to process one character.

The Teletype needed two separate characters to end a line and start a new line. The Carriage Return (CR) moved the print element back to the left margin of the paper, while the LineFeed (LF) character moved the paper up, advancing the print element to the new line. This may be surprising to those who remember old typewriters where the Return key did both (returning to the left margin and advancing the paper in one step).

The Teletype was a communications device before anyone even imagined computers. From it and similar devices came a set of communications standards that, once computers were invented, became computer communications standards. Although better devices than the old clunky Teletype quickly appeared, with new features like lowercase letters (the Teletype was uppercase only), they retained compatibility with the Teletype conventions. This meant that every new line was identified by two characters, a CR and an LF.

The early simple operating systems did not distinguish between the stream of characters stored on disk and the stream of characters sent to the terminal device. So computers built by Digital Equipment Corporation (DEC), the second largest computer maker after IBM, separated each line on disk by the CR-LF sequence.

However, some terminals offered higher speed or better quality printing if you were prepared to tolerate special limitations. As new terminals could print characters faster, they were not always able to move the print element all the way across the paper back to the left margin in the time it took to print a single character. Since this was all mechanical, the only way to use such devices was to “pad” every CR character you sent them with one to three NUL characters, which did nothing but allowed extra time for the print element to get all the way back across the page. The number of NULs depended on the device, so a different stream of characters had to be sent to each specific device, and the characters on disk could no longer be the same as the characters sent to each terminal.
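The padding step can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of any real terminal driver: the function name `pad_carriage_returns` and its `nul_count` parameter are hypothetical, and the pad count for any given device is an assumption.

```python
CR = b"\r"      # Carriage Return, value 13
NUL = b"\x00"   # NUL, value 0: prints nothing but takes one character time to send


def pad_carriage_returns(data: bytes, nul_count: int) -> bytes:
    """Follow every CR with nul_count NUL bytes so that a slow print
    element has time to travel back to the left margin."""
    if nul_count < 0:
        raise ValueError("nul_count must not be negative")
    return data.replace(CR, CR + NUL * nul_count)


# The same bytes on disk become a different stream for each device.
line_on_disk = b"HELLO\r\nWORLD\r\n"
fast_terminal = pad_carriage_returns(line_on_disk, 0)  # unchanged
slow_terminal = pad_carriage_returns(line_on_disk, 2)  # two NULs after each CR
```

Because the output depends on `nul_count`, the stored text and the transmitted text can no longer be byte-for-byte identical, which is the point the paragraph above makes.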

If you are going to have to perform special processing at the end of each line anyway, there is no reason to have the two characters CR-LF taking up two bytes on disk. When Unix was developed, its designers decided to store only the LF character on disk and call it a NewLine. Apple later decided to store only the CR character.

DEC had already decided to store the two CR-LF characters. The first operating system for pre-IBM personal computers (CP/M) was based on DEC systems and used CR-LF. Microsoft PC DOS was based on CP/M, and Windows is based on PC DOS. So today Windows computers still store the two-byte CR-LF sequence at the end of each line.
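To see which of these conventions a given file follows, you can count the line terminators in its raw bytes. The following is a minimal sketch; the helper name `count_line_endings` and the labels it returns are hypothetical choices made for this example.

```python
import re
from collections import Counter

# CR-LF is listed first so that it is counted once as a pair,
# not as a lone CR followed by a lone LF.
_LINE_END = re.compile(rb"\r\n|\r|\n")

_NAMES = {
    b"\r\n": "CRLF",  # DEC, CP/M, DOS, Windows
    b"\r": "CR",      # the older Apple convention
    b"\n": "LF",      # Unix and Linux
}


def count_line_endings(data: bytes) -> Counter:
    """Return how many times each line-ending convention occurs in data."""
    return Counter(_NAMES[m.group()] for m in _LINE_END.finditer(data))


sample = b"first\r\nsecond\nthird\rfourth\r\n"
counts = count_line_endings(sample)
```

A file written on a single system should show only one of the three labels; a mix usually means the file was edited on more than one platform.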

Which brings us back to IBM. When computers needed terminals, IBM was already the world's leading manufacturer of electric typewriters. Typewriters have a Return key that both moves the carriage back to the left margin and advances the paper. IBM was the one vendor that (at that time) did not use the ASCII code derived from Teletypes. It had an eight-bit code (EBCDIC) in the days when ASCII was seven bits, and with room for extra control characters it had a NewLine character (NL) that was different from CR or LF.

Eventually ASCII expanded to eight bits also. When it did, it added a new block of control characters at the start of the new characters (from value 128 to 159). One of the characters in this new block was an ASCII NL, formally named NEL (Next Line), with a value of 133.

Unicode in turn respected the previous standards. The first 256 Unicode characters are eight-bit ASCII (from the ISO 8859-1 set), and that means that Unicode has all the same control characters. However, while the ASCII standards were developed when there were still computer terminals, by the time Unicode arrived computer communication had moved to the Internet and there was no longer any reason to have most of the control characters. Newline and tab are the only meaningful control functions left.

XML was based on Unicode. That meant that XML should have accommodated all the control characters in the Unicode standard. However, the XML 1.0 published standard screwed up and forgot that code values 127 to 159 are reserved for control characters. Instead, XML 1.0 said that characters with a value less than 32 were control characters and everything else should be treated as text.

The XML 1.0 standard was wrong. It claimed to be based on Unicode, and that meant it had to follow the Unicode standard. At the same time almost nobody cared that it was wrong because almost nobody used control characters in the range 128 to 159. Except IBM, which translated its NL character to the ASCII/Unicode control code with a value of 133. They complained the standard was wrong, and the mistake was fixed in the XML 1.1 standard.
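The status of code 133 is easy to check from Python's standard library. The sketch below rests on two assumptions that are mine rather than the article's: that the `cp037` codec (one of several EBCDIC code pages) is a fair stand-in for IBM's encoding, and that the EBCDIC NL byte is 0x15, which is how that code page defines it.

```python
import unicodedata

NEL = "\x85"  # code point 133, U+0085, in the 128 to 159 control block

# Unicode classifies every code point from 128 to 159 as a control ("Cc").
c1_block_is_control = all(
    unicodedata.category(chr(cp)) == "Cc" for cp in range(128, 160)
)

# Python's str.splitlines() accepts NEL as a line boundary.
lines = "one\x85two".splitlines()

# Decoding the EBCDIC NL byte with the cp037 codec produces code 133.
ebcdic_nl = bytes([0x15])
decoded = ebcdic_nl.decode("cp037")

print(ord(NEL), c1_block_is_control, lines, decoded == NEL)
```

So text converted from an IBM system can arrive with code 133 at the end of each line rather than CR, LF, or CR-LF, which is exactly the case a parser written to the 1.0 wording did not recognize as a line end.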

Unfortunately, the XML processing libraries available at the time had been written to follow the exact language of the 1.0 standard, and correcting the error meant that they had to be changed. There was some grumbling about having to change their code to support some obsolete control characters that only one manufacturer used, but standards are standards and eventually the libraries were updated.

Which brings us to the problem of the canonical representation of “end of line”. On the PC it is represented by CR-LF, on Apple it was CR, on Unix/Linux it is LF, but occasionally it is NL (133). There are also some more obscure specialized end of line values with much higher Unicode code points for specialized devices. No matter how the local operating system chooses to delimit the end of line in the file system, when you are comparing different files from different sources the Unicode rules require all the different versions of end of line to be translated to a common standard before blocks of text are compared.
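A minimal sketch of that translation step follows. The function names `normalize_newlines` and `same_text` are hypothetical, and the set of line ends handled here (CR-LF, CR, NL at 133, and U+2028 LINE SEPARATOR as one example of the higher code points) is an assumption made for this example rather than a complete list taken from the Unicode standard.

```python
import re

# CR-LF is listed first so the pair collapses to one LF instead of two.
_ANY_NEWLINE = re.compile("\r\n|\r|\x85|\u2028")


def normalize_newlines(text: str) -> str:
    """Translate every recognized end of line to a single LF."""
    return _ANY_NEWLINE.sub("\n", text)


def same_text(a: str, b: str) -> bool:
    """Compare two blocks of text, ignoring how each one marks end of line."""
    return normalize_newlines(a) == normalize_newlines(b)


windows_copy = "alpha\r\nbeta\r\n"
unix_copy = "alpha\nbeta\n"
ibm_copy = "alpha\x85beta\x85"
```

With this in place, `same_text(windows_copy, unix_copy)` and `same_text(ibm_copy, unix_copy)` both hold, while texts that differ in the number of lines still compare as different.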