Run Length Encoding (RLE)

Run Length Encoding is a well-known lossless data compression method. There are several different implementation schemes, but they all reduce long groups of identical data elements into some shorter encoded forms. These shorter forms later expand to the exact original data with no loss of information. Certainly, the ideas about handling duplicate symbols in an efficient way predate the early hardware implementations. This paper by Richard B. Johnson [3]  is not about claiming invention or originality. Instead, it records his specific implementation, which was quite advanced for its time. His involvement was in drawing the schematic and building the device. He worked with two MIT students, neither of which considered this an “invention.” [6,7]  One of the students was quite expert with digital logic.

Narrative
When working as a technician for MIT’s Haystack Research Facility in the summer of 1966, there was the need to compress some data produced by an early A/D converter. I think it was an Epsco Datrac encoder. The usual way to capture moon-bounce data was to use video tape and transport it by automobile to the MIT campus in Cambridge. Actually, I knew that the data really went to the Lincoln Laboratory at Hanscom Air Force Base in Lexington, but nobody was to talk about that. An idea two graduate students were working on was to send the data directly using a microwave link. These two students were under the tutelage of Professor Thomas Stockham [8]  who was perfecting digital audio recording at the time. The microwave link was a television link designed for remote television broadcasting. We were only able to use the low-bandwidth audio signal path because the high-bandwidth video portion expected synchronizing pulses to operate. Since we borrowed the link, we could not modify it. Produced by Raytheon, it used vacuum tubes like most of the electronic equipment of the day. The problem was that the data rate exceeded the bandwidth of the system. To be able to send the data, I proposed to record the data on tape in real-time, then play the data back at a slower speed. This would get the data to the MIT campus, certainly faster than by automobile. Since I was only a technician, with little college training, (I was a sophomore at another college), the staff rejected my idea as the mouthing of a fool.

Not put down, I immediately responded with, “Of course you can always compress the data.” Few, including myself, had even heard of data compression before. This was simply something I spouted out in self-defense. At the time, I had no clue about actually doing it! Within a few days, one of the graduate students had checked with Professor Stockham and added the data compression task to his Master’s Degree project. Thus began a project in which 128-byte blocks of data clock into registers where they compare. If there are more than four identical bytes in a row, a count accumulates. In the output stream, this event substituted a sentinel byte, hexadecimal BB, followed by the byte-count, then the byte to repeat. This, I believe, was the first implementation of what we now know as RLE encoding. It used several hundred TTL integrated circuits on a “perf-board” and took over a month to construct using soldered connections. The choice of the value of the sentinel byte was the least used byte in typical data. The use of a sentinel byte rather than using a doubled value indicating the compressed data was for simplification of the hardware. Many years later, I used the same scheme, including the same sentinel byte value, but in software, for the JMODEM file-transfer protocol. [1,2,3,9] 

Method
Given a stream of data bytes with a hexadecimal representation such as:

00 00 00 00 00 00 00 00 01 02 03 04 05 06 07 08

Observe that there are eight bytes of zeros followed by additional data bytes. This encodes as:

BB 08 00 01 02 03 04 05 06 07 08

Observe that this stream is now only eleven bytes in length. This can result in a 16/11 = 1.5 improvement in data transfer rate through a bandwidth limited communications channel. In real communications systems, there are often very long streams of identical data symbols. This method is often called the “escape” method because it uses an escape character, the sentinel. A more “standard” implementation does not use a sentinel byte. Instead, it encodes as:
00 00 08 01 02 03 04 05 06 07 08

This method repeats the duplicate symbol “00,” with the same symbol, and then adds the symbol count. Ether method results in the same level of compression. The duplicate symbol was difficult to do in hardware (there were no CPUs available) so the method I chose was to use a sentinel byte. Some literature encodes RLL data as a length-byte followed by the byte to repeat or vice-versa. [5]  Such implementations are generally unworkable with real data as a string of incrementing bytes nearly doubles its original length. Observe that the string:

00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E

encodes to:
01 00 01 01 01 02 01 03 01 04 01 05 01 06 01 07
01 08 01 09 01 0A 01 0B 01 0C 01 0D 01 0E

Clearly, this does not represent compression. Nevertheless, this method is often cited in on-line references and in some textbooks as well. Methods that use either the double symbol or the escape methods allow data to remain non-encoded if the encoding would increase its length.

References
   1. Original JMODEM documentation
   2. The JMODEM file transfer protocol
   3. Dvorak, John C. (1989). Dvorak’s Guide to PC Telecommunications Osborne McGraw-Hill. ISBN 0078815517
   4. Richard B. Johnson’s biography
   5. RLE reference on Wikipedia
   The following two United States Patents seem to rely upon this well known prior art:
   6.  Method and apparatus for data compression and restoration U.S. patent 4,872,009
   7.  Method and system for data compression and restoration U.S. patent 4,586,027
   8. Professor Stockham
   9. C source code demonstrating RLE encoding and decoding

External links
Author’s website

Notes
This article may be freely copied or referenced. No copyright protection is claimed. The underlying HTML code is also in the public domain.