This is a simpler precursor of Huffman coding. In essence, it is a bit representation that reflects the actual probability of each number in the data, in contrast to “regular” bits, which assume that all possible numbers are equally likely. For example, byte 21 is less than or equal to 127, so the 1st bit is 0; it is then compared against 63, and so on, arriving at 00010101. In a typical piece of text, however, some bytes are more frequent than others. The 1st comparison would therefore be against something less than 127, the next against something less than 63, and so on, so that frequent bytes are reached in fewer comparisons and get shorter bit strings. This is the same idea as the later and more optimal Huffman coding. The method was proposed by Shannon and Fano and is called SFC (Shannon-Fano coding) here for convenience.
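To make the "regular" case concrete, here is a minimal Python sketch of the comparison process for equally likely bytes. The function name `fixed_width_bits` is only illustrative; it halves the range 0..255 on each comparison and records one bit per step.

```python
def fixed_width_bits(value: int, width: int = 8) -> str:
    """Locate `value` in [0, 2**width - 1] by repeated halving, one bit per comparison."""
    lo, hi = 0, (1 << width) - 1
    bits = []
    while lo < hi:
        mid = (lo + hi) // 2      # 127 first, then 63 or 191, and so on
        if value <= mid:
            bits.append("0")      # value lies in the lower half
            hi = mid
        else:
            bits.append("1")      # value lies in the upper half
            lo = mid + 1
    return "".join(bits)


print(fixed_width_bits(21))  # 00010101
```

Every byte costs exactly 8 comparisons here, which is what frequency-aware split points are meant to improve on.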

### The algorithm

As described above, if the data is random, the median byte array would be 127, 63, 31, 15, 7, 3, 1, 0, 1, 2, 3, 5, 4, 5, 6, 7, 11, … But for text data, it could be something like 10, 4, 2, 1, 0, 0, 1, 2, 3, 3, 4, 7, … These values are derived from the frequency of each data byte (the same input that Huffman coding uses). The array is sent along with the compressed data. Being an array, it is easier to handle than the binary tree in Huffman coding.

As a simple illustration, assume there are 6 different bytes in the data, with the following counts in non-ascending order:

Cnt[] of {15, 76, 59, 123, 68, 154} = {40, 15, 10, 5, 2, 1}

Map {15, 76, 59, 123, 68, 154} to N[] = {0, 1, 2, 3, 4, 5}

Now construct the median array M[] for N[]:

The 1st element in M[] is the median of N[]: add up 40 + 15 + 10 + 5 + 2 + 1 = 73, and since 40 >= 73/2, M[0] = 0.

M[1] is the median of all elements in N[] that are <= M[0]. Since there is only one such element, M[1] = M[0] = 0. M[1] is also an “end” element (the only one).

M[2] is the median of all elements in N[] that are > M[0]. Since 15 + 10 >= (15 + 10 + 5 + 2 + 1)/2, M[2] could be 2, but splitting at 1 is closer to an even division (15 against 18, rather than 25 against 8), so M[2] = 1.

M[3] is the median of all elements in N[] that are > M[0] and <= M[2]. Since there is only one such element, M[3] = M[2] = 1. M[3] is also an “end” element.

…, thus arriving at the median array, M[] = {0, 0, 1, 1, 2, 2, 3, 3, 4, 5}. Some rearrangement is made to this array to shorten it before it is eventually attached to the compressed data, but this is not critical, as the array is typically quite short compared with the data. There is also a Boolean array associated with the median array, E[] = {0, 1, 0, 1, 0, 1, 0, 1, 0, 1}, indicating whether each element is at the end (of a search).
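The construction steps above can be sketched in Python as a recursive split. This is an illustration under stated assumptions, not the exact layout used here: the function name `build_median_array` is hypothetical, the split point is taken to be the one that divides the counts most evenly, and the shortening rearrangement mentioned above is not applied. As a result, the sketch produces an 11-element array that ends in 4, 4, 5 rather than the 10-element array ending in 4, 5.

```python
def build_median_array(counts):
    """Return (M, E) for symbols 0..len(counts)-1, listed branch by branch.

    counts[i] is the frequency of symbol i, in non-ascending order.
    Assumption: no shortening is applied, so every lone symbol gets
    its own "end" entry.
    """
    M, E = [], []
    if not counts:
        return M, E

    def split(lo, hi):                  # handles symbols lo..hi inclusive
        if lo == hi:                    # one symbol left: an "end" element
            M.append(lo)
            E.append(1)
            return
        total = sum(counts[lo:hi + 1])
        best, best_diff, left = lo, None, 0
        for k in range(lo, hi):         # candidate split: lo..k | k+1..hi
            left += counts[k]
            diff = abs(total - 2 * left)    # imbalance between the two sides
            if best_diff is None or diff < best_diff:
                best, best_diff = k, diff
        M.append(best)
        E.append(0)
        split(lo, best)                 # symbols <= the split point
        split(best + 1, hi)             # symbols > the split point

    split(0, len(counts) - 1)
    return M, E


M, E = build_median_array([40, 15, 10, 5, 2, 1])
print(M)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
print(E)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
```

The first eight entries agree with M[] and E[] above; only the tail differs, because the last pair of symbols is left unshortened.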

Byte 0 mapped to bits:

0 <= M[0], followed by 0 <= M[1] and also E[1] = 1 (reaching end), so Bits[0] = 0.

Byte 1 mapped to bits:

1 > M[0], so bypass M[1] to get to the next element greater than M[0], which is M[2]. 1 <= M[2], followed by 1 <= M[3] and also E[3] = 1, so Bits[1] = 10.

Byte 2 mapped to bits:

2 > M[0], so bypass M[1] to get to the next element greater than M[0], which is M[2]. 2 > M[2], so bypass M[3] to get to the next element greater than M[2], which is M[4]. 2 <= M[4], followed by 2 <= M[5] and also E[5] = 1, so Bits[2] = 110.

…, thus Bits[] of {0, 1, 2, 3, 4, 5} =

0

1 0

1 1 0

1 1 1 0

1 1 1 1 0

1 1 1 1 1
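The lookup that produces the table above can be sketched as a walk over the two arrays. This is a sketch under assumptions: the arrays below are an unshortened variant (11 elements, ending in 4, 4, 5, with every lone symbol given its own end entry), which keeps the "bypass" step simple, and the helper names `skip_branch` and `bits_for` are hypothetical.

```python
# Assumed unshortened arrays for the 6-symbol example.
M = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
E = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]


def skip_branch(i):
    """Return the index just past the whole branch that starts at M[i]."""
    pending = 1                       # branches still to be consumed
    while pending:
        pending += -1 if E[i] else 1  # an end closes one; a split opens two
        i += 1
    return i


def bits_for(n):
    """Walk M[] from the front, emitting 0 for '<=' and 1 for '>'."""
    i, bits = 0, ""
    while not E[i]:                   # stop once an end element is reached
        if n <= M[i]:
            bits += "0"
            i += 1                    # the '<=' branch follows immediately
        else:
            bits += "1"
            i = skip_branch(i + 1)    # bypass the '<=' branch
    return bits


print([bits_for(n) for n in range(6)])
# ['0', '10', '110', '1110', '11110', '11111']
```

No bit is emitted at an end element itself, which matches the walkthrough for bytes 0, 1 and 2.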

So data compression just maps bytes 0, 1, 2, … to the corresponding Bits[].

Decompression reverses this process, mapping the bits back to 0, 1, 2, …, then back to 15, 76, 59, 123, 68, 154.
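A minimal round-trip sketch of the two mappings, using the example bytes and Bits[] from above. Bits are kept as a text string of "0" and "1" for readability; a real implementation would pack them into bytes. The names `compress` and `decompress` are illustrative only.

```python
SYMBOLS = [15, 76, 59, 123, 68, 154]            # original byte values
BITS = ["0", "10", "110", "1110", "11110", "11111"]
COUNTS = [40, 15, 10, 5, 2, 1]

ENCODE = dict(zip(SYMBOLS, BITS))
DECODE = {bits: sym for sym, bits in ENCODE.items()}


def compress(data: bytes) -> str:
    """Replace each byte by its bit string."""
    return "".join(ENCODE[b] for b in data)


def decompress(bitstring: str) -> bytes:
    """Read bits until they match a code, emit the byte, then start over."""
    out, current = [], ""
    for bit in bitstring:
        current += bit
        if current in DECODE:         # safe because no code prefixes another
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return bytes(out)


data = bytes([15, 15, 76, 59, 15, 154])
packed = compress(data)
print(packed)                         # 0010110011111
print(decompress(packed) == data)     # True

# Size of the full 73-byte example, in bits, against 73 * 8 = 584.
print(sum(c * len(b) for c, b in zip(COUNTS, BITS)))  # 135
```

Decoding needs no separators because no code in Bits[] is a prefix of another.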

Depending on the text size, some common words may show up quite often, so concatenating data bytes into bigger units may pay off with better compression. For better performance, the principle of Rice coding is borrowed here: compression is applied not to the data itself but to the number of bits in each data unit. The (transformed) data, stripped of its top bit (which must be 1), is sent as is. Decompression reads the recorded number of bits back into each unit. Compression is thus primarily data mapping, with bit conversion playing a lesser role. This (recursive) concatenation scheme is somewhat like dictionary coding, such as LZW, in that it can also be applied to the likes of GIF, PNG and clip art.
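The bit-count idea can be sketched as follows. This is only an illustration under assumptions: each unit is taken to be a positive integer (so that its top bit is 1), the transformation and concatenation steps are not shown, and the names `split_unit` and `join_unit` are hypothetical. In such a scheme the bit counts would be the values that get compressed, while the remaining bits are copied through unchanged.

```python
def split_unit(value: int):
    """Split a positive integer into (bit count, bits below the leading 1)."""
    if value < 1:
        raise ValueError("the top bit must be 1, so value must be >= 1")
    nbits = value.bit_length()
    low = format(value, "b")[1:]      # drop the leading 1, keep the rest as is
    return nbits, low


def join_unit(nbits: int, stream: str, pos: int = 0):
    """Rebuild one unit from its bit count and the raw bits at stream[pos:]."""
    low = stream[pos:pos + nbits - 1]
    return int("1" + low, 2), pos + nbits - 1


print(split_unit(21))        # (5, '0101')   since 21 is 10101 in binary
print(join_unit(5, "0101"))  # (21, 4)
```

The second value returned by `join_unit` is the position of the next unit's raw bits, so a sequence of units can be read back to back.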