3.2.4 Lossless Compression



 

           Data compression operates in general by taking "symbols" from an input data, processing them, and writing "codes" to a compressed file. Symbols are usually bytes, but they could just as easily be pixels, 80-bit floating point numbers, or EBCDIC characters. To be effective, a data compression scheme needs to be able to transform the compressed data back into an identical copy of the input data.

 

           Dictionary based compression systems operate by replacing groups of symbols in the input data with fixed length codes. A well-known example of a dictionary technique is LZW data compression[8]. LZW operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.

 

           Statistical methods of data compression take a completely different approach. They operate by encoding symbols one at a time. The symbols are encoded into variable length output codes. The length of the output code varies based on the probability or frequency of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits.

 

           In practice, the dividing line between statistical and dictionary methods is not always so distinct. Some schemes can't be clearly put in one camp or the other, and there are always hybrids which use features from both techniques. However, in this study we used arithmetic coding[9] to compress highly repeated quantized coefficients to minimize compressed size. Arithmetic coding is better than Huffman coding when many repetitions are occurred.

 

3.2.4.1  Arithmetic Coding

 

           It has only been in the last ten years that a respectable candidate to replace Huffman coding has been successfully demonstrated: Arithmetic coding. Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating point output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until recently that practical methods were found to implement this on computers with fixed sized registers.

 

           The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded have to have a set probabilities assigned to them. For example, if it is needed  to encode the random message "BILL GATES", there would be a probability distribution that looks like this:

 

 

 Character    Probability

---------    -----------

 SPACE          1/10

   A            1/10

   B            1/10

   E            1/10

   G            1/10

   I            1/10

   L            2/10

   S            1/10

   T            1/10

 
 

           Once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line", which is nominally 0 to 1. It doesn't matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine character symbol set used here would look like this:

 
 

  Character    Probability      Range

 ---------    -----------   -----------

  SPACE          1/10       0.00 - 0.10

    A            1/10       0.10 - 0.20

    B            1/10       0.20 - 0.30

    E            1/10       0.30 - 0.40

    G            1/10       0.40 - 0.50

    I            1/10       0.50 - 0.60

    L            2/10       0.60 - 0.80

    S            1/10       0.80 - 0.90

    T            1/10       0.90 - 1.00

 
 

           Each character is assigned the portion of the 0-1 range that corresponds to its probability of appearance. Note also that the character "owns" everything up to, but not including the higher number. So the letter 'T' in fact has the range 0.90 - 0.9999.... The most significant portion of an arithmetic coded message belongs to the first symbol to be encoded. When encoding the message "BILL GATES", the first symbol is "B". In order for the first character to be decoded properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. What is needed to encode this number is keep track of the range that this number could fall in. So after the first character is encoded, the low end for this range is 0.20 and the high end of the range is 0.30.

 

           After the first character is encoded, it is known that the range for output number is bounded by the low number and the high number. What happens during the rest of the encoding process is that each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, 'I', owns the range 0.50 through 0.60. If it was the first number in our message, it would be set the low and high range values directly to those values. But 'I' is the second character. So what is done instead is that 'I' owns the range that corresponds to 0.50-0.60 in the new subrange of 0.2 - 0.3. This means that the new encoded number will have to fall somewhere in the 50th to 60th percentile of the currently established range. Applying this logic will further restrict the number to the range 0.25 to 0.26.

 

           The algorithm to accomplish this for a message of any length is is shown below:

 

Set low to 0.0

Set high to 1.0

While there are still input symbols do

    get an input symbol

    code_range = high - low.

    high = low + range*high_range(symbol)

    low = low + range*low_range(symbol)

End of While

output low

 

           Following this process through its natural conclusion with the chosen message looks like this:

 

New Character   Low value        High Value

-------------   ---------        ----------

                0.0              1.0

    B           0.2              0.3

    I           0.25             0.26

    L           0.256            0.258

    L           0.2572           0.2576

  SPACE         0.25720          0.25724

    G           0.257216         0.257220

    A           0.2572164        0.2572168

    T           0.25721676       0.2572168

    E           0.257216772      0.257216776

    S           0.2572167752     0.2572167756

 

           So the final low value, 0.2572167752 will uniquely encode the message "BILL GATES" using the present encoding scheme.

 

           The algorithm for decoding the incoming number looks like this:

 

get encoded number

Do

    find symbol whose range straddles the encoded number

    output the symbol

    range = symbol low value - symbol high value

    subtract symbol low value from encoded number

    divide encoded number by range

until no more symbols

 

           The problem of how to decide when there are no more symbols left to decode can be solved by either encoding a special EOF symbol, or carrying the stream length along with the encoded message.

 

           The decoding algorithm for the "BILL GATES" message will precede something like this:

 

Encoded Number  Output Symbol    Low   High   Range

--------------  -------------    ---   ----   -----

0.2572167752          B          0.2   0.3     0.1

0.572167752           I          0.5   0.6     0.1

0.72167752            L          0.6   0.8     0.2

0.6083876             L          0.6   0.8     0.2

0.041938            SPACE        0.0   0.1     0.1

0.41938               G          0.4   0.5     0.1

0.1938                A          0.2   0.3     0.1

0.938                 T          0.9   1.0     0.1

0.38                  E          0.3   0.4     0.1

0.8                   S          0.8   0.9     0.1

0.0

 

           In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure, where the range is expanded in proportion to the probability of each symbol as it is extracted.

 


          

Copyright by Chasan Chouse