|
Data compression operates in general by taking "symbols" from an
input data, processing them, and writing "codes" to a compressed file. Symbols
are usually bytes, but they could just as easily be pixels, 80-bit floating
point numbers, or EBCDIC characters. To be effective, a data compression scheme
needs to be able to transform the compressed data back into an identical copy of
the input data.
Dictionary based compression systems operate by replacing groups of
symbols in the input data with fixed length codes. A well-known example of a
dictionary technique is LZW data compression[8]. LZW operates by replacing
strings of essentially unlimited length with codes that usually range in size
from 9 to 16 bits.
Statistical methods of data compression take a completely different
approach. They operate by encoding symbols one at a time. The symbols are
encoded into variable length output codes. The length of the output code varies
based on the probability or frequency of the symbol. Low probability symbols are
encoded using many bits, and high probability symbols are encoded using fewer
bits.
In practice, the dividing line between statistical and dictionary
methods is not always so distinct. Some schemes can't be clearly put in one camp
or the other, and there are always hybrids which use features from both
techniques. However, in this study we used arithmetic coding[9] to compress
highly repeated quantized coefficients to minimize compressed size. Arithmetic
coding is better than Huffman coding when many repetitions are occurred.
3.2.4.1 Arithmetic
Coding
It has only been in the last ten years that a respectable candidate
to replace Huffman coding has been successfully demonstrated: Arithmetic coding.
Arithmetic coding completely bypasses the idea of replacing an input symbol with
a specific code. Instead, it takes a stream of input symbols and replaces it
with a single floating point output number. The longer (and more complex) the
message, the more bits are needed in the output number. It was not until
recently that practical methods were found to implement this on computers with
fixed sized registers.
The output from an arithmetic coding process is a single number less
than 1 and greater than or equal to 0. This single number can be uniquely
decoded to create the exact stream of symbols that went into its construction.
In order to construct the output number, the symbols being encoded have to have
a set probabilities assigned to them. For example, if it is needed to encode the random message "BILL
GATES", there would be a probability distribution that looks like this:
Character
Probability
---------
-----------
SPACE
1/10
A
1/10
B
1/10
E
1/10
G
1/10
I
1/10
L
2/10
S
1/10
T
1/10
Once the character probabilities are known, the individual symbols
need to be assigned a range along a "probability line", which is nominally 0 to
1. It doesn't matter which characters are assigned which segment of the range,
as long as it is done in the same manner by both the encoder and the decoder.
The nine character symbol set used here would look like this:
Character Probability
Range
--------- ----------- -----------
SPACE
1/10
0.00 - 0.10
A
1/10 0.10 -
0.20
B
1/10
0.20 - 0.30
E
1/10
0.30 - 0.40
G
1/10
0.40 - 0.50
I
1/10
0.50 - 0.60
L
2/10
0.60 - 0.80
S
1/10
0.80 - 0.90
T
1/10
0.90 - 1.00
Each character is assigned the portion of the 0-1 range that
corresponds to its probability of appearance. Note also that the character
"owns" everything up to, but not including the higher number. So the letter 'T'
in fact has the range 0.90 - 0.9999.... The most significant portion of an arithmetic coded message belongs
to the first symbol to be encoded. When encoding the message "BILL GATES", the
first symbol is "B". In order for the first character to be decoded properly,
the final coded message has to be a number greater than or equal to 0.20 and
less than 0.30. What is needed to encode this number is keep track of the range
that this number could fall in. So after the first character is encoded, the low
end for this range is 0.20 and the high end of the range is 0.30.
After the first character is encoded, it is known that the range for
output number is bounded by the low number and the high number. What happens
during the rest of the encoding process is that each new symbol to be encoded
will further restrict the possible range of the output number. The next
character to be encoded, 'I', owns the range 0.50 through 0.60. If it was the
first number in our message, it would be set the low and high range values
directly to those values. But 'I' is the second character. So what is done
instead is that 'I' owns the range that corresponds to 0.50-0.60 in the new
subrange of 0.2 - 0.3. This means that the new encoded number will have to fall
somewhere in the 50th to 60th percentile of the currently established range.
Applying this logic will further restrict the number to the range 0.25 to 0.26.
The algorithm to accomplish this for a message of any length is is
shown below:
Set low to 0.0
Set high to 1.0
While there are still input symbols do
get an
input symbol
code_range
= high - low.
high = low
+ range*high_range(symbol)
low = low +
range*low_range(symbol)
End of While
output low
Following this process through its natural conclusion with the chosen
message looks like this:
New Character Low value High
Value
------------- ---------
----------
0.0
1.0
B
0.2
0.3
I
0.25
0.26
L
0.256
0.258
L
0.2572
0.2576
SPACE
0.25720
0.25724
G
0.257216
0.257220
A
0.2572164
0.2572168
T
0.25721676
0.2572168
E
0.257216772
0.257216776
S
0.2572167752
0.2572167756
So the final low value, 0.2572167752 will uniquely encode the message
"BILL GATES" using the present encoding scheme.
The algorithm for decoding the incoming number looks like
this:
get encoded number
Do
find symbol
whose range straddles the encoded number
output the
symbol
range =
symbol low value - symbol high value
subtract
symbol low value from encoded number
divide
encoded number by range
until no more symbols
The problem of how to decide when there are no more symbols left to
decode can be solved by either encoding a special EOF symbol, or carrying the
stream length along with the encoded message.
The decoding algorithm for the "BILL GATES" message will precede
something like this:
Encoded Number Output Symbol Low High Range
-------------- ------------- --- ---- -----
0.2572167752
B
0.2 0.3
0.1
0.572167752
I
0.5 0.6
0.1
0.72167752
L 0.6 0.8
0.2
0.6083876
L
0.6 0.8
0.2
0.041938
SPACE
0.0 0.1
0.1
0.41938
G
0.4 0.5
0.1
0.1938
A
0.2 0.3
0.1
0.938
T
0.9 1.0
0.1
0.38
E
0.3 0.4
0.1
0.8
S
0.8 0.9
0.1
0.0
In summary, the encoding process is simply one of narrowing the range
of possible numbers with every new symbol. The new range is proportional to the
predefined probability attached to that symbol. Decoding is the inverse
procedure, where the range is expanded in proportion to the probability of each
symbol as it is extracted.
|