On the Overhead of Range Coders
Timothy B. Terriberry
Abstract
We analyze the inefficiency of range coders due to finite precision arithmetic
and compare it with that of arithmetic coders.
1. Introduction
Arithmetic coding is a practical method of entropy coding that can be
implemented with finite precision
arithmetic [Riss76].
It operates at the bit level, generating zero or more bits of output per
coded symbol.
Range coding [Mart79], itself a rediscovery of the
FIFO arithmetic code of [Pasc76], is a
generalization of arithmetic coding that operates on digits of any base.
It became a popular alternative to arithmetic coding for two reasons.
First, it was developed independently and believed to be unpatented.
The original patents on arithmetic coding have expired, so this is no longer
a compelling argument.
Variants of both methods using table lookups or multiply-free approximations
were patented long after the original techniques were developed, and these
must still be avoided.
Second, it can operate at the byte level, making it almost twice as fast on
general microprocessors [Schi98].
For high-rate, real-time applications such as video coding, this is of critical
importance.
This article examines the relative compression inefficiency of using
(byte-wise) range coding instead of (bit-wise) arithmetic coding.
2. Background
An arithmetic coder maintains two b-bit state variables, L
and R.
These describe the current interval [L,L+R) into
which the final encoded value must fall.
Symbols are encoded by partitioning the interval R according to the
probability of each symbol.
These probabilities are represented as cumulative frequency counts.
Three values are passed to the encoder to encode a symbol s.
The first, l_{s}, is the
cumulative sum of the frequencies of the symbols less than s.
The second, h_{s}, is the
cumulative sum of the frequencies of the symbols up to and including
s.
The last, t_{s}, is the total
sum of the frequencies of all the symbols.
These are all represented with f bits of precision as integers in the
range [0,2^{f}].
The probability of symbol s is
p_{s} = (h_{s} − l_{s})/t_{s} .
Hereafter we will drop the suffix s when it is immaterial.
Assuming the model is accurate, the new range size after coding a symbol should
be exactly

R·p_{s} = R·(h_{s} − l_{s})/t_{s} . (1)
However, finite precision arithmetic deviates from this exact value, causing
coding inefficiency.
Rissanen's original coder requires two
b×(f+1)→(b+f+1) multiplications and two
(b+f+1)×(f+1)→b divisions per
coded symbol to update the range:
R' ← ⌊(R·h)/t⌋ − ⌊(R·l)/t⌋ . (2)
Moffat et al. introduced an approach that requires only one
b×(f+1)→b division and two
b×(f+1)→b
multiplications [MNW98]:
r ← ⌊R/t⌋ , (3)

R' ← { r·(h − l) ,     l > 0 ,
     { R − r·(t − h) , l = 0 ^{1}. (4)
(4) 
The error here is clearly larger than in Rissanen's method, assuming a fixed
value of b.
However, the reduced precision requirements allow a larger b
and f to be used without requiring extended precision
arithmetic, reducing the total overhead for a given computation budget.
This makes [MNW98] the preferred method of
implementing arithmetic coders.
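As an illustration only (not the authors' reference implementation), the update of (3) and (4) can be sketched in Python. The function name and the exact subinterval layout are our own; the layout below is one consistent choice in which the symbol with l = 0 sits at the bottom of the interval and absorbs the excess from the truncated division, matching the orientation described in footnote 1.

```python
def update(R, L, l, h, t):
    """One coding step per eqs. (3)-(4): returns the new (R, L).

    (l, h, t) are the coded symbol's cumulative frequency counts.
    The symbol with l = 0 absorbs the truncation excess from the
    single division; layout is illustrative, not the author's exact one.
    """
    r = R // t                    # eq. (3): the single truncated division
    if l > 0:
        L += R - r * (t - l)      # second multiply, used only to move L
        R = r * (h - l)           # eq. (4), case l > 0
    else:
        R = R - r * (t - h)       # eq. (4), case l = 0
    return R, L
```

Coding each symbol of a full partition of [0, t) from the same starting state yields subinterval widths that sum exactly to the old R, so no probability mass is lost, only redistributed by the truncation.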
Moffat et al. conducted an analysis of the error of their arithmetic
coder and showed that as long as
b − f ≥ 6, the worst-case overhead is
less than 1%.
However, a similar analysis has not been conducted for a range coder.
The only practical difference between modern arithmetic coders and range coders
is the renormalization loop.
To prevent R from growing too small and losing all precision in the
division of (3), after each update R is
renormalized so that it lies in the interval
[2^{b−c−1}, 2^{b−1}) .
The top bit is reserved as a carry bit.
This normalization is done by outputting the top c bits
of L and scaling both L and R
by 2^{c} until R lies in the
interval.
Setting c = 1 yields a normal arithmetic coder, but
setting c = 8 produces a range coder.
The difference in the allowable values of R leads to different
error analysis results.
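The renormalization loop can be sketched as follows. The function name and bit layout are illustrative; the sketch assumes no carry is pending and omits carry propagation into already-emitted output, which a real coder must handle.

```python
def renormalize(R, L, b, c, out):
    """Sketch: scale R back into [2^(b-c-1), 2^(b-1)), emitting the top
    c value bits of L per iteration.

    Bit b-1 of L is the reserved carry bit; this simplified sketch
    assumes it is clear and omits carry propagation.
    """
    value_bits = b - 1                    # bits of L below the carry bit
    while R < (1 << (b - c - 1)):
        out.append((L >> (value_bits - c)) & ((1 << c) - 1))  # top c value bits
        L = (L << c) & ((1 << value_bits) - 1)                # shift them out
        R <<= c
    return R, L
```

With c = 1 each iteration emits a single bit (an arithmetic coder); with c = 8 each iteration emits a whole byte (a range coder), so the loop runs up to eight times less often.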
3. Error Analysis
3.1 Worst Case Analysis
Following the analysis in [MNW98], the maximum
overhead due to finite precision arithmetic is, in bits per symbol,
E(p_{0},r) = p_{0}·log_{2}(p_{0}·(r+1)/(1+p_{0}·r)) + (1−p_{0})·log_{2}((r+1)/r) , (5)
where p_{0} is the probability of
symbol 0.
This overhead is maximal when
p_{0} = 0 , and reduces
to zero when p_{0} = 1 .
The worst-case value of r is
2^{b−f−c−1},
since this maximizes (r+1)/r while keeping R
and t within their admissible ranges.
This gives an absolute worst-case per-symbol overhead of

E^{*} = E(p_{0}=0, r=2^{b−f−c−1}) = log_{2}((r+1)/r) . (6)
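Equation (6) is simple to evaluate directly. A small helper (the function name is ours):

```python
import math

def worst_case_overhead(b, c, f):
    """Worst-case per-symbol overhead E* of eq. (6), in bits."""
    r = 1 << (b - f - c - 1)   # smallest admissible value of r = floor(R/t)
    return math.log2((r + 1) / r)
```

For example, worst_case_overhead(32, 1, 24) evaluates log_{2}(65/64) ≈ 0.0224 bits per symbol.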
3.2 Average Case Analysis
The worst-case analysis is extremely pessimistic, since there is only a small
probability that R attains the smallest possible value in its
range.
Moffat et al. argue that, under mild assumptions about the input
probabilities^{2},
R is distributed proportionally to 1/R.
Then the overhead, averaged over all possible internal states, is given by
r' = R/t , (7)

K = Σ_{R=2^{b−c−1}}^{2^{b−1}−1} (1/R) , (8)

Ē^{*} = (1/K)·Σ_{R=2^{b−c−1}}^{2^{b−1}−1} (1/R)·(p_{0}·log_{2}(p_{0}·r'/((r'−r)+p_{0}·r)) + (1−p_{0})·log_{2}(r'/r)) , (9)

where r = ⌊R/t⌋ as in (3).
We assume that
t = 2^{f}, since
this gives the modeler the maximum precision available with which to represent
the symbol probabilities.
Then the overhead, using the worst-case assumption
p_{0} = 0 and averaging over
all possible internal states, is given by

Ē^{*} = (1/K)·Σ_{R=2^{b−c−1}}^{2^{b−1}−1} (1/R)·log_{2}((R/2^{f})/⌊R/2^{f}⌋) . (10)
Moffat et al. argue that when b is large, the resulting values of
Ē^{*} are relatively independent
of b.
Indeed, by using two conservative approximations, we can obtain estimates that
depend only on the difference b − f.
The first approximates the normalization factor:

K ≈ ln(2^{b−1}/2^{b−c−1}) = c·ln(2) = c/log_{2}(e) . (11)
The next replaces the sum over all of the values of R which have
the same ⌊R/2^{f}⌋ with a
single closed-form expression, under the assumption that f is
large:

Σ_{R=r·2^{f}}^{(r+1)·2^{f}−1} (1/R)·log_{2}((R/2^{f})/⌊R/2^{f}⌋)
  ≈ ∫_{R=r·2^{f}}^{(r+1)·2^{f}} (1/R)·log_{2}((R/2^{f})/r) dR (12)
  = (log_{2}(r+1) − log_{2}(r))^{2}/(2·log_{2}(e)) . (13)
(13) 
Substituting these two approximations into the expression
for Ē^{*} yields
Ē^{*} ≈ (1/(2·c))·Σ_{r=2^{b−f−c−1}}^{2^{b−f−1}−1} (log_{2}(r+1) − log_{2}(r))^{2} , (14)
which is much easier to evaluate when both b and f
are large.
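The sum in (14) can be evaluated directly in a few lines; a small helper (the function name is ours, the symbols are those defined above):

```python
import math

def avg_overhead(b, c, f):
    """Approximate average per-symbol overhead of eq. (14), in bits."""
    lo = 1 << (b - f - c - 1)   # smallest value of r = floor(R/t)
    hi = 1 << (b - f - 1)       # one past the largest value of r
    total = sum((math.log2(r + 1) - math.log2(r)) ** 2 for r in range(lo, hi))
    return total / (2 * c)
```

This reproduces, for example, Ē^{*} ≈ 0.00203 bits per symbol for b = 32, c = 1, f = 22.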
3.3 Comparison with Typical Values
The maximum allowable value of f is
b − c − 1, meaning that arithmetic
coders allow 7 extra bits of precision in the model probabilities.
This extra room is only useful if the probabilities are accurate to this
precision, but when using very large alphabets with small per-symbol
probabilities, it can be helpful.
However, as Moffat et al. showed, using an f close to this limit
introduces a substantial amount of overhead, as illustrated in
Table 1.
b    c   f    E^{*}          Ē^{*}
32   1   30   1.0            0.5
32   1   29   0.584962501    0.257218627
32   1   28   0.321928095    0.129697656
32   1   27   0.169925001    0.064993614
32   1   26   0.087462841    0.032515222
32   1   25   0.044394119    0.016259923
32   1   24   0.022367813    0.008120251
32   1   23   0.011227255    0.004065162
32   1   22   0.005624549    0.002032585
Table 1: The worst case and average overhead, in
bits per symbol, of an arithmetic coder for different amounts of precision in
the model probabilities.
Ē^{*} was evaluated using the
approximation in (14), but the values match those
Moffat et al. reported
using (10) with b = 20 to at
least three decimal places.

However, for a range coder, the difference between the average overhead and the
worst case overhead becomes much larger, as illustrated in
Table 2, since the worst case becomes so much more
unlikely.
b    c   f    E^{*}          Ē^{*}
32   8   23   1.0            0.126610057
32   8   22   0.584962501    0.064364130
32   8   21   0.321928095    0.032338838
32   8   20   0.169925001    0.016190150
32   8   19   0.087462841    0.008097707
32   8   18   0.044394119    0.004049184
32   8   17   0.022367813    0.002024633
32   8   16   0.011227255    0.001012322
32   8   15   0.005624549    0.000506162

Table 2: The worst case and average overhead, in
bits per symbol, of a range coder for different amounts of precision
in the model probabilities.
Ē^{*} was evaluated using the
approximation in (14).

For a fixed value of
b − f − c , the worst case
overhead E^{*} does not change.
This means that to ensure a worst case overhead of less than 1%, a range
coder requires 7 more bits of headroom, or
b − f ≥ 13 .
The average overhead is about four times smaller, meaning one can achieve an
expected overhead of less than 1% with
b − f ≥ 10 , compared to 5 for
an arithmetic coder.
3.4 Asymptotic Analysis
As can be seen in these tables,
both E^{*}
and Ē^{*} quickly become almost
directly proportional to t.
This raises questions about the constants of proportionality in these values
and their ratios.
For the worst case overhead, it is easy to show that
lim_{b−f → ∞} E^{*}·2^{b−f} = 2^{c+1}·log_{2}(e) . (15)
The average case requires more work.
As b − f grows large, the sum
in (14) can also be replaced by an integral, though
in this case the result is always less than the true overhead, and so is not a
conservative estimate.
It does, however, produce a closedform approximation for the average overhead,
which can be used in asymptotic analysis:
Ē^{*} ≈ (1/(2·c))·∫_{r=2^{b−f−c−1}}^{2^{b−f−1}} (log_{2}(r+1) − log_{2}(r))^{2} dr (16)

    = (2^{b−f−1}·(b−f−1)^{2} − 2^{b−f−c−1}·(b−f−c−1)^{2}
       + 2·log_{2}(e)^{2}·(Li_{2}(−2^{b−f−c−1}) − Li_{2}(−2^{b−f−1}))
       + (1+2^{b−f−1})·log_{2}(1+2^{b−f−1})·(log_{2}(1+2^{b−f−1}) − 2·(b−f−1))
       − (1+2^{b−f−c−1})·log_{2}(1+2^{b−f−c−1})·(log_{2}(1+2^{b−f−c−1}) − 2·(b−f−c−1)))/(2·c) , (17)
where Li_{2}(·) is
Spence's
function, a special case of the
polylogarithm.
This can be used to show that
lim_{b−f → ∞} Ē^{*}·2^{b−f} = (2^{c} − 1)·(log_{2}(e))^{2}/c . (18)
Ironically, using this limit to construct a linear approximation
of Ē^{*} in terms
of t is more accurate than (17).
Both (15) and (18) are
approached from below, meaning that they are conservative estimates of the
overhead rate.
The linearity in t also suggests that when t is allowed
to vary between 2^{f−1}
and 2^{f}, instead of being fixed
at 2^{f}, the average overhead
will be reduced by about 25%.
This is common in frequency-count-based modelers, which periodically rescale
the counts to keep
t ≤ 2^{f} .
Armed with these proportionality constants, the first question we can answer
is, how much better is the average case than the worst case?
lim_{b−f → ∞} Ē^{*}/E^{*} = (2^{c} − 1)·log_{2}(e)/(2^{c+1}·c) . (19)
This limit is approached from above, meaning that this is the best improvement
over the worst case that one can expect as the headroom
b − f increases.
For c = 1, this means the average overhead is
about 36.1% of the worst case overhead (compared to Moffat et al.'s rough
estimate of 40%).
For a range coder, this works out to 8.98%.
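These percentages can be checked directly against the right-hand side of (19); a small helper (the function name is ours):

```python
import math

def avg_to_worst_ratio(c):
    """Limiting ratio of average to worst-case overhead, eq. (19)."""
    return (2 ** c - 1) * math.log2(math.e) / (2 ** (c + 1) * c)
```

Evaluating at c = 1 and c = 8 gives approximately 0.361 and 0.0898, the figures quoted above.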
To compare two different output
sizes c_{1}
and c_{2} directly, it is easy to
show that
lim_{b−f → ∞} E_{c_{1}}^{*}/E_{c_{2}}^{*} = 2^{c_{1}−c_{2}} . (20)
This limit is approached from below, meaning that the worst case overhead for a
range coder is no more than 128 times that of an arithmetic coder for the
same amount of headroom.
Repeating the analysis for the average case, one finds that
lim_{b−f → ∞} Ē_{c_{1}}^{*}/Ē_{c_{2}}^{*} = ((2^{c_{1}} − 1)·c_{2})/((2^{c_{2}} − 1)·c_{1}) . (21)
Again, this limit is approached from below, meaning that the average overhead
for a range coder is no more than 31.875 times that of an arithmetic
coder.
Thus, to achieve the same average performance, a range coder will require no
more than 5 additional bits of headroom.
4. Conclusion
The speed benefit of range coders comes with an additional overhead penalty.
This article has attempted to quantify how large this penalty is, and to
allow an estimation of the overhead rate in terms of the word size,
probability precision, and output size.
This allows the establishment of an appropriate tradeoff between precision of
the model and the headroom required by the coder.
It also allows quick exploration of other design choices.
For example, many current DSPs do not offer byte-wise addressing, making it
more convenient to use c = 16 than
c = 8.
A quick glance at (15)
and (18) shows that the worst case overhead would
increase by a factor of up to 256, and the average case overhead would
increase by a factor of up to 128.5.
Thus one would expect to require an additional 7 bits of headroom to
achieve a comparable average overhead.
If b = 32 and f is no larger than 15,
then this would still provide an expected overhead of less than 1%.
However, for wordbased text compression, the alphabet size alone may grow
larger than 2^{15}, making this clearly
unacceptable for that purpose.
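The factors of 256 and 128.5 follow directly from the proportionality constants in (15) and (18); a quick check (function names are ours):

```python
import math

def worst_case_constant(c):
    """Proportionality constant of eq. (15): lim E* * 2^(b-f)."""
    return 2 ** (c + 1) * math.log2(math.e)

def avg_case_constant(c):
    """Proportionality constant of eq. (18): lim avg E* * 2^(b-f)."""
    return (2 ** c - 1) * math.log2(math.e) ** 2 / c
```

Moving from c = 8 to c = 16 multiplies the first constant by worst_case_constant(16)/worst_case_constant(8) = 256 and the second by avg_case_constant(16)/avg_case_constant(8) = 128.5.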
When choosing f, one must trade off improvement in coder overhead
against impairment in modeler accuracy.
If one is using a frequency-count-based model, a smaller f also
means more frequent rescaling, and thus carries some computational overhead
as well.
^{1}Although
only one of the two multiplications in (4) is needed
for each symbol, another is needed to update L.
This formula also reverses the orientation of Moffat et al.'s original
formulas, so that the first symbol receives all the extra weight due to the
truncation in the division, instead of the last.
This formulation is more convenient, as symbol 0 is often the most probable,
so no rearrangement of the alphabet is required to minimize the overhead.
^{2}For
example, if the probabilities are all exact powers of two,
R will only take on a small subset of its possible values.
References
 [Mart79]

G. Nigel N. Martin:
"Range
Encoding: An Algorithm for Removing Redundancy from a Digitised Message."
In Proc. of the Institution of Electronic and Radio Engineers International
Conference on Video and Data Recording, No. 43, July 1979.
 [MNW98]

Alistair Moffat,
Radford M. Neal, and
Ian H. Witten:
"Arithmetic Coding Revisited."
ACM Transactions on Information Systems 16(3):256–294, July 1998.
 [Pasc76]

Richard C. Pasco:
"Source Coding Algorithms for
Fast Data Compression."
Ph.D. thesis, Stanford University Department of Electrical Engineering,
May 1976.
 [Riss76]

Jorma J. Rissanen:
"Generalised
Kraft Inequality and Arithmetic Coding."
IBM Journal of Research and Development 20(3):198–203, May 1976.
 [Schi98]

Michael Schindler:
"A
Fast Renormalisation for Arithmetic Coding."
In Proc. of the 8th Data Compression Conference (DCC'98), p. 572,
March 1998.
Comments or questions?
Send them to <tterribe@xiph.org>.

Created 7 Mar. 2008, last updated 7 Jun. 2008 