On the Overhead of Range Coders

Abstract

We analyze the inefficiency of range coders due to finite precision arithmetic and compare it with that of arithmetic coders.

1. Introduction

Arithmetic coding is a practical method of entropy coding that can be implemented with finite precision arithmetic [Riss76]. It operates at the bit level, generating zero or more bits of output per each coded symbol. Range coding [Mart79], itself a rediscovery of the FIFO arithmetic code of [Pasc76], is a generalization of arithmetic coding that operates on digits of any base. It became a popular alternative to arithmetic coding for two reasons. First, it was developed independently and believed to be unpatented. The original patents on arithmetic coding have expired, so this is no longer a compelling argument. Variants of both methods using table look-ups or multiply-free approximations were patented long after the original techniques were developed, and these must still be avoided. Second, it can operate at the byte level, making it almost twice as fast on general microprocessors [Schin98]. For high-rate, real-time applications such as video coding, this is of critical importance. This article examines the relative compression inefficiency of using (byte-wise) range coding instead of (bit-wise) arithmetic coding.

2. Background

An arithmetic coder maintains two b-bit state variables, L and R. These describe the current interval [L,L+R) into which the final encoded value must fall. Symbols are encoded by partitioning the interval R according to the probability of of each symbol. These probabilities are represented as cumulative frequency counts. Three values are passed to the encoder to encode a symbol s. The first, l_s, is the cumulative sum of the frequencies of the symbols less than s. The second, h_s, is the cumulative sum of the frequencies of the symbols up to and including s. The last, t_s, is the total sum of the frequencies of all the symbols. These are all represented with f bits of precision as integers in the range [0,2^f]. The probability of symbol s is p_s = (h_s - l_s)/t_s . Hereafter we will drop the suffix s when it is immaterial.

Assuming the model is accurate, the new range size after coding a symbol should be exactly

R' ← R·(h - l)/t .

(1)

However, finite precision arithmetic deviates from this exact value, causing coding inefficiency. Rissanen's original coder requires two b×(f+1)→(b+f+1) multiplications and two (b+f+1)×(f+1)→b divisions per coded symbol to update the range:

R' ← ⌊(R·h) / t⌋ - ⌊(R·l) / t⌋ .

(2)

Moffat et al. introduced an approach that requires only one b×(f+1)→b division and two b×(f+1)→b multiplications [MNW98]:

r	← ⌊R / t⌋ ,
R'	←	{	r·(h - l) ,	l > 0 ,
			R - r·(t-h) ,	l = 0 ¹.

(3)

(4)

The error here is clearly larger than in Rissanen's method, assuming a fixed value of b. However, the reduced precision requirements allow a larger b and f to be used without requiring extended precision arithmetic, reducing the total overhead for a given computation budget. This makes [MNW98] the preferred method of implementing arithmetic coders. Moffat et al. conducted an analysis of the error of their arithmetic coder and showed that as long as b - f ≥ 6, the worst-case overhead is less than 1%. However, a similar analysis has not been conducted for a range coder.

The only practical difference between modern arithmetic coders and range coders is the renormalization loop. To prevent R from growing too small and losing all precision in the division of (3), after each update R is renormalized so that it lies in the interval [2^b-c-1,2^b-1) . The top bit is reserved as a carry bit. This normalization is done by outputting the top c bits of R and scaling R by 2^c until it lies in the interval. Setting c = 1 yields a normal arithmetic coder, but setting c = 8 produces a range coder. The difference in the allowable values of R leads to different error analysis results.

3. Error Analysis

3.1 Worst Case Analysis

Following the analysis in [MNW98], the maximum overhead due to finite precision arithmetic is, in bits per symbol,

E(p₀,r) = p₀·log₂(p₀·(r+1)/(1+p₀·r)) + (1-p₀)·log₂((r+1)/r) ,

(5)

where p₀ is the probability of symbol 0. This overhead is maximal when p₀ = 0 , and reduces to zero when p₀ = 1 . The worst-case value of r is 2^b-f-c-1, since this maximizes (r+1)/r while keeping R and t within their admissible ranges. This gives an absolute worst-case per-symbol overhead of

E^* = E(p₀=0,r=2^b-f-c-1) = log₂((r+1)/r) .

(6)

3.2 Average Case Analysis

The worst-case analysis is extremely pessimistic, since there is only a small probability that R obtains the smallest possible value in its range. Moffat et al. argue that, under mild assumptions about the input probabilities², R is distributed proportionally to 1/R. Then the overhead, averaged over all possible internal states, is given by

r'	= R/t ,
		2^b-1-1
K	=	∑	1/R ,
		R = 2^b-c-1
		2^b-1-1
Ē^*	=	∑	(1/R)·(p₀·log(p₀r'/((r'-r)+p₀r)) + (1-p₀)·log₂(r'/r)) / K .
		R = 2^b-c-1

(7)

(8)

(9)

We assume that t = 2^f, since this gives the modeler the maximum precision available with which to represent the symbol probabilities. Then the overhead, using the worst-case assumption p₀ = 0 and averaging over all possible internal states, is given by

	2^b-1-1
Ē^* =	∑	(1/R)·log₂((R/2^f)/⌊R/2^f⌋) / K .
	R = 2^b-c-1

(10)

Moffat et al. argue that when b is large, the resulting values of Ē^* are relatively independent of b. Indeed, by using two conservative approximations, we can obtain estimates that depend only on the difference b - f. The first approximates the normalization factor:

K ≈ c/log₂(e) .

(11)

The next replaces the sum over all of the values of R which have the same ⌊R/2^f⌋ with a single closed-form expression, under the assumption that f is large:

(r+1)·2^f-1			(r+1)·2^f
∑	(1/R)·log₂((R/2^f)/⌊R/2^f⌋)	≈ 2^f	∫	(1/R)·log₂((R/2^f)/r) dR ,
R = r·2^f			R = r·2^f
		= (log₂(r+1) - log₂(r))²/(2·log₂(e)) .

(12)

(13)

Substituting these two approximations into the expression for Ē^* yields

	2^b-f-1-1
Ē^* ≈ (1/(2c))·	∑	(log₂(r+1) - log₂(r))² ,
	r = 2^b-f-c-1

(14)

which is much easier to evaluate when both b and f are large.

3.3 Comparison with Typical Values

The maximum allowable value of f is b - c - 1, meaning that arithmetic coders allow 7 extra bits of precision in the model probabilities. This extra room is only useful if the probabilities are accurate to this precision, but when using very large alphabets with small per-symbol probabilities, it can be helpful. However, as Moffat et al. showed, using an f close to this limit introduces a substantial amount of overhead, as illustrated in Table 1.

b	c	f	E^*	Ē^*
32	1	30	1.0	0.5
32	1	29	0.584962501	0.257218627
32	1	28	0.321928095	0.129697656
32	1	27	0.169925001	0.064993614
32	1	26	0.087462841	0.032515222
32	1	25	0.044394119	0.016259923
32	1	24	0.022367813	0.008120251
32	1	23	0.011227255	0.004065162
32	1	22	0.005624549	0.002032585

Table 1: The worst case and average overhead, in bits per symbol, of an arithmetic coder for different amounts of precision in the model probabilities. Ē^* was evaluated using the approximation in (14), but the values match those Moffat et al. reported using (10) with b = 20 to at least three decimal places.

However, for a range coder, the difference between the average overhead and the worst case overhead becomes much larger, as illustrated in Table 2, since the worst case becomes so much more unlikely.

b	c	f	E^*	Ē^*
32	8	23	1.0	0.126610057
32	8	22	0.584962501	0.064364130
32	8	21	0.321928095	0.032338838
32	8	20	0.169925001	0.016190150
32	8	19	0.087462841	0.008097707
32	8	18	0.044394119	0.004049184
32	8	17	0.022367813	0.002024633
32	8	16	0.011227255	0.001012322
32	8	15	0.005624549	0.000506162

Table 2: The worst case and average overhead, in bits per symbol, of a range coder for different amounts of precision in the model probabilities. Ē^* was evaluated using the approximation in (14).

For a fixed value of b - f - c , the worst case overhead E^* does not change. This means that to ensure a worst case overhead of less than 1%, a range coder requires 7 more bits of headroom, or b - f ≥ 13 .

The average overhead is about four times smaller, meaning one can achieve an expected overhead of less than 1% with b - f ≥ 10 , compared to 5 for an arithmetic coder.

3.4 Asymptotic Analysis

As can be seen in these tables, both E^* and Ē^* quickly become almost directly proportional to t. This raises questions about the constants of proportionality in these values and their ratios.

For the worst case overhead, it is easy to show that

lim	E^*·2^b-f = 2^c+1log₂(e) .
b-f → ∞

(15)

The average case requires more work. As b - f grows large, the sum in (14) can also be replaced by an integral, though in this case the result is always less than the true overhead, and so is not a conservative estimate. It does, however, produce a closed-form approximation for the average overhead, which can be used in asymptotic analysis:

		2^b-f-1
Ē^*	≈ (1/(2c))·	∫	(log₂(r+1) - log₂(r))² dr ,
		r = 2^b-f-c-1
	= (2^b-f-1(b-f-1)² - 2^b-f-c-1(b-f-c-1)² + 2log₂(e)²(Li₂(-2^b-f-c-1) - Li₂(-2^b-f-1))
	+ (1+2^b-f-1)log₂(1+2^b-f-1)·(log₂(1+2^b-f-1) - 2(b-f-1))
	- (1+2^b-f-c-1)log₂(1+2^b-f-c-1)·(log₂(1+2^b-f-c-1) - 2(b-f-c-1)))/(2c) ,

(16)

(17)

where Li₂(·) is Spence's function, a special case of the polylogarithm. This can be used to show that

lim	Ē^*·2^b-f = (2^c - 1)·(log₂(e))² / c .
b-f → ∞

(18)

Ironically, using this limit to construct a linear approximation of Ē^* in terms of t is more accurate than (17). Both (15) and (18) are approached from below, meaning that they are conservative estimates of the overhead rate. The linearity in t also suggests that when t is allowed to vary between 2^f-1 and 2^f, instead of being fixed at 2^f, the average overhead will be reduced by a factor of 25%. This is common in frequency count based modelers, which periodically rescale the counts to keep t ≤ 2^f .

Armed with these proportionality constants, the first question we can answer is, how much better is the average case than the worst case?

lim	Ē^* / E^* = (2^c-1)log₂(e)/(2^c+1c) .
b-f → ∞

(19)

This limit is approached from above, meaning that this is the best improvement over the worst case that one can expect as the headroom b - f increases. For c = 1, this means the average overhead is about 36.1% of the worst case overhead (compared to Moffat et al.'s rough estimate of 40%). For a range coder, this works out to 8.98%.

To compare two different output sizes c₁ and c₂ directly, it is easy to show that

lim	E_c₁^* / E_c₂^* = 2^c₁-c₂ .
b-f → ∞

(20)

This limit is approached from below, meaning that the worst case overhead for a range coder is no more than 128 times that of an arithmetic coder for the same amount of headroom.

Repeating the analysis for the average case, one finds that

lim	Ē_c₁^* / Ē_c₂^* = (2^c₁-1)·c₂ / ((2^c₂-1)·c₁) .
b-f → ∞

(21)

Again, this limit is approached from below, meaning that the average overhead for a range coder is no more than 31.875 times that of an arithmetic coder. Thus, to achieve the same average performance, a range coder will require no more than 5 additional bits of headroom.

4. Conclusion

The speed benefit of range coders comes with an additional overhead penalty. This article has attempted to quantify how large this penalty is, and to allow an estimation of the overhead rate in terms of the word size, probability precision, and output size. This allows the establishment of an appropriate trade-off between precision of the model and the headroom required by the coder.

It also allows quick exploration of other design choices. For example, many current DSPs do not offer byte-wise addressing, making it more convenient to use c = 16 than c = 8. A quick glance at (15) and (18) shows that the worst case overhead would increase by a factor of up to 256, and the average case overhead would increase by a factor of up to 128.5. Thus one would expect to require an additional 7 bits of headroom to achieve a comparable average overhead. If b = 32 and f is no larger than 15, then this would still provide an expected overhead of less than 1%.

However, for word-based text compression, the alphabet size alone may grow larger than 2¹⁵, making this clearly unacceptable for that purpose. When choosing f, one must trade off improvement in coder overhead against impairment in modeler accuracy. If one is using a frequency-count based model, a smaller f also means more frequent rescaling, and thus carries some computational overhead as well.

¹Although only one of the two multiplications in (4) is needed for each symbol, another is needed to update L. This formula also reverses the orientation of Moffat et al.'s original formulas, so that the first symbol receives all the extra weight due to the truncation in the division, instead of the last. This formulation is more convenient, as 0 is often the most probable, so no rearrangement of the alphabet is required to minimize the overhead.

²For example, if the probabilities are all exact powers of two, R will only take on a small subset of its possible values.

References

[Mart79]: G. Nigel N. Martin: "Range Encoding: An Algorithm for Removing Redundancy from a Digitised Message." In Proc. of the Institution of Electronic and Radio Engineers International Conference on Video and Data Recording, No. 43, July 1979.
[MNW98]: Alistair Moffat, Radford M. Neal, and Ian H. Witten: "Arithmetic Coding Revisited." ACM Transactions on Information Systems 16(3):256–294, July 1998.
[Pasc76]: Richard C. Pasco: "Source Coding Algorithms for Fast Data Compression." Ph.D. thesis, Stanford University Department of Electrical Engineering, May 1976.
[Riss76]: Jorma J. Rissanen: "Generalised Kraft Inequality and Arithmetic Coding." IBM Journal of Research and Development 20(3):198–203, May 1976.
[Schi98]: Michael Schindler: "A Fast Renormalisation for Arithmetic Coding." In Proc. of the 8th Data Compression Conference (DCC'98), p. 572, March 1998.

Comments or questions? Send them to <tterribe@xiph.org>.

Created 7 Mar. 2008, last updated 7 Jun. 2008