next generation video: Introducing Daala

aala is a new general-purpose video codec currently under development at Xiph.Org. Our performance target is roughly a generation beyond current 'next-generation' codecs like VP9 and HEVC, making Daala a next-next-generation effort. As with other Xiph codecs, the Daala format is and will always be royalty-free with a permissive FOSS license.

On May 30th 2013, our in-development Daala prototype encoded and decoded its first streams. Two hours later, Mozilla's David "oneman" Richards streamed the first live Daala video over the Internet. Although the project is still pre-pre-alpha, I think it's time to introduce Daala to the world.

Meet Crazy Eddie

The next-generation VP9 and HEVC codecs are the latest incremental refinements of a basic codec design that dates back 25 years to h.261. This conservative, linear development strategy evolving a proven design has yielded reliable improvement with relatively low risk, but the law of diminishing returns is setting in. Recent performance increases are coming at exponentially increasing computational cost.

Daala tries for a larger leap forward— by first leaping sideways— to a new codec design and numerous novel coding techniques. In addition to the technical freedom of starting fresh, this new design consciously avoids most of the patent thicket surrounding mainstream block-DCT-based codecs. At its very core, for example, Daala is based on lapped transforms, not the traditional DCT.

Strengths and Weaknesses of the DCT

Modern codecs typically encode video at bitrates that work out to substantially less than one bit per pixel. They do this by grouping pixels together, typically into 4x4, 8x8 or 16x16 blocks, then transforming the blocks with a two-dimensional Discrete Cosine Transform (DCT).

Above: Illustration of a basic video encoder; each video frame is broken into individual 8x8 pixel blocks, then each block is transformed by the DCT. More complex encoders still perform these high level steps.

The DCT converts the blocks of pixels into same-sized blocks of frequency coefficients. It also compacts most of the energy from the original pixels into just a few coefficients, arranged such that reduced precision in the frequency domain is far less noticeable than reduced precision in the spatial domain. Thus, the DCT makes it possible to encode the image in less space, as the most important aspects of the image are represented in a more compact form.

Above: Illustration of a single 8x8 block of grayscale pixels transformed into an 8x8 block of frequency coefficients by the DCT. Although the DCT outputs the same number of frequency coefficients as input pixels, the DCT coefficients typically represent the same information in a more compact form that can be stored in fewer bits. In addition, purposely losing precision in the DCT domain is much less noticable than losing the same amount of information in the spatial domain. This is the primary basis of operation of most modern lossy video codecs.

An encoder then quantizes the DCT coefficients to a lower precision, usually via a complex heuristic algorithm that's been carefully tuned. Let's handwave that away for now, and do the simplest possible thing: quantize all the coefficients by a fixed quantizer. The decoder then uses an inverse 2D DCT to reconstruct the original image block by block.

Above: Illustration of a basic decoder reconstructing the original image from blocks of quantized DCT coefficients. Each block of quantized coefficients is converted back into a block of pixels by an inverse DCT. The blocks of pixels are then reassembled into a complete image.

There's an obvious change here; we can clearly distinguish the sharp boundary of each individiual block in the reconstructed image. A DCT is a circular [edit: braino there, I meant symmetric not circular] transform, and so the block edges introduce strong discontinuities. When we invert the transform, the spatial error introduced by quantization causes the block edges to no longer 'line up'. Average spatial error is also greater at the block edges, which further exacerbates the problem.

Above: Illustration of the 'screen door' effect, in which the reconstructed blocks of a DCT-based codec show an obvious grid of boundaries. The effect is particularly noticeable in video, where the grid stays fixed while objects in the video move. Even motion-compensated codecs can display this effect when the underlying blocks do not move perfectly in sync with objects in the image.

For this reason, virtually all video codecs use a deblocking filter to smooth over the edges between blocks. This filter may be a purely post-processing step, or it may exist within the encoding loop in which case it's called a loop filter. This filter began life decades ago as something of an afterthought, but it's an integral part of modern codecs accounting for a significant portion of the design complexity and CPU cost.

A deblocking filter is at best a necessary evil. It discards legitimate detail at block edges at the same time it smooths the boundary discontinuities. Further, the details the deblocking filter unintentionally discards cost bits and CPU time to encode. Finally, good deblocking filters are complex and heuristic, requiring detailed analysis (and in some cases, additional signaling) to control conditional application. The heuristics have become quite good, but they can go wrong.

A lapped transform renders this necessary evil unneccessary. It has some other nice benefits as well.

Above: Comparison of original image to images transformed and quantized using the DCT and Daala's lapped transform. Both transforms used the same scaling and coarse flat quantizer to simulate low-bitrate encoding. The summed log-energy after quantization was the same for both transforms.

The Lapped Transform

A lapped transform is any transform where each block overlaps its neigboring blocks. Audio has used overlapping transforms since approximately forever; the best known example is the Modified Discrete Cosine Transform (MDCT) that powers MP3, Vorbis, AAC, and Opus. Many other transforms also fall under the 'lapped transform' definition (and it's important to point out Daala does not use the MDCT), so let's be more specific.

First, consider a deblocking filter that's invertible; a wide variety of such filters are possible, if not widely used. Next suppose that the inverse of the deblocking filter is applied before the forward transform as a pre-filter P, and the deblocking filter is applied after the inverse transform as a post-filter. If we use the DCT as our transform as above, the pre-filter + DCT is now perfectly inverted by the inverse DCT + post-filter.

Above: Illustration of one possible decomposition of a lapped transform, consisting of the DCT with pre-filters (P) and post-filters (P-1) straddling block boundaries.

The post-filter is similar to our deblocking/smoothing filter from earlier. The prefilter is the inverse, so it's literally the opposite; it purposely makes the input image very blocky by reducing the circular discontinuity at the edges. When paired with the DCT, this means it compacts more energy into the DC component by reducing energy leakage into higher-frequency bins.

Above: We construct a lapped transform from the DCT and spatial pre- and post-filtering. The post-filter has a deblocking function, merging block contents together seamlessly. The pre-filter does the opposite, giving the image an intentionally blocky appearence. The blockiness is a result of reducing spurious high frequency energy from the edge discontinuity, flattening the apparent content of each block and increasing average energy compaction.

The remaining goal is to find a specific P (and P-1) that, along with the DCT, forms a new lapped transform with the visual, coding, and computational properties we desire.

Lifting Implementation

Using spatial-domain lapping and lapping transforms for images and video is not a new idea, even if it's not mainstream. Papers on the subject date back to Malvar et. al. in the early 1990s, so we have a decent amount of preexisting research to draw upon.

Lifting filters are another idea that's not new, though they are relatively recent. Originally a product of wavelet transform research, the generalized idea is powerful: A lifting filter is comprised of sequence of serialized lifting steps also simply called lifts. Each step updates a single variable xi in place from an arbitrary function f() that does not depend on xi. That is:

The function f() is usually quite simple, e.g. just one other value scaled. Lifting steps are often arranged in alternating lifts where value xa is updated from value xb, and then value xb is updated from value xa. The literature often refers to the alternating lift steps as 'predict' (P) and 'update' (U or Q) steps as a result of their origin in second-generation wavelet transforms.

A lifting filter is always exactly reversible; the steps can simply be unwound in the opposite order.

Daala uses pre/post lifting filters based on the filters proposed by Tran, Liang, and Yu in Lapped Regularity-Constraind Pre- and Post-Filtering for Block DCT Based Systems. The prefilter P consists of forward and inverse lifting butterflies and transform matrices U and V. It turns out that U and V duplicate each others' degrees of freedom, so we set U to the identity matrix and eliminate it from the filter design.

Above: This illustration shows the proposed Daala pre- and post-filter structures as implemented for an 8-point lapped transform. The U matrix adds no additional degrees of freedom over the V matrix, thus we leave it out.

Research literature suggests several possible models for matrix V; the 'type-III' and 'type-IV' models allow easy lifting implementations that suit our desired transform structure. These simplified models allow selection and optimization of the parameters via nonlinear optimization techniques or even exhaustive search.

Above: Figure from Regularity-Constrained Pre- and Post-Filtering for Block DCT Based Systems (Tran et. al. 2001) illustrating two lifting-based implementation models for matrix V.

The DCT itself can also be implemented via lifting, and we've implemented one with both perfect reconstruction and orthonormal scaling. Placed together with the pre-filter, we can implement the entire forward lapped transform as a complete lifting structure.

Above: Daala's complete 8x16 forward lapped transform implemented as a monolithic lifting filter with a type-III V matrix. Mouse over the diagram to see the filter's decomposition into P, V and the forward DCT.

As any lifting filter is inherently invertible, constructing the inverse transform is a purely mechanical process. Lifting also makes it trivial to implement exact reconstruction as an integer transform. Quantization error carried through the forward transform is exactly reversed in lock-step fashion by the inverse transform. This can both dramatically reduce required numeric resolution (Daala's filters currently use six-bit coefficients) and also opens the possibility of fully lossless operation.

Coming Soon

In the next demo (update: now ready!), I'll continue exploring some of the interesting new (or at least, non-mainstream) techniques used in Daala. There's more to cover regarding implications of the lapped transform, and that leads nicely into frequency-domain intra prediction.