The last demo page mentioned that Daala's use of lapped transforms affects multiple aspects of the codec's structure. Classic intra-frame prediction is one example of a Catch-22 caused by lapping: We can't predict from neighboring pixels that have not yet been fully decoded, and we can't finish decoding these pixels until we lap them with the current block, which of course can't be decoded until after prediction.

To unravel this circular dependency, Daala predicts from neighboring frequency coefficients rather than neighboring pixels. This strategy also makes use of a larger amount of prediction information, as it predicts from several full blocks of coefficients, rather than a thin strip only one or two pixels wide.

As we code (and decode) the frames of a video block by block, we
try to predict the contents of each block as accurately as
possible from previously decoded information; the closer we come
to perfect prediction, the fewer bits we need to code the
remaining difference. Prediction falls into two major categories:
**Inter**-prediction, which predicts the
contents of a block based on image data from previous frames,
and **Intra**-prediction, which predicts the contents of a block
using only previously decoded data from within the current frame.

The simplest form of intra-prediction only predicts the DC coefficient of each block. Block-based image codecs have used DC prediction from the very beginning, and because the DC coefficient accounts for most of a typical block's energy, even a simple DC predictor (such as reusing the previous block's DC value) dramatically improves coding efficiency. Of course, more complex DC predictors can yield even better performance.
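As a minimal sketch of the idea (hypothetical helper names, not Daala's code): the simplest DC predictor reuses the previous block's DC value, so only the difference needs to be coded.

```python
# A toy sketch of simple DC prediction: each block's DC coefficient is
# predicted from the previously decoded block's DC, and only the
# (typically small) difference is coded.

def dc_residuals(dc_values):
    """Return the DC residuals to code, predicting each DC from the previous one."""
    predicted = 0  # assume the first block predicts from zero
    residuals = []
    for dc in dc_values:
        residuals.append(dc - predicted)
        predicted = dc  # reuse this block's DC to predict the next one
    return residuals

def dc_reconstruct(residuals):
    """Invert dc_residuals: recover the DC values from the coded residuals."""
    predicted = 0
    out = []
    for r in residuals:
        dc = predicted + r
        out.append(dc)
        predicted = dc
    return out
```

Because neighboring blocks usually have similar brightness, the residuals cluster near zero and cost far fewer bits than the raw DC values.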

In addition to DC, AVC/H.264 introduced directional intra
prediction, in which the predictor extends the pixel values of a
surrounding one-pixel-wide strip into the current block in one
of eight signaled directions; together with DC prediction, these
form the codec's *coding modes*. VP8 adds a tenth 'TrueMotion'
mode (so named for historical reasons; it actually predicts
gradients). HEVC extends the prediction mechanism further, adding
more modes, progressive prediction, and prediction from a larger
two-pixel-wide strip.
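To make the directional idea concrete, here is a toy sketch of the simplest such mode, vertical extension (illustrative only; the function name and layout are not from any codec's source):

```python
# A toy sketch of spatial directional prediction in the style of
# AVC/H.264's vertical mode: the one-pixel-wide strip of decoded pixels
# directly above the block is extended straight down into the block.

def predict_vertical(top_row, size):
    """Return a size x size predicted block, each row a copy of top_row."""
    return [list(top_row[:size]) for _ in range(size)]
```

The other directional modes work the same way, extending the strip along a diagonal or horizontal direction instead.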

Lapped transforms complicate the design of a successful spatial-domain directional predictor like those used in these other codecs. Directional intra-prediction relies on neighboring pixels that have already been decoded, but lapping complicates the idea of what 'neighboring' really means. In addition, neighboring blocks themselves aren't fully decoded until the lapped values from decoding the current block are added in. To decode the current block, we need to predict the current block, and predicting the current block lands us back at the beginning again. We need to break this circular dependency.

There are several obvious ways to relax the prediction constraints here; however, most of them damage predictor efficiency. This matters because the efficiency advantage of a directional predictor can be tenuous to begin with: despite having eight directional prediction modes available, good H.264 encoders (such as x264) mostly use the DC-prediction mode in practice, as it's usually still the most efficient.

A damaged directional/spatial predictor may be worse than no predictor at all.

Daala predicts frequency coefficients in the frequency domain, rather than pixels in the spatial domain. Frequency coefficients are not shared between blocks, allowing us to keep the classic definition of 'neighboring' as well as avoiding any circular decoding dependencies. In addition, Daala's frequency-domain prediction draws from a larger input data set consisting of up to four complete blocks. In theory, the additional degrees of freedom could give us a more efficient predictor.

The frequency domain predictor is simple. We arrange the coefficients of the four neighboring, already-decoded blocks into a single vector. This vector is multiplied by a single linear predictor column vector to produce each predicted coefficient. We use a different linear predictor for each predicted coefficient, and so the complete block prediction is a matrix multiplication of the coefficient vector by a linear prediction matrix in which each column is one output coefficient's predictor.
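The scheme above can be sketched in a few lines of Python (all names are hypothetical and the layout is illustrative; Daala's actual matrices differ): stack the four neighbors' coefficients into one vector, then take its dot product with each column of the prediction matrix.

```python
# A sketch of the frequency-domain predictor described above. The
# coefficients of the four neighboring, already-decoded blocks are
# concatenated into a single vector x; predicted coefficient c is the
# dot product of x with column c of the prediction matrix P.

def predict_block(neighbors, P):
    """neighbors: list of 4 coefficient blocks, each a flat list of N coeffs.
    P: matrix with 4*N rows and N columns (P[r][c] = weight of input
    coefficient r for output coefficient c).
    Returns the N predicted coefficients for the current block."""
    x = [c for block in neighbors for c in block]  # stack into one vector
    n_out = len(P[0])
    return [sum(x[r] * P[r][c] for r in range(len(x))) for c in range(n_out)]
```

Each output coefficient gets its own column, so the whole block prediction is one matrix-vector multiplication.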

There are a few things worth noting in this scheme. First, if we were not using a lapped transform, we could construct a frequency-domain predictor exactly equivalent to the spatial predictor; it would then be guaranteed to perform no worse than the spatial predictor. Unfortunately, we have no such guarantee in the lapped case; the frequency-domain results may be better or worse than the spatial case. Second, the matrix multiplication would appear to be large and expensive for a video codec. However, frequency-domain data tends to be sparse; that's why we code in the frequency domain to begin with. We expect the predictors to be sparse as well.
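If the predictors are indeed sparse, the dense matrix multiplication collapses to a short list of taps per output coefficient. A hypothetical sketch (the tap-list representation here is illustrative, not Daala's actual data structure):

```python
# Applying a sparsified predictor: instead of a dense matrix, store only
# the nonzero taps, as a map from output coefficient index to a list of
# (input index, weight) pairs.

def predict_sparse(x, taps):
    """x: stacked neighbor coefficients (one flat list).
    taps: dict mapping output coefficient index -> [(input index, weight)].
    Returns the predicted coefficients, zero where no taps are stored."""
    n_out = max(taps) + 1
    out = [0.0] * n_out
    for c, pairs in taps.items():
        out[c] = sum(x[i] * w for i, w in pairs)
    return out
```

The cost is now proportional to the number of nonzero taps rather than the full matrix size.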

Given the structure of the frequency-domain predictor, we still
have the problem of finding a suitable matrix **P** for each
mode in each blocksize. For purposes of proving this crazy
scheme works, we tried something relatively easy first: a
machine-training algorithm that turns out to be a kind of k-means
clustering.

Assume we want to train ten coding modes, using the ten VP8 prediction modes as a starting point. The first pass of our training algorithm classifies each block of a given size according to the starting VP8 coding mode with the lowest L1 (summed absolute) error in the frequency domain. We then compute a predictor for each output coefficient of each mode via a least-squares fit to the frequency-domain coefficients across the complete training set. This yields an initial set of frequency-domain intra predictors.

We then *repeat* this process using our newly trained
predictors to classify the input blocks rather than the initial
VP8 predictors. We continue iterating, using each run's new
predictors to classify blocks for the next iteration.
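The classification half of each training iteration can be sketched as below (the per-coefficient least-squares refit is omitted here, so this is a skeletal illustration of the loop structure, not the actual training code):

```python
# One half of a k-means-style training iteration: assign each training
# block to the mode whose current predictor gives the lowest L1 error in
# the frequency domain. (The other half, refitting each mode's predictor
# by least squares over its assigned blocks, is omitted in this sketch.)

def l1_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual))

def classify(blocks, predictors):
    """blocks: list of (neighbor_coeffs, block_coeffs) training pairs.
    predictors: list of callables, one per mode, mapping neighbor
    coefficients to predicted block coefficients.
    Returns the mode index assigned to each block."""
    labels = []
    for neighbors, coeffs in blocks:
        errs = [l1_error(pred(neighbors), coeffs) for pred in predictors]
        labels.append(errs.index(min(errs)))
    return labels
```

Alternating classification and refitting in this way is exactly the structure of k-means clustering, with modes playing the role of clusters.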

After several such non-sparse iterations, we then sparsify the
predictor; in the 4x4 case, we drop the 16 coefficients
from **P** that contribute least to coding gain. Then we
resume iterating. The numbers 10 and 16 here are obviously rather
arbitrary; we're still exploring the solution space. However,
we've now produced some early 4x4 and 8x8 predictors that prove
the basic concept is workable (due to computational complexity,
we've not been able to spend adequate time training the 16x16
case, so our 16x16 predictors are still loltastic).

Unlike H.264, VP8 or other codecs, the resulting modes don't really 'mean anything'; they're just the results of machine training. That said, the modes do show some recognizable spatial qualities. Some interesting training points:

- Overall, the predictors remain strongly directional throughout the training process.
- Even when we begin with randomized starting conditions, the predictors converge to directionality. However, beginning with directional starting predictors yields more strongly directional final predictors with higher coding gain. Starting conditions do matter to some extent.
- VP8's 'TrueMotion' predictor (which excels at gradients, and thus would be replicated by DC prediction with a lapped transform) turns into something different entirely; it becomes a texture replicator, duplicating the texture of the neighboring blocks shifted down and to the right by one-half block.

The basic visual qualities of Daala's prediction are somewhat different from other codecs. Spatial directional prediction achieves low error in the spatial domain, but tends to overlay a strongly directional texture on top of whatever natural texture exists in the image. As a result, the residual is usually a mixture of the multiple spatial textures with much of the error in high frequencies. Daala optimizes prediction error in the frequency domain, so the resulting residual tends to look more like a mask of the original image. It remains to be seen if this difference is a perceptual advantage, liability, or neither.

[Figure: Predictor Output (left) and Prediction Error (right)]

A complete prediction system must account for a number of logistical complications that we've ignored so far. These fall into two basic categories: Missing data and mixed block-sizes.

In a given coding scheme, we're not always going to have four full neighboring blocks available for prediction. Accounting for all possible factors and considering multiple potential coding orderings, we're going to be missing some information approximately 1/3 of the time. Proofing against missing information can be built into the training; by using training conditions as close as possible to actual encoding, the resulting predictors can cope with missing information. We simply ignore the blocks that don't exist (or potentially use alternate predictors trained for that case).
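One simple way to cope with a missing neighbor, as described above, is to feed the predictor all-zero coefficients for blocks that don't exist, so the trained taps on those inputs contribute nothing (a sketch; as noted, alternate predictors trained for the missing-data case are another option):

```python
# Stacking neighbor coefficients when some neighbors are missing:
# absent blocks (None) are replaced by all-zero coefficient blocks so
# the predictor's taps on those inputs simply contribute nothing.

def stack_neighbors(neighbors, block_size):
    """neighbors: list of 4 coefficient blocks, with None for blocks
    that don't exist. Returns the stacked input vector of length
    4 * block_size."""
    zero = [0.0] * block_size
    return [c for block in neighbors
            for c in (block if block is not None else zero)]
```

Training with the same missing-data conditions the encoder will actually see lets the fitted predictors account for the zeroed inputs.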

Multiple blocksizes also complicate prediction. We cannot have a predictor set for every possible neighboring blocksize combination; the combinatorics grow out of control. Thus we need to determine the best way to homogenize prediction when using multiple blocksizes, even in the case where neighboring blocksizes are mixed.

There are a couple different strategies for this, but they all
rely on being able to alter blocksizes after the fact, something I
haven't mentioned yet. To do this, Daala borrows a technique from
Opus for combining and splitting transformed blocks that we
refer to as *Time-Frequency Resolution Switching*,
or *TF* for short.

That's for the next demo (now posted), so stay tuned :-)

—Monty (monty@xiph.org) July 23, 2013

16x16 prediction images updated October 5, 2013

- First and foremost: The Daala Project Homepage
- Daala development code is available from our working repository.
- Dr. Timothy Terriberry (lead of the Daala Project) gave a long and excellent talk introducing Daala and video coding in general to Mozilla in 2012. Slides 38 through 45 cover much of the same ground as this demo with considerably more interesting technical asides albeit less text: Introduction to Video Coding [slide deck]
- Join our development discussion in #daala at irc.freenode.net (→web interface)
- J. Hu and J. D. Gibson: __New Rate Distortion Bounds for Natural Videos Based on a Texture Dependent Correlation Model.__ In Proc. 46th Allerton Conference on Communication, Control, and Computing, pp. 996-1003, Sep. 2008.
- J. Han, A. Saxena, and V. Melkote: __Jointly Optimized Spatial Prediction and Block Transform for Video and Image Coding.__ IEEE Transactions on Image Processing (pre-print), Sep. 2011.