Overview

"Theora" is the Ogg video codec built from the VP3 codec that On2 released to the open source world in 2002. Although the original encoder is now maintained by Xiph.Org and has been updated to the Theora specification, it has not been improved over its historical coding efficiency level which is low by modern standards. The "Thusnelda" encoder project [source available in Xiph.Org SVN] addresses the inadequacies of the reference encoder by building a new, modern encoder incrementally from the inherited encoder.

Thusnelda is an incremental update project to avoid the problems faced by theora-exp and to some extent Dirac: Video coding is an inherently heuristic and highly tuned process where small but important changes can make large difference in the efficiency of an encoder. Writing a complete new encoder from scratch is a small task compared to the time required to then tune that new encoder into efficient operation. Incremental changes allow demonstrable, steady progress.

Update scope

A short list of progress milestones since the original Thusnelda encoder proposal includes:

  1. Code review for the existing Theora encoder is long complete.

  2. The inefficient block SAD pre-analysis code that attempted to determine which blocks to code/not code is eliminated; this function is best handled as an aspect of Rate/Distortion optimization in the coding loop.

  3. The 64-pass-per-frame (no, I'm not kidding) DCT token 'optimization' and packing loop is replaced by a single pass algorithm that also has somewhat improved coding efficiency (the old algorithm was not handling all EOB run corner cases).

  4. Multi-step, multi-pass coding which performed seperate complete passes for pre-analysis, mode selection, motion vector search, transform, tokenization, coding scheme selection and finally coding (with persistent 'temporary' storage linking each stage together) has been replaced with a 1-and-a-half pass loop. Unfortunately, chroma blocks must still have their own R/D coding loop due to a needlessly complex Hilbert-curve coding order that mandates different block coding order in each plane that can't be colocated.

  5. New mode selection code based on theora-exp. Rather than basing mode selection on hard thresholding of analysis parameters (ie, 'guessing alot, and always guessing the same way whether it works or not'), mode selection is now able to estimate bit usage for any mode choice, and makes decisions accordingly.

  6. New motion estimation and compensation code, based on algorithm presented in "Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation" by Alexis M. Tourapis.

For the most part, the extensive code restructuring throughout the encoder is not an improvement in its own right, but enables improvements. That said, it has had the effect of improving data colocation and so decreasing cache misses. In all, the Thusnelda encoder is under 60% the code size of the original Theora encoder at this point. Execution speed has improved, but the amount of improvement depends on the amount of motion in the video sequence. Thusnelda can be reasonably expected to be between 1.5x and 4x faster than the current mainline, despite the fact that Thusnelda currently encodes all blocks (on average, mainline codes less than half) and does not make as extensive use of assembly as the mainline.

Demonstration of improvements

The majority of the time and effort spent on the Thusnelda encoder since last update involves complete replacement of both the motion estimation and mode selection code.

Debugging telemetry

Theora playback frame with motion vector and macroblock debugging 'telemetry' active. The telemetry feature allows libtheora to write debugging information directly into the video frame output.

The decoder included with Thusnelda includes support to render debugging information into the output frame. This functionality is available through the theora_control call; see decoder_example.c for an illustration. This patch adds a -theoradopts command line option to mplayer for enabling debugging telemetry. The numeric option specifies a bitflag for which blocktypes to display; the block types are:

    0x01 == CODE_INTER_NO_MV
    0x02 == CODE_INTRA
    0x04 == CODE_INTER_PLUS_MV
    0x08 == CODE_INTER_LAST_MV
    0x10 == CODE_INTER_PRIOR_LAST
    0x20 == CODE_USING_GOLDEN
    0x40 == CODE_GOLDEN_MV
    0x80 == CODE_INTER_FOURMV
    
For example, mplayer -theoradopts vismv=0xff file.ogg would display all motion vectors, coded and implicit. mplayer -theoradopts vismv=0xff:vismbmode=0xdc file.ogg would display all blocks with a motion vector.

Mike Smith has provided a similar patch for GStreamer

Motion Estimation and Motion Vectors

The primary intent of the MV-specific work was to increase speed. Motion estimation in the mainline Theora encoder accounts for half or more of encoding time.

The previous generation of motion extimation techniques concentrated on optimizing exhaustive searches of the solution space. Because an exhaustive search is slow, these encoders typically performed a partial search first, and if the results of the partial search failed to reach a particaulr gain threshold, the encoder then performed an exhaustive search.

This strategy has the obvious drawback of inefficiency (either the answer is incomplete or expensize) and has a second less obvious drawback. Encoders rely heavily on reusing previously coded motion vectors (to save bits by not recoding a close match). An exhaustive search has no predisposition to prefer previously found vectors, or to prefer 'close matches'. more cost effective.

'Zonal Search' algorithms base motion vector searches on refinements of motion vectors from previous frames as well as neighboring blocks from the current frame. The new zonal search algorithm in Thusnelda always performs a complete search, but currently has a worst case speed roughly equal to Theora's best case. Thusnelda is roughly ten times faster than Theora's worst case. An additional optimization (see future work below) could potentially double performance again.

A quick example of old and new MV coding appears below. The 'embedded telemetry' versions of the videos are recoded from the originals with the debugging output included as part of the encoded file so that folks who don't want to install the vis patches just to enjoy this little demo don't need to. They're much larger that the originals, mainly to preserve the clarity of the debugging symbols.

Left: example of motion vectors in a frame of video from mainline Theora encode. The original video is here, and a version with embedded telemetry is here.

Left: example of motion vectors in a frame of video from the Thusnelda encoder. The original video is here, and a version with embedded telemetry is here.

Mode Selection

Optimization of block type selection (mode select) was primarily to increase coding efficiency. The new MV and mode selection code alone, compared directly to the algorithm used by mainline Theora, currently improves bitrate coding efficiency approximately 10%.

Left: example of mode selection in a frame of video from mainline Theora encode. The original video is here, and a version with embedded telemetry is here.

Left: example of mode selection in a frame of video from mainline Theora encode. The encoder in question here has been modified to code all blocks and disable ZeroBin, such that the only differences we're comparing between mainline Theora and Thusnelda are the motion and mode code. The original video is here, and a version with embedded telemetry is here.

Left: example of mode selection in a frame of video from the Thusnelda encoder. The original video is here, and a version with embedded telemetry is here.

Remaining work

Immediate (next two weeks)

  1. Bugfix: There is still a token packing bug in the most recent work. This is a run-of-the-mill bug, worth mentioning in case anyone reading this goes off to play with the new code. We know about it. Update: bug fixed in r14620

  2. Bugfix: The new motion estimation code often produces nonsensical motion vector results at frame boundaries when a pan is flowing video from the edge onto the screen. This is the primary reason that the 10%-ish efficiency coding gain of the new mode selection code often disappears for small frame sizes (eg, CIF). Update: bug fixed as of r14630

Above: The new motion estimation code incorrectly chooses motion vectors following the frame boundary when the scene is flowing into the frame from the edge.

  1. Contextual weighting of mode selection: The new mode selection code currently does not account for the coding penalties of breaking up runs of CODE_INTER_LAST_MV and CODE_INTER_PRIOR_LAST blocks, nor does it account for the savings of using an appropriate CODE_INTER_PLUS_MV block to start such a run.

  2. The halfpel refinement in the motion estimation code accounts for roughly half of the execution time of motion estimation, yet the results are most often tossed away (when a CODE_INTER_PLUS_MV block is not chosen). This is done such that a SAD metric equivalent to the measurements made on the other block types can be used during bit usage estimation when coding blocks. It is possible that an asymmetric estimation is possible, eliminating the wasted analysis and possibly increasing global coding speed by another 20-30%.

within 1.0 spec

  1. Rate-Distortion optimization: The code has been thoroughly rearranged at this point into a proper per-block, per-token rate/distortion optimization loop. What remains is to add the actual R-D decision making code.

  2. Make use of the multi-resolution quantization feature of the Theora spec; this allows us to use adaptive quantization within a frame.

  3. More sensible quantization matricies in general.

  4. Reenable rate management using the new R-D code once it's working.

beyond 1.0 spec

Although this wish list is substantially covered in the original proposal, here's an updated summary of the most immediately important aspects:

  1. Range coding / elimination of token engine.

  2. Minor rearrangement of coding order to improve cacahe coherence.

  3. Elimination of Hilbert Curve coding order to improve cache coherence.

  4. Lapped transforms.


Monty