Theora: Thusnelda project update 20080320
Overview
"Theora" is the Ogg video codec built from the VP3 codec that
On2 released to the open source world in 2002. Although the
original encoder is now maintained by Xiph.Org and has been
updated to the Theora specification, it has not been improved
over its historical coding efficiency level which is low by
modern standards. The "Thusnelda" encoder project [source
available in Xiph.Org SVN] addresses the inadequacies of
the reference encoder by building a new, modern encoder
incrementally from the inherited encoder.
Thusnelda is an incremental update project to avoid the
problems faced by theora-exp and to some extent Dirac: Video
coding is an inherently heuristic and highly tuned process
where small but important changes can make large difference in
the efficiency of an encoder. Writing a complete new encoder
from scratch is a small task compared to the time required to
then tune that new encoder into efficient operation.
Incremental changes allow demonstrable, steady progress.
Update scope
A short list of progress milestones since the original Thusnelda encoder proposal includes:
- Code review for the existing Theora encoder is long complete.
- The inefficient block SAD pre-analysis code that attempted
to determine which blocks to code/not code is eliminated;
this function is best handled as an aspect of
Rate/Distortion optimization in the coding loop.
- The 64-pass-per-frame (no, I'm not kidding) DCT token
'optimization' and packing loop is replaced by a single pass
algorithm that also has somewhat improved coding efficiency
(the old algorithm was not handling all EOB run corner
cases).
- Multi-step, multi-pass coding which performed seperate
complete passes for pre-analysis, mode selection, motion
vector search, transform, tokenization, coding scheme
selection and finally coding (with persistent 'temporary'
storage linking each stage together) has been replaced with
a 1-and-a-half pass loop. Unfortunately, chroma blocks must
still have their own R/D coding loop due to a needlessly
complex Hilbert-curve coding order that mandates different
block coding order in each plane that can't be colocated.
- New mode selection code based on theora-exp. Rather than
basing mode selection on hard thresholding of analysis
parameters (ie, 'guessing alot, and always guessing the same
way whether it works or not'), mode selection is now able to
estimate bit usage for any mode choice, and makes decisions
accordingly.
- New motion estimation and compensation code, based on
algorithm presented in "Enhanced Predictive Zonal Search for
Single and Multiple Frame Motion Estimation" by Alexis
M. Tourapis.
For the most part, the extensive code restructuring throughout the
encoder is not an improvement in its own right, but enables
improvements. That said, it has had the effect of improving data
colocation and so decreasing cache misses. In all, the Thusnelda
encoder is under 60% the code size of the original Theora encoder
at this point. Execution speed has improved, but the amount of
improvement depends on the amount of motion in the video sequence.
Thusnelda can be reasonably expected to be between 1.5x and 4x
faster than the current mainline, despite the fact that Thusnelda
currently encodes all blocks (on average, mainline codes less than
half) and does not make as extensive use of assembly as the
mainline.
Demonstration of improvements
The majority of the time and effort spent on the Thusnelda encoder
since last update involves complete replacement of both the motion
estimation and mode selection code.
Debugging telemetry
Theora playback frame with
motion vector and macroblock debugging 'telemetry' active.
The telemetry feature allows libtheora to write debugging
information directly into the video frame output.
The decoder included with Thusnelda includes support to render
debugging information into the output frame. This functionality is
available through the theora_control call; see
decoder_example.c for an illustration.
This patch adds a -theoradopts
command line option to mplayer for enabling debugging
telemetry. The numeric option specifies a bitflag for which
blocktypes to display; the block types are:
0x01 == CODE_INTER_NO_MV
0x02 == CODE_INTRA
0x04 == CODE_INTER_PLUS_MV
0x08 == CODE_INTER_LAST_MV
0x10 == CODE_INTER_PRIOR_LAST
0x20 == CODE_USING_GOLDEN
0x40 == CODE_GOLDEN_MV
0x80 == CODE_INTER_FOURMV
For example, mplayer -theoradopts vismv=0xff file.ogg
would display all motion vectors, coded and implicit. mplayer
-theoradopts vismv=0xff:vismbmode=0xdc file.ogg would display
all blocks with a motion vector.
Mike Smith has provided a similar
patch for GStreamer
Motion Estimation and Motion Vectors
The primary intent of the MV-specific work was to increase
speed. Motion estimation in the mainline Theora encoder accounts
for half or more of encoding time.
The previous generation of motion extimation techniques
concentrated on optimizing exhaustive searches of the solution
space. Because an exhaustive search is slow, these encoders
typically performed a partial search first, and if the results of
the partial search failed to reach a particaulr gain threshold,
the encoder then performed an exhaustive search.
This strategy has the obvious drawback of inefficiency (either
the answer is incomplete or expensize) and has a second less
obvious drawback. Encoders rely heavily on reusing previously
coded motion vectors (to save bits by not recoding a close
match). An exhaustive search has no predisposition to prefer
previously found vectors, or to prefer 'close matches'. more cost
effective.
'Zonal Search' algorithms base motion vector searches on
refinements of motion vectors from previous frames as well as
neighboring blocks from the current frame. The new zonal search
algorithm in Thusnelda always performs a complete search, but
currently has a worst case speed roughly equal to Theora's best
case. Thusnelda is roughly ten times faster than Theora's worst
case. An additional optimization (see future work below) could
potentially double performance again.
A quick example of old and new MV coding appears below. The
'embedded telemetry' versions of the videos are recoded from the
originals with the debugging output included as part of the
encoded file so that folks who don't want to install the vis
patches just to enjoy this little demo don't need to. They're
much larger that the originals, mainly to preserve the clarity of
the debugging symbols.
Left: example of motion vectors
in a frame of video from mainline Theora encode. The
original video is here,
and a version with embedded telemetry
is here.
Left: example of motion
vectors in a frame of video from the Thusnelda encoder. The
original video
is here, and a version
with embedded telemetry
is here.
Mode Selection
Optimization of block type selection (mode select) was primarily
to increase coding efficiency. The new MV and mode selection code
alone, compared directly to the algorithm used by mainline Theora,
currently improves bitrate coding efficiency approximately 10%.
Left: example of mode
selection in a frame of video from mainline Theora encode. The
original video is here, and a
version with embedded telemetry is here.
Left: example of
mode selection in a frame of video from mainline Theora encode.
The encoder in question here has been modified to code all blocks
and disable ZeroBin, such that the only differences we're
comparing between mainline Theora and Thusnelda are the motion and
mode code. The original video is here, and a version with
embedded telemetry is here.
Left: example of
mode selection in a frame of video from the Thusnelda encoder.
The original video is here, and a version with
embedded telemetry is here.
Remaining work
Immediate (next two weeks)
- Bugfix: There is still a token packing bug in the most
recent work. This is a run-of-the-mill bug, worth mentioning in
case anyone reading this goes off to play with the new code. We
know about it. Update: bug fixed in r14620
- Bugfix: The new motion estimation code often produces
nonsensical motion vector results at frame boundaries when a pan
is flowing video from the edge onto the screen. This is the
primary reason that the 10%-ish efficiency coding gain of the
new mode selection code often disappears for small frame sizes
(eg, CIF). Update: bug fixed as of r14630
Above: The new motion
estimation code incorrectly chooses motion vectors following the
frame boundary when the scene is flowing into the frame from the
edge.
- Contextual weighting of mode selection: The new mode
selection code currently does not account for the coding
penalties of breaking up runs of CODE_INTER_LAST_MV and
CODE_INTER_PRIOR_LAST blocks, nor does it account for the
savings of using an appropriate CODE_INTER_PLUS_MV block to
start such a run.
- The halfpel refinement in the motion estimation code
accounts for roughly half of the execution time of motion
estimation, yet the results are most often tossed away (when a
CODE_INTER_PLUS_MV block is not chosen). This is done such that
a SAD metric equivalent to the measurements made on the other
block types can be used during bit usage estimation when coding
blocks. It is possible that an asymmetric estimation is
possible, eliminating the wasted analysis and possibly
increasing global coding speed by another 20-30%.
within 1.0 spec
- Rate-Distortion optimization: The code has been thoroughly
rearranged at this point into a proper per-block, per-token
rate/distortion optimization loop. What remains is to add the
actual R-D decision making code.
- Make use of the multi-resolution quantization feature of
the Theora spec; this allows us to use adaptive quantization
within a frame.
- More sensible quantization matricies in general.
- Reenable rate management using the new R-D code once it's
working.
beyond 1.0 spec
Although this wish list is substantially covered in the original proposal, here's an updated summary
of the most immediately important aspects:
- Range coding / elimination of token engine.
- Minor rearrangement of coding order to improve cacahe coherence.
- Elimination of Hilbert Curve coding order to improve cache coherence.
- Lapped transforms.
Monty