next generation audio: CELT update 20101223

Overview

The Vorbis codec is more than halfway through its approximate intended lifetime of 20 years or so, and the state of the art in audio coding has improved considerably since Vorbis's introduction. Xiph has been developing two new, next-generation codecs (Ghost and CELT) as successors to Vorbis. Ghost research was postponed until recently to devote more resources to improving video (see the 'Thusnelda' and 'Ptalarbvorm' encoders), but Jean-Marc Valin of Xiph's Speex project has been able to continue work on CELT since late 2007. As of December 2010, CELT is nearing bitstream freeze and has been submitted to the IETF codec working group as an input codec.

The latest version of the CELT reference library implementation can always be downloaded from http://celt-codec.org/downloads.

What is CELT?

CELT is a general purpose, low-delay codec intended for similar use and performance cases as Vorbis, but with the additional features of very low delay and low CPU/memory requirements. CELT supports stereo, can achieve a total algorithmic delay as low as 5ms, scales well to lower bitrates than Vorbis, and currently provides superior audio fidelity to Vorbis on many if not most natural audio inputs. From 24kbps through 64kbps 48kHz stereo, it is comparable quality to HE-AAC v1 and provides considerably higher audio fidelity than AAC-LD with equal or much lower delay.

By way of general feature summary:

Headerless
Arbitrary sampling rate
Mono/stereo encoding
Fixed-point encode and decode
VBR/constrained VBR/true CBR
Very low delay (arbitrary delay, 5ms minimum total latency at 48kHz)
low-bitrate performance ('sweet spot' >= 32kbps for 48kHz stereo)
flexible streaming with the ability to change most codec parameters mid-stream (so long as changes are signaled; this information is not in-band in CELT, though it is in-band for OPUS. More about OPUS later).
royalty-free, no licensing required, BSD reference code

While primarily targeted at packet oriented networks, CELT includes a number of design features that increase robustness to bit errors, also making it suitable for non-IP wireless applications.

Why low delay?

The minimum algorithmic delay for a typical Vorbis encoding mode is over 100ms; in current encoders it is actually considerably higher (between 200 and 400ms). This delay is typical of other general-purpose audio codecs (mp3, AAC, etc) not intended for realtime telepresence applications. Low delays, typically 20ms or ideally much less, are a hard requirement for applications such as collaborative music. Although speech applications typically tolerate latencies of around 100ms, even here lower latencies can make interaction more natural and less stressful for the speaker and listener.

Sample 1: 250ms delay between performers

Sample 2: 15ms delay between performers

These samples illustrate the difficulties caused by latency in realtime communication. Two participants attempt to sing 'Row, Row, Row Your Boat' together over telepresence links with latencies of 250ms (above) and 15ms (below). High latencies make the task fairly difficult.

This is similar to the 'cell phone collisions' many cell phone users experience when the lower but still significant (~100ms) delays over a cell phone link result in both speakers repeatedly beginning to speak at the same time, both stopping to let the other speaker continue, and then both beginning to speak again resulting in another collision. Lather, rinse, repeat.

Since low delay is clearly a desirable trait in a codec, the obvious question would be, "why not design every codec to be low latency?" Unfortunately, low latency impacts codec efficiency. High latencies allow considerably higher energy compaction (and thus coding efficiency) as well as deeper analysis of input signals. Low latency design is more difficult for a comparable level of bitrate performance.

CELT Design Overview

Simplified CELT block diagram

"CELT" stands for "Constrained Energy Lapped Transform" an accurate and remarkably unforced acronym. It is exactly that: A lapped transform codec with a psychoacoustic design philosophy based on band-energy preservation.

Lapped transform

CELT is a lapped transform-domain codec like Vorbis and AAC, however it uses quite short windows with low overlap to achieve low latency.

Despite the small windows and short overlap, transient pre-echo suppression occasionally demands yet shorter windows. In this case, the frame is split and smaller MDCTs are done on each piece. The results are then interleaved and coded as normal.

Critical bands

The spectral lines produced by each transform are grouped and coded by critical band. This both holds coding noise within critical bands and also provides approximately correct band energy resolution.

Constrained Energy

The single most important new discovery in Vorbis was that preserving narrowband energy produces far superior results to earlier techniques that attempted to globally minimize quantization noise. This was a relatively late discovery in the Vorbis project, and although it was easy enough to add energy preservation to the Vorbis encoder ('Noise Normalization'), Vorbis did not incorporate energy preservation as an inherent design concept.

CELT's design assumes unity narrowband energy gain throughout. The absolute energy of each band is explicitly coded, and every entropy-backend codeword also encodes unity energy. Critical band spectral energy and the coarse shape of the spectral envelope is thus preserved no matter what.

Variable Time-Frequency Resolution

In addition to increasing time resolution via frame splitting, CELT can also further adjust time/frequency resolution by performing Hadamard transforms in one band. A forward Hadamard transform over several blocks increases frequency resolution and an inverse transform in one block increases time resolution (though with more temporal leakage than via frame splitting). TF adjustment is signaled per band and used to further bias a frame toward more accurately encoding tonal or transient content.

norm-energy PVQ range encoder

CELT encodes energy in each band explicitly. The spectral residue of each band is quantized as a whole band using a fixed number of spectral energy 'pulses' (K). These pulses are amplitude (not energy) quanta that total an amplitude of 1.; each pulse represents an amplitude of 1./K. Each codeword is thus an N-dimensional vector of integer magnitudes that sum to K. The codeword space is obviously countable, representing points on the surface of an N-1 orthoplex (the dual of a hypercube; a 3 dimensional orthoplex is an octahedron).

The astute reader will notice that in the above explanation, each codeword represents a fixed summed amplitude, not a fixed energy. The orthoplex is warped such that the energy of each vector is normalized to an energy of 1., inflating the vectors to points on an N-1 sphere. The direction of each vector is not altered, resulting in higher resolution at the 'poles'. In this form, the codewords also turn out to have approximately flat probability, eliminating the need for entropy encoding of residual data.

The specific implementation of this coding technique used by CELT is known as Pyramid Vector Quantization (Fischer, 1986) The design neatly sidesteps any need for Vorbis-like codebooks in residue coding.

Pulse Spreading

If too much diffuse energy in a band collapses into just a few pulses due to very low bitrate coding, this causes the classic swirling/metallic artifacts typical of transform codecs. These artifacts are mostly associated with mp3, which has the least ability to mitigate the problem.

Following the equation above, spectral collapse happens primarily when K is small, causing an audibly sparse spectrum. Spreading essentially jitters pulses around as a kind of spectral dither; if and when low-bitrate encoding collapses a noisy spectrum into just a few pulses after the forward spreading filter, the inverse filter in the decoder 'unjitters' the collapsed energy, spreading it back out across the narrowband spectrum.

It might not be obvious from the description above, but spreading is purely a forward/inverse filtering operation; there is no additional side information transmitted. The only additional signaling is whether folding is enabled or disabled.

Band Folding

This has a similar effect to Spectral Band Replication (SBR), except that we don't replicate bands, we just reuse residue codewords from lower bands, reconstituted in the context of the encoded energy of the higher band. Much simpler in concept and execution than SBR, a lucky break of the PVQ design.

Left: an illustration of constrained energy, pulse spreading, and band folding in action. The spectrogram shows the original mono sample, MP3 at 32kbps, Vorbis at 32kbps and CELT at 32kbps. Vorbis also preserves band energy but does not have the additional spreading and folding mechanisms.

Click on the label to show the spectrogram for each codec.

pre/postfilter

Due to the relatively poor frequency resolution resulting from CELT's very short windows, encoding strongly tonal content is challenging in ways atypical of a transform codec. Strongly tonal content requires additional techniques over straight transform/quantize for efficient coding.

As part of the CELT work going on within the IETF codec working group, Raymond Chen of Broadcom submitted a technique to weight tonal content for more efficient encoding using pitch prediction and a matched comb filter. In the encoder, the comb filter is used to weight the input signal toward the tonal content. In the decoder, the inverse filter reverses the weighting.

This technique has a few clever advantages. It wraps the preexisting CELT encoder/decoder and so the additional complexity is compartmentalized. In addition, the comb filter weights the entire harmonic structure rooted in the fundamental, not just the fundamental frequency itself.

The primary disadvantage of the technique is that it requires better than typical pitch detection. In speech codecs like Speex, pitch halving or doubling affects the efficiency of coding prediction but does not have serious consequences for encoded quality. In CELT, however, this would potentially generate phantom harmonics or drop the even harmonics of the fundamental, both obviously audible problems. As such, CELT requires (and of this writing has) a more reliable pitch predictor than typical.

This technique also currently only applies to a single fundamental. At present, Vorbis still eats CELT's lunch on strongly tonal polyphonic samples.

synthetic pseudorandom noise

Recently, CELT is also able to signal that a band be filled with pseudorandom noise. Bands of pure noise (without substantial features in either frequency or time) is a more common occurrence in real-world audio than most people realize.

CELT timeline

The first very early next-generation audio codec work to develop a successor to Vorbis began at Xiph in 2005. In 2007 I began more directed research on a new codec (Ghost) along with Jean-Marc Valin of Xiph's Speex project. He felt strongly that low-delay was an important feature in a new codec. At that time, we didn't reconcile the low-delay requirements with the filterbank topology I wanted to use in Ghost.

Shortly after, I decided that improving the Theora encoder was a much more pressing concern than next-gen audio development and turned my attention to the Thusnelda encoder. Jean-Marc, however, was free to continue working on the low-delay codec (CELT) that was born out of those 2007 meetings. CELT is currently nearing completion.

29-Nov-2007: Initial commit
08-Dec-2007: CELT 0.0.1 release
10-Dec-2007: Implement joint stereo
09-Jan-2008: Implement M/S stereo
15-Jan-2008: CELT 0.0.2 release
03-Feb-2008: CELT 0.1.0 release
11-Feb-2008: Implement initial band folding
22-Feb-2008: CELT 0.2.0 release
19-Mar-2008: CELT 0.3.0 release
26-Apr-2008: CELT 0.3.1 release
16-May-2008: CELT 0.3.2 release
23-May-2008: Implement intensity stereo
16-Jun-2008: Implement time envelopes
17-Jun-2008: Implement short blocks
26-Jul-2008: CELT 0.4.0 release
10-Oct-2008: CELT 0.5.0 release
17-Dec-2008: CELT 0.5.1 release
17-Feb-2009: CELT 0.5.2 release
21-May-2009: Implement VBR
04-Jul-2009: CELT 0.6.0 release

13-Jul-2009: CELT 0.6.1 release
26-Oct-2009: CELT 0.7.0 release
20-Jan-2010: CELT 0.7.1 release
21-Feb-2010: Add SILK hybrid mode hooks for OPUS
07-May-2010: Land frame-to-frame dynamic framesize
21-May-2009: Move band splitting into quant instead of after PVQ
27-May-2010: Adaptive Time/Frequency resolution via Hadamard
02-Jul-2010: CELT 0.8.0 release
08-Jul-2010: CELT 0.8.1 release
24-Jul-2010: Implement pseudorandom noise synthesis
05-Aug-2010: Remove old pitch prediction
30-Sep-2010: Implement dynamic bit allocation [tilt/boost]
18-Oct-2010: Eliminate MDCT weight and time envelope
04-Nov-2010: Implement comb filter (off by default)
06-Nov-2010: CELT 0.9.0 release
08-Nov-2010: CELT 0.9.1 release
05-Dec-2010: Implement unconstrained VBR
21-Dec-2010: CELT 0.10.0 release
January-2011: Projected bitstream freeze

Abandoned techniques

In the interest of documenting several dead-ends...

pitch-prediction/warping

Early CELT used a pitch prediction/warping scheme to try to improve tonal coding that was completely different from the current comb filter approach. It was complex, expensive, and of marginal effectiveness. It was finally removed from the code November 9th, 2009.

The original technique is described in detail in the original CELT paper. To summarize, the old pitch prediction searched backwards in time through previously seen data, using cross-correlation to search for a candidate match. The idea behind this technique was that the correlation search would find both the period of the fundamental as well as a preceding window that predicted the spectrum of the current window well. Although the short window of CELT prevents resolving harmonics clearly, the theory was that finding a preceding match would also have a similar harmonic spread and thus also provide a good predictor for the unresolvable harmonics in the current frame as well.

Unfortunately, results were mixed; although the predictor was better than nothing in many cases, its utility could best be described as marginal, and it was decided it was certainly not worth the computational cost.

MDCT weighting

Short blocks in CELT are coded as if they're a normal frame with spectral values from each short MDCT interleaved to produce a full frame's worth of spectral data; the data is encoded as if it were the product of a single MDCT. Because short blocks are used only for impulsive frames, the energy levels of of the data from each MDCT may vary by quite a large amount. For a period of time, CELT was able to weight the MDCTs of individual short blocks to equalize the energies from before and after the impulse. The idea sounds obviously useful, but proved to be of dubious utility (and required signaling bits to use). MDCT weighting has been disabled for some time and was finally dropped from the code on October 18, 2010.

Time envelope

The time envelope was an attempt to equalize the energy before and after impulses within a block as a means of controlling/preventing pre-echo. Initially they were an attempt to avoid short blocks, and then an attempt to augment short blocks. The technique failed for several reasons, as had similar attempts early in the development of Vorbis. The primary problems with using time domain envelopes to control pre-echo were:

The spectral shape preceding and following transients is typically very different; simply normalizing energy does not prevent the dissimilar spectrums from bleeding together. In short, pre-echo still happens.
The audible effects of time envelopes even when triggered correctly could sound almost as bad as the pre-echo it tried to prevent. The envelope adjustment tended to cause dropouts preceding impulses, a curious anti-pre-echo that sounded just as incorrect as pre-echo itself.

Demonstration samples

Here I present a few canned CELT demos, decoded to WAV so that it is possible to evaluate and compare without needing to build or install myriad old versions of the CELT decoder. Feel free to download the samples and use a comparison application (such as Xiph.Org's squishyball comparison/testing utility, pictured left) for casual or more rigorous comparison.

The original uncompressed sample can be downloaded here. The complete set of demo samples below is directly downloadable here. These encodes are remarkably low-rate for a general purpose low-latency codec; I've done this on purpose to make the differences easy to hear.

Be aware that Firefox has a playback bug that might cause clicking during playback. It's a browser problem, not part of the samples, and it's fixed in the 4.0 beta prereleases.

A quick showcase of the bleeding edge...

The usual, fairly boring but well known compilation test sample encoded at varying bitrates and latencies, using CELT 0.10.0.

CELT 0.10.0: 22.5ms total latency, various bitrates

CELT 0.10.0 @ constant PEAQ value, varying latency

Playback: [ uncompressed | 22.5ms (64kbps) | 12.5ms (70.4kbps) | 7.5ms (84.8kbps) | 5ms (112kbps) ]

Download: [ uncompressed | 22.5ms (64kbps) | 12.5ms (70.4kbps) | 7.5ms (84.8kbps) | 5ms (112kbps) ]

Improvement of CELT over time

The same sample encoded by major releases of CELT over the past three years. Note that the earliest versions of CELT supported a maximum frame size of 5ms (7.5ms total delay) so some of the improvement in the quality of later releases is a direct result of supporting larger windows. Later samples use 20ms windows (22.5ms total delay).

CELT @ 64kbps, best available quality, various versions

Playback: [ 0.1.0 | 0.2.0 | 0.3.2 | 0.4.0 | 0.5.2 | 0.6.1 | 0.7.1 | 0.8.1 | 0.9.1 | 0.10.0 (latest) | uncompressed ]

Download: [ 0.1.0 | 0.2.0 | 0.3.2 | 0.4.0 | 0.5.2 | 0.6.1 | 0.7.1 | 0.8.1 | 0.9.1 | 0.10.0 (latest) | uncompressed ]

CELT as compared to AAC [LC/HE/HEv2/LD] and Vorbis

CELT's primary initial design goal was low-latency operation, targeting the same approximate niche as AAC's 'LD' (Low Delay) profile. CELT offers total latencies of 5ms through 22.5ms, where AAC-LD offers a minimum total latency of down to ~20ms.

As CELT's bitrate performance improved, however, it also became natural to compare it to high-latency general purpose codecs such as AAC-LC (intended for low complexity, >100ms latency), HE-AAC v1 and v2 (intended for very low bitrates, latency >200ms) and Vorbis (general purpose, latency >200ms).

The only AAC-LD encoder I could find (Quicktime Pro) offers down to 64kbps for 48kHz stereo, so it does not appear in the less-than-64kbps comparisons below.

CELT, AAC and Vorbis @ 64kbps 48kHz stereo

CELT, AAC and Vorbis @ 48kbps 48kHz stereo

CELT, AAC and Vorbis @ 32kbps 48kHz stereo

CELT, SILK, Opus and the IETF codec WG

Working with Skype and others within the IETF, Xiph drove the creation of a working group to produce a royalty-free codec for general-purpose internet usage, including telepresence. Several codecs have been submitted as working material, including CELT. A combination of CELT and Skype's SILK codec has been adopted as the primary development target of the working group.

Skype's SILK codec is a state of the art codec for low to moderate bitrate speech (yes, it's better than Speex). It isn't great for music, and it doesn't do very high quality, but it's fantastic for an important set of applications. The combination with CELT gives good performance from 6kbit/sec to transparency.

A few current CELT users

mumble
teamspeak
fmod
numerous radio station transmitter links (long a stronghold of MPEG layer 2)
wireless audio hardware [that sadly can't be named]

Monty