Overview

Hydrogenaudio conducted a 64kbit/sec multiformat listening test including Opus, aoTuV Vorbis, two HE-AAC encoders, and a 48kbit/sec AAC-LC low anchor. The official results are available at http://listening-tests.hydrogenaudio.org/igorc/results.html.

The test compared 30 diverse samples: 15 known-difficult samples used in prior HA listening tests and another 15 selected by the test organizers. The highly sensitive ABC/HR methodology was used, in which every codec under consideration is paired with a hidden copy of the original audio in order to prevent and detect cases where listeners are just guessing on samples which are close to perfect. Listeners rated each codec's impairment on a scale from "Imperceptible" (5) to "Very annoying" (1).
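The hidden reference makes a simple sanity check possible. As a rough illustration only (this is not Hydrogenaudio's exact screening rule), a trial where the listener rates the hidden copy of the original *below* the codec under test suggests the listener could not reliably tell the two apart:

```python
def screen_trials(trials):
    """Illustrative ABC/HR post-screening sketch (hypothetical rule,
    not the actual HA procedure).

    trials: list of (codec_score, hidden_ref_score) pairs on the
    1 ("Very annoying") to 5 ("Imperceptible") impairment scale.
    Keeps only trials where the hidden reference was not rated
    below the codec under test.
    """
    return [(c, r) for (c, r) in trials if r >= c]

kept = screen_trials([(3.5, 5.0), (4.0, 3.0), (2.0, 5.0)])
# the middle trial is dropped: the listener rated the hidden
# reference worse than the codec, so they were likely guessing
```
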

This page includes an unofficial presentation of the results, complete with audio file playback, and some supplementary analysis by the developers of Opus.

From the Xiph.org developers

Opus is a codec designed for interactive applications, such as VoIP, telepresence, and remote jamming, that require very low latency. In this test Opus ran with 22.5ms of total latency, but the codec can go as low as 5ms. Making a codec for low latency requires serious tradeoffs which reduce efficiency, so it might seem a bit strange to test it against a collection of state-of-the-art codecs which are completely unsuitable for these low-latency applications.

When we started working on Opus (then known as CELT), we used the slogan "Why can't your telephone sound as good as your stereo?" and we weren't kidding. While we never expected in our wildest dreams to 'sound as good as your stereo' at the same bitrate, we tried to get as much efficiency as possible. When we exceeded the performance of MP3 (an older generation high-latency codec) in our first formal test several years ago we considered it a fantastic success, and we were later surprised when we found our low-bitrate result besting Vorbis. Now, these results demonstrate that Opus's performance against HE-AAC, one of the strongest (but highest-latency) codecs at this bitrate, is very strong: it bested the quality of two of the most popular and respected encoders for the format on the majority of individual audio samples and received a higher average score overall.

Considering Opus's success in this test, perhaps we should have asked "Why can't your stereo be as interactive as your telephone?" instead. This kind of convergence is already possible due to multimedia-ready web-browsers, mobile phones which are really mobile computers, and the expanding reach of high-speed Internet. Opus will provide the standardized royalty-free format needed to unleash the broad potential of low-delay, high-quality, multi-party audio.

Opus isn't finished yet—the bitstream is in a soft-freeze, where we're trying not to break compatibility gratuitously. The process of finalizing Opus depends on the progress of the IETF codec working group where we've been collaborating on it with technologists from a broad cross-section of perspectives and potential uses. The IETF uses an open, consensus-based process, and more people are always welcome to come and help us finish the work. Developers who would like to incorporate Opus into their applications are particularly encouraged to join.

All results averages

These are the mean scores over all valid submissions:
Opus 3.999, Apple_HE-AAC 3.817, Nero_HE-AAC 3.547, Vorbis 3.513, AAC-LC@48k 1.656
Note: The test used an unbalanced design in which different samples were rated by different numbers of listeners. Combining the results directly would bias them toward the samples which received more listeners. The bias is fairly small, but we decided to correct for it here: the graph fills in the missing sample/listener values by blocked bootstrapping from the values provided by the other listeners, and the confidence intervals are generated from the bootstrapped values.
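To show the general idea of bootstrapped confidence intervals, here is a minimal percentile-bootstrap sketch. It is deliberately simplified: the real analysis used *blocked* bootstrapping that also resamples values for the missing sample/listener combinations, which this toy version does not do.

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score.

    Simplified sketch only: resamples the flat list of scores with
    replacement and takes the alpha/2 and 1-alpha/2 percentiles of
    the resampled means.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```
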


Complete listeners averages

These are the mean scores from the 10 listeners who submitted valid results for all 30 samples:
Opus 3.924, Apple_HE-AAC 3.679, Nero_HE-AAC 3.341, Vorbis 3.270, AAC-LC@48k 1.395

Bitrate summary

This test was an unconstrained VBR test: each codec was adjusted to achieve the same average rate over a large corpus, but encoders were free to use more or less rate on each sample according to their own analysis; better VBR encoders give more bits to more difficult tracks. Because the samples used in this test are fairly difficult ones, it is expected that the encoders should tend toward higher rates.

The boxplot provides some insight into how much of the performance differences were due to rate control differences. Because the Opus encoder was mostly designed for CBR and tightly constrained VBR, it doesn't make much use of VBR here.

These figures are overall file sizes vs duration, so they include all container overheads.
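The per-file figures reduce to a bitrate in the obvious way; note that because the whole file size is used, container overhead is counted as part of each codec's rate:

```python
def avg_bitrate_kbps(file_size_bytes, duration_seconds):
    """Overall average bitrate as measured in this test: whole file
    size (including container overhead) divided by duration."""
    return file_size_bytes * 8 / duration_seconds / 1000.0

# e.g. a 240000-byte file lasting 30 seconds averages 64 kbit/s
print(avg_bitrate_kbps(240000, 30))
```
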

Per-sample Condorcet results

In this analysis, a Condorcet winner (+) is a codec which wins at least one pairwise comparison and either wins or is statistically insignificant (p > 0.05) in every other; a Condorcet loser (-) is a codec which loses at least one pairwise comparison and either loses or is statistically insignificant (p > 0.05) in every other; and a mixed result (%) is a codec which has both statistically significant wins and losses. A question mark (?) marks a codec with no statistically significant wins or losses.
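The legend above reduces to a simple classification over a codec's pairwise outcomes on a sample; a minimal sketch (the function name and 'win'/'loss'/'tie' encoding are ours, not from the analysis scripts):

```python
def classify(pairwise):
    """Classify a codec's per-sample Condorcet status.

    pairwise: list of 'win', 'loss', or 'tie' outcomes against the
    other codecs, where 'tie' means statistically insignificant
    (p > 0.05). Returns '+', '-', '%', or '?' per the legend above.
    """
    wins = 'win' in pairwise
    losses = 'loss' in pairwise
    if wins and losses:
        return '%'   # mixed result: significant wins and losses
    if wins:
        return '+'   # Condorcet winner
    if losses:
        return '-'   # Condorcet loser
    return '?'       # nothing statistically significant
```
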

Significance is measured here using a permutation test on each sample across all listeners. The low anchor is excluded. Within each sample the significance is corrected for multiple comparisons, but the overall chart currently is not. As a result the significance shown may be somewhat exaggerated, though most results are unchanged even under a worst-case correction.
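For readers unfamiliar with permutation tests, here is a small sketch of the kind of test described above: a two-sided paired permutation test on per-listener score differences. This illustrates the technique only; it is not the actual analysis code (which is in the raw-data archive below).

```python
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation test: under the null hypothesis
    that the two codecs sound the same, each listener's score
    difference is equally likely to have either sign, so we compare
    the observed |sum of differences| against random sign flips."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            count += 1
    return count / n_perm  # p-value
```
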

Sample         01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
Opus            -  -  +  +  +  +  +  +  ?  +  +  +  +  +  +
Apple HE-AAC    +  +  %  +  %  -  -  -  ?  -  ?  -  -  ?  -
Nero HE-AAC     +  +  -  -  -  +  -  -  ?  %  -  -  -  ?  -
Vorbis          -  %  -  -  %  -  +  ?  ?  +  ?  -  -  -  -

Sample         16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Opus            +  +  +  -  %  +  +  +  +  ?  -  ?  %  ?  +
Apple HE-AAC    %  +  %  +  +  -  +  -  ?  +  +  ?  +  +  +
Nero HE-AAC     -  -  -  %  %  -  -  -  -  -  +  ?  %  ?  -
Vorbis          %  -  -  -  -  ?  ?  ?  -  -  -  ?  -  -  +


Per-sample averages

Clicking on a sample number or average will allow you to hear the sample if your browser has wav support (e.g. Firefox 4) and you have enough bandwidth (or patience) to stream uncompressed audio.

Sample        01    02    03    04    05
Opus          2.63  2.30  4.72  4.61  4.47
Apple HE-AAC  4.16  3.95  3.93  4.49  3.77
Nero HE-AAC   4.22  4.14  3.48  3.78  3.46
Vorbis        2.51  2.90  3.34  4.04  4.11
AAC-LC@48k    1.78  1.57  2.44  2.10  1.86

Sample        06    07    08    09    10
Opus          3.90  3.43  4.47  3.89  4.04
Apple HE-AAC  3.17  2.85  4.04  3.83  1.91
Nero HE-AAC   3.54  2.87  3.72  3.65  2.56
Vorbis        3.19  3.81  3.97  3.53  4.21
AAC-LC@48k    1.42  1.84  2.05  1.53  1.36

Sample        11    12    13    14    15
Opus          4.36  4.05  4.54  4.22  4.75
Apple HE-AAC  4.12  3.00  3.79  3.94  4.07
Nero HE-AAC   3.87  2.99  3.54  3.73  3.77
Vorbis        4.07  3.07  3.76  3.39  4.06
AAC-LC@48k    1.74  1.62  1.44  1.39  1.98

Sample        16    17    18    19    20
Opus          4.35  4.41  4.05  3.82  4.22
Apple HE-AAC  3.00  4.10  3.55  4.87  4.66
Nero HE-AAC   2.63  3.17  2.85  4.37  4.04
Vorbis        3.92  3.25  2.56  3.58  3.30
AAC-LC@48k    1.77  1.59  1.54  1.91  1.65

Sample        21    22    23    24    25
Opus          4.39  4.21  4.03  4.11  3.77
Apple HE-AAC  3.46  4.22  3.41  3.81  4.32
Nero HE-AAC   3.40  3.18  3.02  3.51  3.75
Vorbis        3.73  3.72  3.49  3.18  3.55
AAC-LC@48k    1.40  1.14  1.19  1.38  1.59

Sample        26    27    28    29    30
Opus          2.88  3.86  3.89  4.03  4.66
Apple HE-AAC  4.13  3.86  4.35  4.31  4.28
Nero HE-AAC   3.93  3.81  3.96  4.06  3.48
Vorbis        2.86  3.54  3.28  3.51  4.39
AAC-LC@48k    1.61  1.36  1.49  1.57  1.86

Per-sample distribution

This boxplot shows the distribution of scores per sample averaged over all listeners.

Raw data

2011_multiformat_64kbit_test.tar.bz2
In addition to the decrypted raw results, this file includes the statistical analysis used here, as well as parsed extracts of the data suitable for further analysis.

Thanks

We'd like to thank Igor Dyakonov for conceiving and operating this test and Gian-Carlo Pascutto for working with Igor on test parameters and the analysis. The listeners especially deserve recognition: the 33 listeners listened to 6,180 audio clips, made 3,090 measurements, and (assuming a conservative 10 minutes per track) spent a total of 103 hours on this test. Without the hard and careful work of all the involved parties we wouldn't be able to have these clear and significant results.
Greg Maxwell (greg@xiph.org)