Mans Rullgard has written two long rants about the Ogg container in
the past few years. One made it to Slashdot, apparently on drama
potential alone. If you don't know what I'm talking about, don't
worry about it. I hadn't originally intended to respond to open
trolling, but the continued urging of many individuals has convinced
me it's important to rebut in some public form: earnest falsehoods
left unchallenged risk being accepted as fact.
Addressing each set of assertions inline:
The Ogg container format is being promoted by the Xiph Foundation for
use with its Vorbis and Theora codecs. ..........................
By way of clarification, Ogg is for all codecs. That said, the
Xiph.Org Foundation is a United States-based non-profit, and our
charter strongly suggests we restrict ourselves to advocating
unencumbered technologies.
...................................... Unfortunately, a number of
technical shortcomings in the format render it ill-suited to most, if
not all, use cases. This article examines the most severe of these
flaws.
As I show below, the article fails to establish any examples of
such flaws, except by rote assertion and spurious logic.
Overview of Ogg
The basic unit in an Ogg stream is the page consisting of a header
followed by one or more packets from a single elementary stream. A
page can contain up to 255 packets, and a packet can span any number
of pages. The following table describes the page header.
Field Size (bits) Description
capture_pattern 32 magic number "OggS"
version 8 always zero
header_type 8 flags (continuation, BOS, EOS)
granule_position 64 abstract timestamp
bitstream_serial_number 32 elementary stream number
page_sequence_number 32 incremented by 1 each page
checksum 32 CRC of entire page
page_segments 8 length of segment_table
segment_table variable list of packet sizes
Elementary stream types are identified by looking at the payload of
the first few pages, which contain any setup data required by the
decoders. For full details, see the official format specification.
This description of an Ogg page is accurate. The fields are easy to
verify against the published Ogg specification at Xiph.Org and the
RFCs covering the Ogg format.
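To make the table concrete, here is a minimal sketch (mine, not from
either article) of parsing the fixed 27-byte header and the lacing
values; the field order and types follow the published Ogg spec, and
the synthetic `hdr` at the end is purely illustrative:

```python
import struct

def parse_page_header(buf):
    """Parse the fixed 27-byte Ogg page header plus the segment table.

    All integers are little-endian; granule_position is signed per the
    Ogg specification.
    """
    (capture, version, flags, granulepos,
     serial, pageseq, crc, n_segments) = struct.unpack_from("<4sBBqIIIB", buf, 0)
    if capture != b"OggS":
        raise ValueError("missing capture pattern")
    lacing = buf[27:27 + n_segments]
    # Sum runs of lacing values into packet sizes; a value < 255 ends a packet.
    packets, size = [], 0
    for v in lacing:
        size += v
        if v < 255:
            packets.append(size)
            size = 0
    return {"granule_position": granulepos,
            "serial": serial,
            "sequence": pageseq,
            "checksum": crc,
            "packet_sizes": packets,
            "continued_size": size}  # non-zero if the last packet spans pages

# Example: a synthetic header for a page carrying one 520-byte packet.
hdr = struct.pack("<4sBBqIIIB", b"OggS", 0, 0, 1234, 7, 1, 0, 3) \
      + bytes([255, 255, 10])
```

Note that this demuxes the page with no codec knowledge at all, which
is the point made again further below.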
Ogg, legend tells, was designed to be a general-purpose container
"Legend tells us"? Ogg is not a dramatic, unknowable mystery
shrouded in the mists of time. I designed it. I'm alive and willing
to answer any questions about the format. Allow me this opportunity
to reiterate that Ogg was designed as a general purpose container.
....... To most multimedia developers, a general-purpose format is one
in which encoded data of any type can be encapsulated with a minimum
of effort.
The Ogg format defined by the specification does not fit this
description. For every format one wishes to use with Ogg, a complex
mapping must first be defined. ......................................
Some further elaboration from the horse's mouth: Mapping is a term
I coined for the process of formally documenting how a codec will be
placed into a container. Every container involves details beyond
'plop raw compressed frames into the container and you're done.' Some
details include specifying codec magic (eg, the "FOURCC" in AVI, the
'Magic' in Ogg), choosing an appropriate timebase (or how to convert
to the container's timebase), how one indicates keyframes/sync points,
how this data is submitted to the container, and so on. Mappings also
allow a given codec to take targeted advantage of the features offered
by a particular container. One example is mp3 in Matroska, where the
mapping specifies that the mp3 header is to be treated as
duplicated/compressed data. Mappings need only be specified once per
codec and container pair.
By definition, mapping must be done for any codec into any
container, even if the mapping is relatively trivial. This is true of
MP4/MOV, Matroska, Ogg, NUT, AVI, and every other container. Some
containers, like Ogg and Matroska, explicitly
describe and document mapping, as well as the codec mappings
themselves. Other containers document mappings but have no explicit
name for it. A few remainders like AVI neither institutionalize the
process of mapping, nor reliably document how codec data is contained,
leading to an 'anything goes' situation of widespread ambiguity and
incompatibility.
In short, every container has codec mappings whether they are
explicit or implicit or even well-formed. The Ogg project has a name
for the process. It is disingenuous to claim that Ogg is inferior to some
other container that requires these same decisions, but has no name
for the process, or worse, no process at all.
.............................. This mapping defines how to identify a
codec, how to extract setup data, and even how timestamps are to be
interpreted. All this is done differently for every codec. ..
It would be silly to do it over and over if it were the same every
time.
In order to correctly parse an Ogg stream, every such mapping ever
defined must be known.
This is commonly asserted by detractors, but it is a combination of
false and missing the point.
Ogg transport is based entirely on the page structure primitive,
described accurately above. There are no other structures in the
container transport itself. Higher level structures are built out of
pages, not built into them. All Ogg streams conform to this page
structure and all Ogg streams are parseable and demuxable
without knowing anything about the codec. "Drop the needle" anywhere
in an Ogg stream and start demuxing; you get the codec data out
without knowing anything about the codec. You possibly won't know
what exactly to do with that data without the codec mapping and the
data is possibly useless without the codec anyway, but that's true of
any container.
To avoid being accused of sidestepping the issue, I posit that the
actual [if unstated] objection is that the Ogg container does not
fully specify the granule position in the transport specification.
Beyond a few requirements, a codec mapping defines the granule
position spec for that codec's streams, not the Ogg spec. In theory,
this would mean that without codec knowledge or some other place to
find the granule position definition, a decoder missing the codec for
a given stream would not be able to determine the timestamp on the
stream that it is not capable of decoding anyway. In practice, the
granule position mapping does in fact exist in the stream metadata
within the Skeleton header (as it would be in
Matroska or NUT). Additionally, the Ogg design allows implementations
to ignore the pretty design theory and just do things the way other
containers do by building granule position calculation into the
muxing layer.
There are specific, considered reasons for the granulepos design that
take some space to explain accurately. Because Mr. Rullgard also wrote
a lengthy diatribe against Ogg timestamping, I'll
leave the explanation for there and link to it here when my response
to the other article is live.
Under this premise, a centralised repository of codec mappings would
seem like a sensible idea, but alas, no such thing exists. It is
simply impossible to obtain an exhaustive list of defined mappings,
which makes the task of creating a complete implementation somewhat
daunting.
The mappings exist; they are simply not all held in one place. As we do
not control all the codecs, we have not sought to control all the
mappings. It's also not clear that we should hold or promote
mappings for encumbered codecs (as per charter).
However, a centralized repository for
mappings is an obviously desirable thing. At present, codec mappings
are documented in the codec specifications themselves. A page of
simple links, which we should have, would address your objection.
Thus it is hardly a "severe flaw" in the container.
One brave soul, Tobias Waldvogel, ..................................
Brave soul? Was Tobias single-handedly staring down a Xiph panzer
division as he did this?
................................. created a mapping, OGM, capable of
storing any Microsoft AVI compatible codec data in Ogg files. This
format saw some use in the wild, but was frowned upon by Xiph, and it
was eventually displaced by other formats.
OGM used the Ogg page structure (mostly correctly) though with
private data for the VfW framework. The result was parseable as Ogg
container but containing an ugly Windows-specific hack. We objected
because it was not well formed and confused users who thought it was
regular Ogg. It was a quick and dirty fork.
For the record, Tobias later joined Xiph along with his DirectShow
filters and deprecated OGM. OGM is no longer supported in our
software.
True generality is evidently not to be found with the Ogg format.
The ad-hoc 'evidence' above fails to justify this conclusion.
A good example of a general-purpose format is Matroska. This container
can trivially accommodate any codec, all it requires is a unique
string to identify the codec. ..................................
In summary, mappings are a serious flaw in Ogg, but an advantage in
Matroska? Matroska mappings go into considerably more detail than the
bare FOURCC string implied above.
The problem with Matroska mappings is not that they exist, but that
they are not nearly detailed enough. This is not a flaw of the
Matroska container, merely the documentation, and I am certainly not
innocent of inadequate documentation myself. Ogg documentation is just
as bad and in places much worse. I assert that the single largest
problem in both Ogg and Matroska is the lack of sufficiently detailed,
high-quality documentation. Both projects describe what the container
is and how it is formatted. Neither project sufficiently documents
the proper way to use it.
............................. For codecs requiring setup data, a
standard location for this is provided in the container. ............
...as in Ogg. From the Ogg bitstream documentation, the stream starts with:
- The initial header for each stream appears in sequence, each
header on a single page. All initial headers must appear with no
intervening data (no auxiliary header pages or packets, no data
pages or packets). Order of the initial headers is unspecified. The
'beginning of stream' flag is set on each initial header.
- All auxiliary headers for all streams must follow. Order is unspecified.
The final auxiliary header of each stream must flush its page.
- Data pages for each stream follow, interleaved in time order.
an official list of codec identifiers is maintained, meaning all
information required to fully support Matroska files is available from
one place.
Detailed documentation (or the lack thereof) is vitally important;
however, it has little to do with the container design itself.
Mr. Rullgard claims to establish that Ogg is badly flawed, not that it
needs more documentation.
Matroska also has probably the greatest advantage of all: it is in
active, wide-spread use. .............................................
Ogg and Matroska share this advantage, though deployment only
slightly overlaps. Ogg, Vorbis and Theora are all in silicon and
firmware on countless portable devices. Matroska has seen penetration
into the home DVD player market. Both have nearly universal support
in third-party software players.
........................ Historically, standards derived from existing
practice have proven more successful than those created by a design
committee.
I'm not sure what this is meant to imply-- that I have multiple
personalities? I need a t-shirt that says "I AM COMMITTEE".
Ogg wasn't the product of a committee. I designed it.
That said, h.264 is the result of possibly the largest committee
the world has ever known. I think we all agree it's a great format,
even if many of us object to the thousands of patents involved.
Lastly, the critique so far does not mention or enumerate ways in
which Ogg breaks with established practice. Ogg is modelled loosely
on a simplified MPEG-TS/PS design. All the design elements, including
the ones to which Mr. Rullgard objects, appear at some point in other
containers.
When designing a container format, one important consideration is that
of overhead, i.e. the extra space required in addition to the
elementary stream data being combined. For any given container, the
overhead can be divided into a fixed part, independent of the total
file size, and a variable part growing with increasing file size. The
fixed overhead is not of much concern, its relative contribution being
negligible for typical file sizes.
As with the last section, the overhead discussion begins with a few
basic facts nearly anyone can agree with.
The variable overhead in the Ogg format comes from the page headers,
mostly from the segment_table field. This field uses a most peculiar
encoding, somewhat reminiscent of Roman numerals. ...............
It is a 'most peculiar' encoding, designed to have near-constant
overhead regardless of packet size. It is so ludicrous that Matroska
also adopted it. Atamido of #matroska estimated
in IRC that approximately half of Matroska streams use the "Xiph
lacing" (I assume this is a very round estimate, but it indicates that
the Matroska designers do not consider it so peculiar).
................................................. In Roman times,
numbers were written as a sequence of symbols, each representing a
value, the combined value being the sum of the constituent values.
The segment_table field lists the sizes of all packets in the
page. Each value in the list is coded as a number of bytes equal to
255 followed by a final byte with a smaller value. The packet size is
simply the sum of all these bytes. Any strictly additive encoding,
such as this, has the distinct drawback of coded length being linearly
proportional to the encoded value. A value of 5000, a reasonable
packet size for video of moderate bitrate, requires no less than 20
bytes to encode.
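The encoding just described is only a few lines to implement. A
sketch (mine, not from either article) that also confirms the 20-byte
figure for a 5000-byte packet:

```python
def lace(n):
    """Encode a packet size as Ogg lacing values: a run of 255 bytes
    followed by a terminating byte < 255; the values sum to the size."""
    return bytes([255] * (n // 255) + [n % 255])

def unlace(values):
    """Decode by simple summation (the 'Roman numeral' property)."""
    return sum(values)

# A 5000-byte packet takes 20 lacing bytes: 19 bytes of 255 plus 155.
# Note the zero terminator when the size is an exact multiple of 255.
```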
Though correct, this does not explore why such an encoding might be
desirable.
The issue with a typical variable length encoding that extends off
of the leading bits is that you burn at least a full bit of range even
in the shortest length encodings. Let's look at the EBML value
encoding in Matroska:
1xxx xxxx - value 0 to 2^7-2
01xx xxxx xxxx xxxx - value 0 to 2^14-2
001x xxxx xxxx xxxx xxxx xxxx - value 0 to 2^21-2
0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^28-2
0000 1xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^35-2
0000 01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^42-2
0000 001x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^49-2
0000 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx - value 0 to 2^56-2
It's not that one leading bit is expensive, it's that it reduces the
max length of a one-byte encoding from 254 to 126 (in Matroska, 0xff
is reserved, so the max value is 2^x-2 not 2^x-1). For example, if a
leading bit of '0' signifies 'one byte of length' and a bit of '1'
means 'extend to more bytes', then any length greater than 126 bytes
uses two bytes of length encoding.
This boundary turns out to be somewhat significant. Most low-rate
audio codecs tend to hover right around or just above this value, and
even video easily goes this low. So, using the NUT length encoding or
the Matroska EBML encoding, you nearly always add an extra byte to
each packet's length encoding in low rate streams. When you're
coding, eg, 150 byte packets, overhead due to Ogg lacing is 0.67% per
packet. Using Matroska EBML or NUT encoding, the length-encoding
overhead is 1.3% per packet. This is why Matroska also adopted the
Xiph lacing.
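The 126-byte boundary argument is easy to check numerically. A sketch
(my code; the byte counts follow the two schemes as described above):

```python
def xiph_len(n):
    # Lacing bytes for an n-byte packet: runs of 255 plus a terminator.
    return n // 255 + 1

def ebml_len(n):
    # Bytes for an EBML-style variable-length integer: each added length
    # byte costs one leading bit, so a one-byte value tops out at 2**7 - 2.
    k = 1
    while n > 2 ** (7 * k) - 2:
        k += 1
    return k

# For 150-byte packets (typical low-rate audio):
xiph = xiph_len(150) / 150   # 1 byte  -> ~0.67% overhead
ebml = ebml_len(150) / 150   # 2 bytes -> ~1.33% overhead
```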
That said, transOgg [the next rev of Ogg; if the Google VP8
leak is true, we'll have some breathing room soon to start more
aggressively developing it] will use a lacing that pivots off of value
252 rather than 255. In this way, we still avoid 'wasting' an entire
bit of numerical range when extending, but we avoid the runs of 255 to
which Mr. Rullgard objects. And, it's truly the best of both worlds,
which means there's no need for multiple optional encodings.
On top of this we have the 27-byte page header which, although paling
in comparison to the packet size encoding, is still much larger than
necessary. Starting at the top of the list:
No comparisons are offered against other formats. So far this all
has implied that Ogg's overhead is ludicrously high compared to other
containers. That's not the case; Ogg is among the lower-overhead
containers, yet guaranteed to be inherently streamable. A streamable
structure burns bytes as illustrated below, but Ogg overhead today
still hovers at around 0.6-0.7% for high-rate video and can always
capture from any point in under 128kB (usually around 4kB in
practice). For low rate anything, no other container currently
matches its efficiency. This is one reason the length encoding is
still the way it is; even with length-encoding runs of 255, it is an
insignificant enough thing that no one sane had previously cared.
* The version field could be disposed of, a single-bit marker
being adequate to separate this first version from hypothetical
future versions. One of the unused positions in the flags field
could be used for this purpose
Disposing of the version field may be a reasonable suggestion (eg,
the upcoming transOgg does omit the version field), though no
justification or pro/con is explored. The idea of a bit marker is
similarly not justified or explained. This suggests that Mr. Rullgard
is unaware of what the field was actually for.
The version field had originally been intended to allow multiple
Ogg page types tuned for different payloads to coexist in the same
stream. The Ogg container format froze much earlier than the Ogg
codecs did, and as the 2000s wore on it became clear we would only
ever use one page version (version zero). Again, the contribution to
overhead was negligible and it was left as is rather than break spec
and require every adopter to upgrade (a difficult thing when an
implementation you paid for is in hardware).
Moving forward, using a versioned capture pattern is perhaps more
sensible and this is the transOgg approach. Discussion of versioned
capture will be part of the transOgg docs.
* A 64-bit granule_position is completely overkill. 32 bits would
be more than enough for the vast majority of use cases. In
extreme cases, a one-bit flag could be used to signal an
extended timestamp field.
Presupposing that the granule position is intended only to be a
timestamp (which is not the case), 64 bits is hardly overkill as
practical use has demonstrated regularly. Similarly, using 64 bits
rather than 32 eliminates a conditionally triggered mechanism. Though
variable length and optional fields are not evil, there's no reason to
use them indiscriminately either. At some point, every unnecessary
mechanism just contributes to bug count.
However, the granule position is not simply a timestamp. It is a
synthetic value that encodes DTS, PTS and distance to first-needed
reference. The suggestion that it should be reduced from 64 bits
ignores a substantial portion of the Ogg design. Muxing, seeking, and
verification are all designed on top of the granule position
construct. Completely missing this design aspect demonstrates
unfamiliarity with the format being critiqued.
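As a concrete example of a mapping that uses the full 64 bits, the
Theora mapping splits the granule position into a keyframe count and a
count of frames since that keyframe, at a shift (KFGSHIFT) declared in
the stream header. A sketch in Python (mine; the shift value used in
the example is illustrative):

```python
def theora_split(gp, kfg_shift):
    """Split a Theora granule position into (keyframe number, frames
    since that keyframe); their sum is the absolute frame number."""
    iframe = gp >> kfg_shift
    pframe = gp & ((1 << kfg_shift) - 1)
    return iframe, pframe

def theora_frame(gp, kfg_shift):
    iframe, pframe = theora_split(gp, kfg_shift)
    return iframe + pframe

# With an illustrative KFGSHIFT of 6: keyframe 100, plus 3 delta frames.
gp = (100 << 6) | 3
```

The low bits directly give the distance back to the last keyframe,
which is what makes seeking to a decodable point cheap.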
For comparison purposes, Matroska Cluster timecodes are explicitly
declared in EBML. To signal presence of a timecode, two additional
bytes must be used. In this manner, Matroska timecodes are six bytes
even when storing only 32-bit values. To store a 64-bit timecode, 80
bits must be used. I do not consider Matroska's encoding overhead to
be a problem; it merely shows that Ogg's fixed 64-bit field is not out
of line.
* 32-bit elementary stream number? Are they anticipating files
with four billion elementary streams? An eight-bit field, if not
smaller, would seem more appropriate here.
The stream ID is intended to be used like a weak hash. If stream
ID numbers collide in a muxing or concatenation operation, altering
the stream ID number requires renumbering every page in the stream (this
would be the case in any other container as well), and also requires
the checksum on every page be recomputed. Having a large
pseudo-random ID space makes such collisions vanishingly unlikely,
eliminating the need for continuous recalculation of page headers at
every muxing step.
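"Vanishingly unlikely" is easy to quantify with the standard birthday
approximation (my sketch, not from the article):

```python
import math

def collision_prob(streams, id_bits=32):
    """Birthday approximation: probability that any two of `streams`
    uniformly random IDs drawn from 2**id_bits values collide."""
    pairs = streams * (streams - 1) / 2
    return 1 - math.exp(-pairs / 2 ** id_bits)

# Muxing 100 streams with 32-bit IDs: odds are roughly one in a million.
# With the suggested 8-bit field, ~40 streams collide almost surely.
```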
Recall that the Ogg design treats the pages of a stream like a deck
of cards; one multiplexes two streams by shuffling two decks
together with no other changes, making muxing and demuxing a
nearly trivial operation that can be performed on-the-fly with nearly
zero CPU on live streaming servers. The large stream ID is part of
what makes this possible.
* The 32-bit page_sequence_number is inexplicable. The intent is
to allow detection of page loss due to transmission errors. ISO
MPEG-TS uses a 4-bit counter per 188-byte packet for this
purpose, and that format is used where packet loss actually
happens, unlike any use of Ogg to date.
A 32-bit sequence number allows direct UDP unicast/multicast with Ogg
handling reordering and reassembly. The biggest reason for such a
large number is that the Ogg granule position can't be relied upon for
sequencing/ordering, especially when using low MTUs in which Ogg
packets could span UDP packets (and thus Ogg pages). The 32-bit
sequence number also allows keyed encryption without continuous
rekeying, and prevents moderate-length stream interruptions from
causing permanent loss of keying/capture when the sequence number
rolls over. The sequence number is also used, as you state, for gap
detection in other cases.
In transOgg, we're exploring the use of an extended granule
position that replaces the sequence field both for gap detection and
UDP ordering. It's not clear yet that will be enough.
* A mandatory 32-bit checksum is nothing but a waste of space when
using a reliable storage/transmission medium. Again, a flag
could be used to signal the presence of an optional checksum.
The checksum is part of the capture mechanism. I will note that
the NUT container (contributed to by Mr. Rullgard) uses a 64 bit
capture pattern. Ogg uses a 32 bit capture + 32 bit checksum for a
total of 64 bits. The captures have equivalent behavior, but Ogg also
gets error detection out of it.
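For reference, the Ogg checksum is a direct (non-reflected) CRC-32
with polynomial 0x04c11db7, zero initial value and no final XOR,
computed over the page with the checksum field zeroed. A bitwise
sketch (mine; real implementations use an equivalent table-driven
form):

```python
def ogg_crc(data):
    """Direct CRC-32 as used by Ogg pages: polynomial 0x04c11db7,
    init 0, no bit reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04c11db7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc
```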
Very occasional corruption does happen both in network transmission
and local file storage. I have personally had files corrupt due to
decayed spinning media. It is incorrect to claim that it never
happens.
With the changes suggested above, the page header would shrink from 27
bytes to 12 bytes in size.
It would also gut the container functionality. Assuming 4kB pages
(which is approximately what would be used in practice for audio with
low-rate video), the loss of functionality gains back 0.35%
overhead. For high-rate video (as page size climbs) the 'advantage'
from adopting these suggestions eventually drops to 0.02%. This is
cutting off your nose to spite your face.
For comparison purposes, an Ogg page is the conceptual equivalent
of a Matroska Cluster + the SimpleBlocks inside. A minimal Matroska
Cluster (containing only a single SimpleBlock, only one frame, no
checksum, 64 bit presentation timestamp, no references, no
optional/auxiliary fields or features, no data) is 23 bytes.
An Ogg page header is always 27 bytes, but it also provides
sequencing, CRC, strong capture, gap detection, DTS, and codec delay.
The actual overheads seen depend on the relative size (and thus
header frequency) of Matroska Blocks and Clusters compared to Ogg
pages, so header size alone means little.
We thus see that in an Ogg file, the packet size fields alone
contribute an overhead of 1/255 or approximately 0.4%. This is a hard
lower bound on the overhead, not attainable even in theory. In reality
the overhead tends to be closer to 1%.
Basically correct except that the practical overhead of Ogg files,
using libogg 1.2 as the muxer, is typically 0.6%-0.7% across the board.
Contrast this with the ISO MP4 file format, which can easily achieve
an overhead of less than 0.05% with a 1 Mbps elementary stream.
MP4 overhead climbs to Ogg levels when an MP4
file is remuxed such that it can be streamed (played via progressive
download); otherwise the file must be downloaded completely before
playback can begin. It is also not possible to stream live in MP4 at
all; the bitstream format simply does not have the feature. Corrupt
the index on an ultra-low overhead MP4 muxing, and you stand to lose
the whole file. In summary, such tight muxing has significant
drawbacks.
It is also odd to compare Ogg to a file format that is missing
features that render it unusable in the situations for which Ogg was
designed and is currently being used (eg, live streaming). Comparing
to Matroska is more reasonable. Matroska can stream, both live and in
progressive download, though the file may need to be muxed with
streaming in mind. In this situation, Matroska and Ogg overheads are
roughly comparable. The 'winner' depends on bitrate and mux latency.
transOgg, which will use the new lacing described above,
retains Ogg's 'always streamable' design and currently reduces
theoretical minimum overhead to 0.035%. As with other low-overhead
containers, this number is achievable but probably not in any truly
useful case. When comparing apples to apples, most of the containers
in wide use today have similar overhead numbers even when the
theoretical minimums vary widely.
In many applications end-to-end latency is an important
factor. Examples include video conferencing, telephony, live sports
events, interactive gaming, etc. With the codec layer contributing as
little as 10 milliseconds of latency, the amount imposed by the
container becomes an important factor.
It is jarring to complain about high overhead, then immediately
demand low-latency performance. The same container typically is not
used in both low-overhead and low-latency applications as overhead and
latency are a nearly direct tradeoff. Low latency containers (such as
MPEG-TS, or if you think about it as a container, RTP) are all
fantastically high overhead. It is not absurd for an RTP stream, for
example, to exceed overhead figures of 25%. It is inescapable.
Ogg is not optimal for low and ultra-low latency applications,
though it can still be used effectively just as can any of the other
low-overhead containers (except MP4; it can't stream live at all).
The overhead figures will be relatively high for all of the
containers, and Ogg is no exception though it will not have the
highest overhead. Despite this, Ogg is the only container discussed,
as if to imply this 'problem' is unique to Ogg.
Latency in an Ogg-based system is introduced at both the sender and
the receiver. Since the page header depends on the entire contents of
the page (packet sizes and checksum), a full page of packets must be
buffered by the sender before a single bit can be transmitted. ....
In a low-latency application, it is likely that no container, Ogg
included, would be buffering more than a single packet. Thus, pages
would be transmitted containing a single packet. As all containers
achieve low overhead by bundling packets into shared structures and
spreading Page/Cluster/What-have-you overhead across all the packets
in the unit, this results in much higher overhead for all
containers. Again, Ogg is not an exception.
This sets a lower bound for the sending latency at the duration of a page.
On the receiving side, playback cannot commence until packets from all
elementary streams are available. Hence, with two streams (audio and
video) interleaved at the page level, playback is delayed by at least
one page duration (two if checksums are verified).
As presented, this makes no sense. How does interleave increase
latency except by conflating fixed bandwidth and latency? In
addition, checksumming does not double the latency 'with two streams'.
Audio and video are wholly independent. Packets are delivered as they
arrive.
Taking both send and receive latencies into account, the minimum
end-to-end latency for Ogg is thus twice the duration of a page,
triple if strict checksum verification is required. .................
Again, this appears to make no sense. The latency is exactly equal to
encoder latency + decoder latency + physical duration of a single
packet (how long it took to capture) + transmission latency. Checksum
has nothing at all to do with it.
................................................... If page durations
are variable, the maximum value must be used in order to avoid buffer
underruns.
Minimum latency is clearly achieved by minimising the page duration,
which in turn implies sending only one packet per page. This is where
the size of the page header becomes important. The header for a
single-packet page is 27 + packet_size/255 bytes in size. For a 1 Mbps
video stream at 25 fps this gives an overhead of approximately
1%. With a typical audio packet size of 400 bytes, the overhead
becomes a staggering 7%. The average overhead for a multiplex of these
two streams is 1.4%.
These 'staggering' figures are representative of other containers
as well. As mentioned earlier, it's not unusual for RTP stream
headers to make up 25% of the data transmitted, though it would be
lower in this example. 1.4% overhead for single-packet latencies is ~
nothing, especially when using Ogg in its worst possible case.
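For what it's worth, the arithmetic in the quoted worst case checks
out (my sketch):

```python
def single_packet_page_overhead(packet_bytes):
    """Container bytes per packet when each Ogg page carries exactly one
    packet: the fixed 27-byte header plus that packet's lacing bytes."""
    header = 27 + packet_bytes // 255 + 1
    return header / packet_bytes

video = single_packet_page_overhead(5000)  # 1 Mbps at 25 fps -> ~0.94%
audio = single_packet_page_overhead(400)   # typical audio packet -> ~7.25%
```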
As it stands, the Ogg format is clearly not a good choice for a
low-latency application. The key to low latency is small packets and
fine-grained interleaving of streams, and although Ogg can provide
both of these, by sending a single packet per page, the price in
overhead is simply too high.
"Clearly" based on what criteria? Typical Ogg overhead in an
audio/video stream is ~ 0.6-0.7%. With older muxers that figure was
closer to 1.1%. In the supposedly pathological scenario outlined
above, chosen to prove the latency point and make Ogg look bad, the
figure balloons to a portly 1.4%.
MPEG-TS, the container used to store audio and video on Blu-Ray,
starts out at 2.1% overhead [unachievable theoretical minimum] and
climbs steeply from there. If 1.4% is simply too high a price, I can
only imagine what a complete technical failure Blu-Ray must be.
ISO MPEG-PS has an overhead of 9 bytes on most packets (a 5-byte
timestamp is added a few times per second), and Microsoft's ASF has a
12-byte packet header. My suggestions for compacting the Ogg page
header would bring it in line with these formats.
What happened to using MP4 and Matroska for comparison? Possibly
they're not mentioned because MP4 cannot perform low-latency streaming
at all (actually impossible in the format) and Matroska's numbers are
similar to Ogg. Since the goal is to make Ogg look bad, we're now
comparing against MPEG-PS and ASF, which are offered for comparison
nowhere else in the article.
The byte-overheads offered for MPEG-PS and ASF are difficult to
verify against their specifications, as there are many conditional/optional fields in
both formats depending on intended use and the codecs to be contained.
ASF does not have a '12-byte header', the header is variable depending
on the stream options and codecs in use. MPEG-PS particularly defines
pages upon pages of customizations for each use case and codec/stream
type it contains.
So, let's measure an MPEG-PS file in its most common habitat: the
DVD. This is local-storage and not a low-latency scenario, so it
allows a more efficient encoding with far lower overhead than the
low-latency single-frame case. On the first three commercial DVDs
I've checked, the MPEG-PS overhead is over 2% despite the fact that
the codecs are providing their own framing, something the Ogg
container is wholly responsible for in the Ogg case. In other words,
MPEG-PS is performing far worse than Ogg in an easier case. Nine
supposed bytes of overhead on most packets isn't telling anywhere near
the whole story.
Next, let's look at ASF. Remuxing each DVD (audio and video) into
ASF format produces an ASF file with approximately 1.5%
overhead. Checking several professionally produced 1Mbps ASF files
(as opposed to trying to mux them myself using ffmpeg) yields a figure
between 0.7% and 0.8% overhead, just a little higher than an Ogg also
muxed for local playback.
The lesson here is that the Ogg high-overhead outcry is a complete
fabrication.
Any general-purpose container format needs to allow random access for
direct seeking to any given position in the file. Despite this goal
being explicitly mentioned in the Ogg specification, the format only
allows the most crude of random access methods.
The primary random access method used in Ogg is an interpolated
bisection search, the same as used in Matroska and NUT.
While many container formats include an index allowing a time to be
directly translated into an offset into the file, Ogg has nothing of
this kind, ...........................................................
There is no index specified as part of the container low-level
transport mechanism, as Ogg abstracts transport and metadata into two
layers. The index is part of the stream metadata and strictly
optional in all cases, as the index only noticeably improves seek
performance in narrow interactive cases, such as HTTP range requests
over a satellite or WWAN link.
.......... the stated rationale for the omission being that this would
require a two-pass multiplexing, the second pass creating the
index. This is obviously not true; the index could simply be written
at the end of the file. Those objecting that this index would be
unavailable in a streaming scenario are forgetting that seeking is
impossible there regardless.
It is absolutely true that I resisted having an index of any sort
in Ogg. Front- versus end-positioning of the index is a secondary
concern, born of a non-public background argument between Xiph and
other groups unwilling to support an end-positioned index. Putting it
at the beginning breaks one-pass muxing.
That aside, my primary reasons for resisting an index are more
indirect and pragmatic:
- An index is only marginally useful in Ogg for the complexity
added; it adds no new functionality and seldom improves performance
noticeably. Why add extra complexity if it gets you nothing?
- 'Optional' indexes encourage lazy implementations that can seek
only when indexes are present, or that implement indexless seeking
only by building an internal index after reading the entire file
beginning to end.
Matroska, for example, supports indexless seeking using the same
basic algorithm and mechanisms as Ogg, and has also always embraced
an optional index. Although indexless seeking support in Matroska is
mandatory and the index optional, more Matroska implementations
appear to support the index than the mandatory indexless method. Ogg
appeared earlier, and I worried that this might be the outcome of
specifying an optional index, so I avoided one. The Matroska result
suggests I might have been right. Unfortunately, there are now some
new use cases that finally make an index necessary.
The method for seeking suggested by the Ogg documentation is to
perform a binary search on the file, after each file-level seek
operation scanning for a page header, extracting the timestamp, and
comparing it to the desired position. ..........................
A binary search is discussed in the spec for ease of comprehension;
implementation documents suggest an interpolated bisection search. So
far, this is the same as Matroska and NUT.
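For illustration, the core of such an interpolated bisection can be
sketched in a few lines. This is not libogg's actual implementation;
`timestamp_at` is a hypothetical stand-in for the real
scan-forward-to-the-next-page-and-read-its-timestamp step:

```python
# Sketch of interpolated bisection over a byte range keyed by time.
# timestamp_at(pos) is assumed to scan forward from byte offset pos
# to the next page header and return that page's timestamp.

def seek_time(target, size, timestamp_at, duration, tolerance=0.5):
    lo, lo_t = 0, 0.0
    hi, hi_t = size, duration
    while hi - lo > 1:
        # Interpolate instead of taking the midpoint: guess where the
        # target lies assuming a locally-constant bitrate.
        frac = (target - lo_t) / (hi_t - lo_t) if hi_t > lo_t else 0.5
        guess = lo + max(1, min(hi - lo - 1, int(frac * (hi - lo))))
        t = timestamp_at(guess)
        if abs(t - target) <= tolerance:
            return guess
        if t < target:
            lo, lo_t = guess, t
        else:
            hi, hi_t = guess, t
    return lo

# Toy stand-in: a constant-bitrate "file" where time = bytes / 1000.0
probes = []
def fake_timestamp(pos):
    probes.append(pos)
    return pos / 1000.0

pos = seek_time(30.0, 60_000, fake_timestamp, 60.0)
print(len(probes), "probe(s)")   # constant bitrate: converges at once
```

The interpolation is why real-world probe counts stay in the low
single digits: bitrate is usually close enough to locally constant
that the first guess lands near the target.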
..................................... When the elementary stream
encoding allows only certain packets as random access points (video
key frames), a second search will have to be performed to locate the
entry point closest to the desired time. ..........................
By way of clarification, in the event that the result of the first
search does not land at a sync point, that first result does contain
the location of the sync point. Typically only one additional seek is
required to find it. This differs from Matroska in that the distance
to the preceding syncpoint in Matroska is not declared [is there an
undocumented declaration? Or is it just assumed that Matroska clusters
should always be big enough to hold a keyframe? Documentation needed!]
........................................ In a large file (sizes
upwards of 10 GB are common), 50 seeks might be required to find the
desired position.
Demonstrably false. All you need to do is add a line that prints
'seek!' to any popular player software and perform some
scrubbing/searching to see that '50 seeks might be required' is
between 45 and 49 seeks too high, and that's for exact positioning,
not mere scrubbing.
The Vorbis source distribution includes an example program
called 'seeking_example' that does a stress-test of 5000 seeks of
different kinds within an Ogg file. Testing here with SVN r17178,
5000 seeks within a 10GB Ogg file constructed by concatenating 22
short Ogg videos of varying bitrates together results in 17459 actual
seek system calls. This yields a result of just under 3.5 real seeks
per Ogg seek request when doing exact positioning within an Ogg
file. Most actual seeking within an Ogg file would be more
appropriately implemented by scrubbing with a single physical
seek. This is the way mplayer seeks in Ogg, or the way seeking is
often done on a DVD.
A typical hard drive has an average seek time of roughly 10 ms, giving
a total time for the seek operation of around 500 ms, an annoyingly
long time. On a slow medium, such as an optical disc or files served
over a network, the times are orders of magnitude longer.
Leaving aside for a moment that this entire argument so far has
been refuted by measuring actual performance, latencies are longer
still over WWAN, satellite, or seeking over HTTP range requests. Yet
this seek system still works acceptably even in these ultra-high
latency cases. The actual number of physical seeks is low, far lower
than the unverified hand-wave guesstimation of 50. That said, when
latency gets this high, an index finally becomes obviously useful
enough to be worthwhile. It's the HTTP-over-satellite case that
finally convinced me that an index is a legitimate need.
A factor further complicating the seeking process is the possibility
of header emulation within the elementary stream data. To safeguard
against this, one has to read the entire page and verify the
checksum. If the storage medium cannot provide data much faster than
during normal playback, this provides yet another substantial delay
towards finishing the seeking operation. This too applies to both
network delivery and optical discs.
This ignores the fact that on all modern media, latency is almost
entirely in the seek. A seek plus a small read (a few bytes to a few
kB) is no faster than a seek plus a big read (a few kBytes to a few
pages). This is true even of HTTP requests. Mr. Rullgard's argument
is convincing only until one realizes that the complaint is not
supported by actual measurement.
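For the record, the verification step itself is simple. Ogg's page
checksum is a straight CRC-32 (polynomial 0x04c11db7, zero initial
value, no bit reflection, computed with the checksum field zeroed),
and resynchronizing after a blind seek is a scan-and-verify loop. A
sketch, not production code:

```python
# Sketch: resynchronize after a blind seek by scanning for the capture
# pattern, then confirming with the page CRC to reject header emulation.
# Ogg's CRC-32: polynomial 0x04c11db7, init 0, no reflection, no final xor.

def ogg_crc(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ 0x04c11db7 if crc & 0x80000000
                   else crc << 1) & 0xffffffff
    return crc

def find_page(buf: bytes, start: int = 0):
    """Return the offset of the first verified page at/after start, or None."""
    pos = buf.find(b"OggS", start)
    while pos != -1:
        if pos + 27 <= len(buf):
            nsegs = buf[pos + 26]
            end = pos + 27 + nsegs + sum(buf[pos + 27:pos + 27 + nsegs])
            if end <= len(buf):
                candidate = bytearray(buf[pos:end])
                stored = int.from_bytes(candidate[22:26], "little")
                candidate[22:26] = b"\0\0\0\0"   # CRC is computed with field zeroed
                if ogg_crc(bytes(candidate)) == stored:
                    return pos
        pos = buf.find(b"OggS", pos + 1)         # emulated header: keep scanning
    return None

# Demo: a minimal one-packet page (CRC filled in last), preceded by a
# decoy capture pattern that fails validation.
page = bytearray(b"OggS" + bytes(22) + bytes([1, 5]) + b"hello")
page[22:26] = ogg_crc(bytes(page)).to_bytes(4, "little")
buf = b"OggS" + b"\xff" * 30 + bytes(page)       # decoy "OggS" at offset 0
print(find_page(buf))                            # 34: the real page
```

The whole check is one page-sized read per candidate, which, as noted
above, costs essentially nothing beyond the seek itself.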
Although optical disc usage is perhaps in decline today, one should
bear in mind that the Ogg format was designed at a time when CDs and
DVDs were rapidly gaining ground, and network-based storage is most
certainly on the rise.
This is a bit random. I'm not sure what it's trying to say... "Ogg
used to be awful" but "Ogg is only kind of awful right now" but "Ogg
will become really awful again, so watch out."
The final nail in the coffin of seeking is the codec-dependent
timestamp format. At each step in the seeking process, the timestamp
parsing specified by the codec mapping corresponding to the current page
must be invoked. If the mapping is not known, the best one can do is
skip pages until one with a known mapping is found. This delays the
seeking and complicates the implementation, both bad things.
This conclusion does not stand up.
If one chooses to ignore the granule position mapping specified
in the header and to calculate timing only via a software codec
mapping (this is indeed the original design as I suggested it), it
is true that a missing codec renders that logical stream undecodable,
and the timestamping for just that logical stream is lost as well.
This does not break seeking in any way, but it does mean that one
can't make timing decisions based on the undecodable pages.
It turns out that this affects measured timing almost unnoticeably
when, for example, the primary audio or video codec is entirely
missing. Again, this is because when latencies are high, the latency
is in the seek, not the read. When a non-primary codec is missing
(eg, a subtitle codec), the timing difference can't be measured at
all.
Not to mention, a missing primary codec is not the typical mode of
operation. Most users don't continue to watch a DVD if the video is
missing.
A problem old as multimedia itself is that of synchronising multiple
elementary streams (e.g. audio and video) during playback; badly
synchronised A/V is highly unpleasant to view. By the time Ogg was
invented, solutions to this problem were long since explored and
well-understood. The key to proper synchronisation lies in tagging
elementary stream packets with timestamps, packets carrying the same
timestamp intended for simultaneous presentation. The concept is as
simple as it seems, so it is astonishing to see the amount of
complexity with which the Ogg designers managed to imbue it. So
bizarre is it, that I have devoted an entire article to the topic, and
will not cover it further here.
As such, I also will address that writing later. The results will
be similar to the wholesale dismantling of the present article.
The summary, though, is that Ogg encapsulates in DTS order, and
encodes PTS, DTS and reference distance. NUT encapsulates in DTS
order and encodes PTS and DTS. Matroska encapsulates in DTS order and
encodes only PTS. Everything further is implementation details. It's
not particularly complicated, but we'll get to that particular set of
Mr. Rullgard's objections later.
Video and audio decoding are time-consuming tasks, so containers
should be designed to minimise extra processing required. With the
data volumes involved, even an act as simple as copying a packet of
compressed data can have a significant impact. Once again, however,
Ogg lets us down. Despite the brevity of the specification, the format
is remarkably complicated to parse properly.
I will suggest that those who are willing to grant without scrutiny
the assertion that Ogg is "remarkably complicated to parse properly"
go take a look at the published specifications for a few other
containers.
Note that Mr. Rullgard contributes to (contributed to?) the NUT design.
The unusual and inefficient encoding of the packet sizes limits the
page size to somewhat less than 64 kB. .........................
This is backwards. The limited page size allows the specific
encoding, not the other way around. Even in transOgg, which uses a
different encoding that could trivially allow much larger pages, the
page size is still limited to approximately 64kB. The limited size is
arbitrary and intentional in order to deliver on capture guarantees.
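For reference, the encoding in question is trivial: a packet's size
is written as a run of 255-valued lacing bytes closed by one byte
less than 255, and the segment table holds at most 255 entries, which
is where the page bound comes from. A sketch:

```python
# Sketch of Ogg lacing values: a packet length is stored as N bytes
# of 255 followed by a final byte < 255. A packet whose length is an
# exact multiple of 255 therefore ends with a 0 byte.

def encode_lacing(packet_len: int) -> list[int]:
    return [255] * (packet_len // 255) + [packet_len % 255]

def decode_lacing(values: list[int]) -> int:
    # The terminating value < 255 marks the packet boundary.
    return sum(values)

print(encode_lacing(1000))   # [255, 255, 255, 235]
print(encode_lacing(510))    # [255, 255, 0]
```

With at most 255 lacing bytes per page, the payload a single page can
describe tops out just under 255 * 255 bytes, hence the ~64kB limit.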
...................................... To still allow individual
packets larger than this limit, it was decided to allow packets
spanning multiple pages, a decision with unfortunate implications.
Again, the cause and effect is backwards. Packets don't span pages
because of limitations of encoding, they span pages so that there's
guaranteed structure in the stream that doesn't require an unbounded
search to detect.
Page spanning becomes necessary when any single stream in the
multiplex reaches relatively high bitrates. In order to ground this
particular point in some actual numbers (using the libogg 1.2 muxer
as a reference), 30fps video packets would begin spanning at about
...................................................................
A page-spanning packet as it arrives in the Ogg stream will be
discontiguous in memory, a situation most decoders are unable to
handle, and reassembly, i.e. copying, is required.
No. An implementation can obviously choose, at the very least,
between iovecs and contiguous buffers assembled via an extra copy. A
copy is not necessary and iovecs are not exotic. Zero-copy
implementations of Ogg can be seen in Tremor and the internally-used
libogg2.
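An iovec-style packet is nothing more than a list of (buffer, offset,
length) references into the pages already sitting in memory. A toy
illustration (Python memoryviews standing in for C iovecs; the names
are mine, not Tremor's):

```python
# Sketch: a page-spanning packet kept as a list of buffer slices
# (iovec-style) rather than reassembled by copying.

class SpannedPacket:
    def __init__(self):
        self.segments = []           # memoryviews into page buffers

    def add(self, page_buf: bytes, off: int, length: int):
        # No copy: just remember where the bytes live.
        self.segments.append(memoryview(page_buf)[off:off + length])

    def __len__(self):
        return sum(len(s) for s in self.segments)

    def tobytes(self) -> bytes:
        # Only if a flat copy is truly needed by the consumer.
        return b"".join(s.tobytes() for s in self.segments)

pkt = SpannedPacket()
page1, page2 = b"...header...hello ", b"world...header..."
pkt.add(page1, 12, 6)               # "hello " from the end of page 1
pkt.add(page2, 0, 5)                # "world" continued on page 2
print(pkt.tobytes())                # b'hello world'
```

A decoder written against such an interface never touches the copy
path; the flattening method exists only for consumers that insist on
contiguous input.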
The knowledgeable reader may at this point remark that the MPEG-TS
format also splits packets into pieces requiring reassembly before
decoding. There is, however, a significant difference there. MPEG-TS
was designed for hardware demultiplexing feeding directly into
hardware decoders. In such an implementation the fragmentation is not
a problem. Rather, the fine-grained interleaving is a feature allowing
smaller on-chip buffers.
Why is the MPEG-TS rationale granted as reasonable and the Ogg
rationale excluded without any explanation?
Buffering is also an area in which Ogg suffers. To keep the overhead
down, pages must be made as large as practically possible, and page
size translates directly into demultiplexer buffer size. Playback of a
file with two elementary streams thus requires 128 kB of buffer space.
This proposes poor muxing behavior. It is not necessary to make
pages "as large as practically possible" to "keep the overhead down".
Overhead is reduced only slightly by moving from sensible muxing
behavior to the proposed absurdity above. The only time page sizes
should approach the maximum is when the compressed frames are
themselves approaching the maximum, implying high-bitrate streams. In
this case, the amount of working memory required for decode typically
dwarfs the demultiplexer's buffering requirements anyway.
This is no different from Matroska or MPEG or any other
container. A muxing strategy trades off latency and buffering
against overhead in all containers.
...... On a modern PC this is perhaps nothing to be concerned about,
but in a small embedded system, e.g. a portable media player, it can
be significant.
It can indeed be relevant, and any competent engineer has numerous
tools and techniques at his fingertips to implement a solution, as
would be required for the other containers as well. Nothing here is
specific to Ogg.
In addition to the above, a number of other issues, some of them
minor, others more severe, make Ogg processing a painful experience. A
few examples:
* 32-bit random elementary stream identifiers mean a simple
table-lookup cannot be used. Instead the list of streams must be
searched for a match. While trivial to do in software, it is
still annoying, and a hardware demultiplexer would be
significantly more complicated than with a smaller identifier.
Mr. Rullgard objects to a feature that exists for a stated reason,
not because he thinks the reason is invalid but because the
feature is annoying. I doubt any changes made to Ogg, no matter
how extensive, could avoid that fate.
* Semantically ambiguous streams are possible. For example, the
continuation flag (bit 1) may conflict with continuation (or
lack thereof) implied by the segment table on the preceding
page. Such invalid files have been spotted in the wild.
It is possible to generate invalid Ogg streams, just like it is
possible to generate invalid examples of every other container.
* Concatenating independent Ogg streams forms a valid
stream. While finding a use case for this strange feature is
difficult, an implementation must of course be prepared to
encounter such streams. Detecting and dealing with these adds
complexity.
Concatenating streams together into new valid streams is also a
feature of Matroska, which Mr. Rullgard earlier refers to as a
good general purpose format.
There's actually plenty to say about chained (concatenated)
streams, how best to spec and implement them, and whether they are in
fact worth the complexity. However, nothing insightful is added to
that discussion here, merely the naked opinion that it is 'pointless'.
* Unusual terminology: inventing new terms for well-known concepts
is confusing for the developer trying to understand the format
in relation to others. A few examples:
Ogg name -> Usual name
When Xiph started out in the early nineties, MPEG was hardly
dominant. To complain today that we did not internally adopt MPEG
terminology nearly 20 years ago is looking back with 20/20 hindsight.
Had RealNetworks remained the 600-lb gorilla they were 10 years ago,
would the complaint instead be that we aren't using Real's terminology?
logical bitstream -> elementary stream
They don't mean the same thing, as it's ambiguous in MPEG
usage whether an elementary stream is framed or unframed. In Ogg usage,
the 'logical bitstream' refers to unframed data belonging to a given
codec in an elementary or multiplexed stream. An 'elementary stream' is
a framed stream containing one 'logical stream'.
grouping -> multiplexing
This usage was redacted and replaced with multiplexing.
lacing value -> packet size (approximately)
These are not the same thing. The packet size is the combination
of potentially several lacing values.
segment -> imaginary element serving no real purpose
A segment is the portion of a packet that appears on a given page.
In most cases a segment and packet are the same thing. When packets
span pages, a packet consists of more than one segment, each one on a
separate page.
granule position -> timestamp
A granule position is not a timestamp. It is a synthetic value
that encodes DTS, PTS and reference distance. This difference is
central to multiple Ogg mechanisms.
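To make the distinction concrete, consider Theora's mapping: the
granule position packs the last keyframe's number above a split point
(the 'granule shift', declared in the stream's setup header) and the
frame count since that keyframe below it. A sketch:

```python
# Sketch of a Theora-style granule position: the 64-bit value packs
# the keyframe number and frames-since-keyframe, split at
# 'granule_shift' (declared in the stream's setup header).

def split_granulepos(gp: int, granule_shift: int):
    keyframe = gp >> granule_shift
    delta = gp & ((1 << granule_shift) - 1)
    return keyframe, delta

def granule_to_time(gp: int, granule_shift: int, fps: float) -> float:
    keyframe, delta = split_granulepos(gp, granule_shift)
    return (keyframe + delta) / fps      # absolute frame count / rate

gp = (120 << 6) | 3                      # keyframe 120, 3 frames later
print(split_granulepos(gp, 6))           # (120, 3)
print(granule_to_time(gp, 6, 30.0))
```

Note that this is also why a seek landing mid-stream already tells
the demuxer where the preceding sync point is: the keyframe number is
right there in the granule position.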
We have found the Ogg format to be a dubious choice in just about
every situation. Why then do certain organisations and individuals
persist in promoting it with such ferocity?
At no point is Ogg compared against all containers in any given use
case. Mr. Rullgard performs no methodical compare-and-contrast.
He constructs narrow comparisons to show there is at least
one container that theoretically performs better in a given situation.
These assertions are exaggerated and contradicted by actual testing.
If anything, the discussion shows Ogg to be a good generalist,
occasionally topped in one case or another, but performing acceptably
to very well in the situations offered. It exists, it works, and it's
supported by nice people. That should be easy to understand.
When challenged, three types of reaction are characteristic of the Ogg
proponents.
On occasion, these people will assume an apologetic tone, explaining
how Ogg was only ever designed for simple audio-only streams (ignoring
"These people"? Ahem. Staying on point:
A few ffmpeg and Matroska developers do claim that Ogg was designed
only for Vorbis but this
isn't true. I designed Ogg for any codec type from the beginning,
including discontinuous-time codecs like subtitles and overlays.
There had been earlier containers (from 1993-1998) used in the Ogg
project that were codec-specific and were not named 'Ogg'. They were
framings built into the various early codecs Xiph had worked on at
that time, eg 'Squish' and 'Stormbringer', just like mp3's framing
system is built into and used only in mp3.
The modern Ogg container design dates from approximately 1998, the
earliest Xiph.Org CVS entries are from 1999, and
formal documentation happened in 2000/2001
during the early Vorbis releases. At that time, Xiph was working on
two codecs, Vorbis and Tarkin. Most people don't
know about Tarkin; it was a research video codec just like Vorbis was
a research audio codec. Unlike Vorbis, Tarkin was not a successful
approach. Both Vorbis and Tarkin went into the Ogg container. Tarkin never saw release, and it was some time
until Xiph had another suitable video format to use in Ogg alongside
Vorbis. For many years, most of the world only saw Ogg paired with
Vorbis alone.
The abandoned Tarkin codec can still be found in Xiph.Org SVN. Unfortunately, public archives for the
tarkin-dev list only go back to 2002, at which point nearly all the
activity surrounding Tarkin had already passed.
it is as bad for these as for anything), and this is no doubt
true. Why then, I ask again, do they continue to tout Ogg as the
one-size-fits-all solution they already admitted it is not?
What does this have to do with supposed flaws in the Ogg container?
More commonly, the Ogg proponents will respond with hand-waving
arguments best summarised as Ogg isn't bad, it's just different. My
reply to this assertion is twofold:
I have in fact said this. It's also true. Ogg has a number of
arbitrary differences Mr. Rullgard dislikes. Given ample opportunity,
he has not demonstrated in a logical fashion that his objection to
these differences has technical merit, he's only demonstrated that he
doesn't like them for possibly ill-considered reasons.
* Being too different is bad. We live in a world where multimedia
files come in many varieties, and a decent media player will
need to handle the majority of them. Fortunately, most
multimedia file formats share some basic traits, and they can
easily be processed in the same general framework, the specifics
being taken care of at the input stage. A format deviating too
far from the standard model becomes problematic.
This point that "too different is bad" can have merit. However, the
conclusion that Ogg is "too different" might carry more weight were
it not asserted by an individual on record as set against Ogg (and
Xiph). Coupled with the fact that several multimedia frameworks
support Ogg without drama, the conclusion is far from proven.
* Ogg is bad. When every angle of examination reveals serious
flaws, bad is the only fitting description.
Mr. Rullgard is advised to wave his arms harder; a very different
conclusion is still visible to the reader.
The third reaction bypasses all technical analysis: Ogg is
patent-free, a claim I am not qualified to directly discuss. Assuming
it is true, it still does not alter the fact that Ogg is a bad
format. Being free from patents does not magically make Ogg a good
choice as file format. If all the standard formats are indeed covered
by patents, the only proper solution is to design a new, good format
which is not, this time hopefully avoiding the old mistakes.
Mercifully, we're at the end. The three closing thoughts boil down to:
- Different is bad. Ogg is bad.
- Ogg is bad.
- Even if it's patent free, Ogg is bad.
Rather than saying "no it isn't" a third time, I invite the reader
to nip off a bit early.
Comments are back at LiveJournal.
-  http://ffmpeg.org/~mru/hardwarebug.org/2010/03/03/ogg-objections/index.html
-  http://news.slashdot.org/story/10/03/03/1913246/Technical-Objections-To-the-Ogg-Container-Format
-  http://xiph.org/ogg/doc/framing.html#page_header
-  http://www.ietf.org/rfc/rfc3533.txt
Ogg page header definition found in RFC 3533, page 8, section 6
-  http://www.matroska.org/technical/specs/codecid/index.html
Matroska documentation (and developers) use the term 'mapping'
sporadically, but it has the same meaning as in the Ogg context.
The list of Codec IDs at the above page also contains details of
the codec encapsulation identical to Ogg codec mapping.
-  http://en.wikipedia.org/wiki/Audio_Video_Interleave#Continued_use
-  http://wiki.xiph.org/Ogg_Skeleton#Ogg_Skeleton_version_3.0_Format_Specification
- The granule position parameters are declared in the secondary Skeleton header packet
-  http://hardwarebug.org/2008/11/17/ogg-timestamps-explored/
-  http://svn.xiph.org/trunk/oggds/
-  http://svn.xiph.org/trunk/oggdsf/
-  http://wiki.xiph.org/Vorbis_Hardware
-  http://wiki.xiph.org/Theora_Hardware
-  http://www.matroska.org/technical/specs/index.html
In section 'Lacing > Xiph Lacing'
-  http://xiph.org/ogg/doc/oggstream.html
Section 'Simple multiplexing' states:
Ogg multiplexes streams by interleaving pages from multiple elementary
streams into a multiplexed stream in time order. The multiplexed
pages are not altered. Muxing an Ogg AV stream out of separate
audio, video and data streams is akin to shuffling several decks
of cards together into a single deck; the cards themselves remain
unchanged. Demultiplexing is similarly simple (as the cards are ...)
-  http://lwn.net/Articles/377928/
The file offered here by mp4 advocates was the subject of some
overhead debate; it was determined eventually that the correct
overhead for the mp4 file was 0.5% after initial claims that it was
approximately half that. 0.5% is lower but comparable to the
current average Ogg overhead of between 0.6% and 0.7%. Greg Maxwell
also remuxed the Ogg file for comparison purposes to show that 0.5%
was achievable for the Ogg file as well, though it really made no
sense to bother.
-  http://neuron2.net/library/mpeg2/iso13818-1.pdf
ISO/MPEG standard containing MPEG-PS stream specification
-  http://go.microsoft.com/fwlink/?LinkId=89814
Microsoft Advanced Streaming Format specification
-  http://xiph.org/ogg/doc/oggstream.html
-  Confirmed in #matroska, however no related documentation appears to exist.
-  http://wiki.multimedia.cx/index.php?title=NUT#NUT_seeking_algo_in_libnut
-  http://wiki.xiph.org/Ogg_Index
Note that as yet, the proposed Ogg Index spec linked here is still a draft. We have indeed just recently adopted an index.
-  http://svn.xiph.org/branches/lowmem-branch/Tremor/framing.c
-  http://svn.xiph.org/trunk/ogg2/
-  http://www.matroska.org/technical/specs/index.html
Note that the toplevel 'EBML' and 'Segment' elements are both marked
'Multi', meaning that any number may appear in a valid Matroska file.
-  http://www.matroska.org/technical/guides/faq/index.html
This document states, "Ogg was designed to stream audio, specifically
Vorbis. Ogg was not designed to handle video, or any other type of
media."
-  http://ffmpeg.org/~mru/hardwarebug.org/2010/03/03/ogg-objections/
this very document states, "On occasion, these people will assume an
apologetic tone, explaining how Ogg was only ever designed for
simple audio-only streams"
-  https://trac.xiph.org/browser/trunk/vorbis/lib/framing.c?rev=1
Ogg container code was already functional when we set up
the current CVS repository (now SVN) at Xiph.Org; the first Ogg
implementation predates this initial commit. The Ogg container and
everything else was originally in a single monolithic 'vorbis'
module, as can be seen in the first link from 1999. The Tarkin
source module (see  below) also originally included its own
duplicate implementation of the Ogg container copied from the
Vorbis module. Ogg got its own CVS entry when the monolithic
Vorbis module was split up in 2000 (second link).
-  https://trac.xiph.org/log/trunk/ogg/doc?action=follow_copy&mode=follow_copy&rev=16953&stop_rev=1&limit=100&verbose=on
-  http://en.wikipedia.org/wiki/Tarkin_%28codec%29#Ogg_codecs
-  http://svn.xiph.org/trunk/tarkin/bitpack.c
-  http://svn.xiph.org/trunk/tarkin/
'tarkin' was the initial experimental Tarkin codec. 'w3d' was a
second research version that continued Tarkin experimentation.
Neither approach was successful.
-  https://trac.xiph.org/changeset/3170/trunk/w3d
The last change to the Tarkin sourcebase, March 2002.