Monty - In Defense of Ogg's Good Name

In Defense of Ogg's Good Name [Apr. 27th, 2010|12:38 pm]

Mans Rullgard has written two long rants about the Ogg container in the past few years. One[1] made it to Slashdot[2], apparently on the strength of its drama potential alone. If you don't know what I'm talking about below, don't worry about it; tl;dr.

I hadn't originally intended to respond to open trolling, but the continued urging of many individuals has convinced me it's important to rebut in some public form. Earnest falsehoods left unchallenged risk being accepted as fact.

Addressing each set of assertions inline:

  Ogg objections

  The Ogg container format is being promoted by the Xiph Foundation for
  use with its Vorbis and Theora codecs. ..........................

By way of clarification, Ogg is for all codecs. That said, as a United States-based non-profit, the Xiph.Org Foundation charter strongly suggests we restrict ourselves to advocating unencumbered technologies.

  ...................................... Unfortunately, a number of
  technical shortcomings in the format render it ill-suited to most, if
  not all, use cases. This article examines the most severe of these
  flaws.

As I show below, the article fails to demonstrate any such flaws, offering nothing beyond rote assertion and spurious logic.

  Overview of Ogg

  The basic unit in an Ogg stream is the page consisting of a header
  followed by one or more packets from a single elementary stream. A
  page can contain up to 255 packets, and a packet can span any number
  of pages. The following table describes the page header.

  Field                    Size (bits)   Description
  capture_pattern                   32   magic number "OggS"
  version                            8   always zero
  flags                              8
  granule_position                  64   abstract timestamp
  bitstream_serial_number           32   elementary stream number
  page_sequence_number              32   incremented by 1 each page
  checksum                          32   CRC of entire page
  page_segments                      8   length of segment_table
  segment_table               variable   list of packet sizes

  Elementary stream types are identified by looking at the payload of
  the first few pages, which contain any setup data required by the
  decoders. For full details, see the official format specification.

This description of an Ogg page is accurate. The description and fields are easy to verify against the published Ogg spec description[3] at Xiph.Org and the RFCs[4] on the Ogg format.
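Since the fixed page header layout above is simple and fully specified, it is straightforward to parse. The following is a minimal sketch (not production code, and the field names in the returned dict are my own) of reading those fields from raw bytes; all multi-byte values are little-endian per the spec:

```python
import struct

def parse_ogg_page_header(data):
    """Parse the fixed 27-byte Ogg page header described in the table
    above, plus the variable-length segment table that follows it.
    Raises ValueError if the capture pattern does not match."""
    if data[:4] != b"OggS":
        raise ValueError("capture pattern not found")
    # version(8), flags(8), granule_position(64, signed), serial(32),
    # sequence(32), checksum(32), page_segments(8) -- all little-endian
    version, flags, granule, serial, seq, crc, nsegs = struct.unpack_from(
        "<BBqIIIB", data, 4)
    segment_table = list(data[27:27 + nsegs])
    return {
        "version": version,
        "flags": flags,
        "granule_position": granule,
        "serial": serial,
        "sequence": seq,
        "checksum": crc,
        "segment_table": segment_table,
        "header_size": 27 + nsegs,
    }
```

Note that everything a demuxer needs to walk the stream is in this one structure; nothing codec-specific is required to find page and packet boundaries.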

  Generality

  Ogg, legend tells, was designed to be a general-purpose container
  format. ..............................................................

"Legend tells"? Ogg is not a dramatic, unknowable mystery shrouded in the mists of time. I designed it. I'm alive and willing to answer any questions about the format. Allow me this opportunity to reiterate that Ogg was designed as a general purpose container.

  ....... To most multimedia developers, a general-purpose format is one
  in which encoded data of any type can be encapsulated with a minimum
  of effort.

  The Ogg format defined by the specification does not fit this
  description. For every format one wishes to use with Ogg, a complex
  mapping must first be defined. ......................................

Some further elaboration from the horse's mouth: Mapping is a term I coined for the process of formally documenting how a codec will be placed into a container. Every container involves details beyond 'plop raw compressed frames into the container and you're done.' Some details include specifying codec magic (eg, the "FOURCC" in AVI, the 'Magic' in Ogg), choosing an appropriate timebase (or how to convert to the container's timebase), how one indicates keyframes/sync points, how this data is submitted to the container, and so on. Mappings also allow a given codec to take targeted advantage of the features offered by a particular container. One example is mp3 in Matroska, where the mapping specifies that the mp3 header is to be treated as duplicated/compressed data. Mappings need only be specified once and they're done.

By definition, mapping must be done for any codec into any container, even if the mapping is relatively trivial. This is true of MP4/MOV, Matroska, Ogg, NUT, AVI, and every other container. Some containers, like Ogg and Matroska[5], explicitly describe and document mapping, as well as the codec mappings themselves. Other containers document mappings but have no explicit name for it. A few remainders like AVI neither institutionalize the process of mapping, nor reliably document how codec data is contained, leading to an 'anything goes' situation of widespread ambiguity and compatibility conflicts[6].

In short, every container has codec mappings whether they are explicit or implicit or even well-formed. The Ogg project has a name for the process. It is disingenuous to claim that Ogg is inferior to some other container that requires these same decisions, but has no name for the process, or worse, no process at all.

  .............................. This mapping defines how to identify a
  codec, how to extract setup data, and even how timestamps are to be
  interpreted. All this is done differently for every codec. ..

It would be silly to do it over and over if it was the same every time.

  .......................................................... To
  correctly parse an Ogg stream, every such mapping ever defined must be
  known.

This is commonly asserted by detractors, but it is part false and part missing the point.

Ogg transport is based entirely on the page structure primitive, described accurately above. There are no other structures in the container transport itself. Higher level structures are built out of pages, not built into them. All Ogg streams conform to this page structure and all Ogg streams are parseable and demuxable without knowing anything about the codec. "Drop the needle" anywhere in an Ogg stream and start demuxing; you get the codec data out without knowing anything about the codec. You possibly won't know what exactly to do with that data without the codec mapping and the data is possibly useless without the codec anyway, but that's true of every container.

To avoid being accused of sidestepping the issue, I posit that the actual [if unstated] objection is that the Ogg container does not fully specify the granule position in the transport specification. Beyond a few requirements, a codec mapping defines the granule position spec for that codec's streams, not the Ogg spec. In theory, this would mean that without codec knowledge or some other place to find the granule position definition, a decoder missing the codec for a given stream would not be able to determine the timestamp on the stream that it is not capable of decoding anyway. In practice, the granule position mapping does in fact exist in the stream metadata within the Skeleton header[7] (as it would be in Matroska or NUT). Additionally, the Ogg design allows implementations to ignore the pretty design theory and just do things the way other containers do by building granule position calculation into the mux implementation.

There are specific, considered reasons for the granulepos design, which take some space to explain accurately. Because Mr. Rullgard also wrote a lengthy diatribe against Ogg timestamping[8], I'll leave the explanation for there and link to it here when my response to the other article is live.

  Under this premise, a centralised repository of codec mappings would
  seem like a sensible idea, but alas, no such thing exists. It is
  simply impossible to obtain a exhaustive list of defined mappings,
  which makes the task of creating a complete implementation somewhat
  daunting.

The mappings exist; they are simply not all held in one place. As we do not control all the codecs, we have not sought to control all the mappings. It's also not clear that we should hold or promote mappings for encumbered codecs (as per charter).

However, a centralized repository for mappings is an obviously desirable thing. At present, codec mappings are documented in the codec specifications themselves. A page of simple links, which we should have, would address your objection. Thus it is hardly a "severe flaw" in the container.

  One brave soul, Tobias Waldvogel, ..................................

Brave soul? Was Tobias single-handedly staring down a Xiph panzer division as he did this?

  ................................. created a mapping, OGM, capable of
  storing any Microsoft AVI compatible codec data in Ogg files. This
  format saw some use in the wild, but was frowned upon by Xiph, and it
  was eventually displaced by other formats.

OGM used the Ogg page structure (mostly correctly), but wrapped private data for the VfW framework. The result was parseable as an Ogg container, yet contained an ugly Windows-specific hack. We objected because it was not well formed and confused users who thought it was regular Ogg. It was a quick and dirty fork.

For the record, Tobias later joined Xiph along with his DirectShow filters[9] and deprecated OGM. OGM is no longer supported in our DirectShow offerings[10].

  True generality is evidently not to be found with the Ogg format.

The ad-hoc 'evidence' above fails to justify this conclusion.

  A good example of a general-purpose format is Matroska. This container
  can trivially accommodate any codec, all it requires is a unique
  string to identify the codec. ..................................

In summary, mappings are a serious flaw in Ogg, but an advantage in Matroska? In any case, Matroska mappings go into considerably more detail[5] than the single identifying string implied above.

The problem with Matroska mappings is not that they exist, but that they are not nearly detailed enough. This is not a flaw of the Matroska container, merely the documentation, and I am certainly not innocent of inadequate documentation myself. Ogg documentation is just as bad and in places much worse. I assert that the single largest problem in both Ogg and Matroska is the lack of sufficiently detailed, high-quality documentation. Both projects describe what the container is and how it is formatted. Neither project sufficiently documents the proper way to use it.

  ............................. For codecs requiring setup data, a
  standard location for this is provided in the container. ............

The same is true in Ogg. From the Ogg bitstream documentation, the stream starts with:

  • The initial header for each stream appears in sequence, each header on a single page. All initial headers must appear with no intervening data (no auxiliary header pages or packets, no data pages or packets). Order of the initial headers is unspecified. The 'beginning of stream' flag is set on each initial header.
  • All auxiliary headers for all streams must follow. Order is unspecified. The final auxiliary header of each stream must flush its page.
  • Data pages for each stream follow, interleaved in time order.
  ........................................................ Furthermore,
  an official list of codec identifiers is maintained, meaning all
  information required to fully support Matroska files is available from
  one place.

Detailed documentation (or the lack thereof) is vitally important, however it has little to do with the container design itself. Mr. Rullgard claims to establish that Ogg is badly flawed, not that it needs more documentation.

  Matroska also has probably the greatest advantage of all: it is in
  active, wide-spread use. .............................................

Ogg and Matroska share this advantage, though deployment only slightly overlaps. Ogg, Vorbis and Theora are all in silicon and firmware on countless portable devices[11][12]. Matroska has seen penetration into the home DVD player market. Both have nearly universal support in third-party software players.

  ........................ Historically, standards derived from existing
  practice have proven more successful than those created by a design
  committee.

I'm not sure what this is meant to imply-- that I have multiple personalities? I need a t-shirt that says "I AM COMMITTEE".

Ogg wasn't the product of a committee. I designed it. That said, h.264 is the result of possibly the largest committee the world has ever known. I think we all agree it's a great format, even if many of us object to the thousands of patents involved.

Lastly, the critique so far does not mention or enumerate ways in which Ogg breaks with established practice. Ogg is modelled loosely on a simplified MPEG-TS/PS design. All the design elements, including the ones to which Mr. Rullgard objects, appear at some point in other containers.

  Overhead

  When designing a container format, one important consideration is that
  of overhead, i.e. the extra space required in addition to the
  elementary stream data being combined. For any given container, the
  overhead can be divided into a fixed part, independent of the total
  file size, and a variable part growing with increasing file size. The
  fixed overhead is not of much concern, its relative contribution being
  negligible for typical file sizes.

As with the last section, the overhead discussion begins with a few basic facts nearly anyone can agree with.

  The variable overhead in the Ogg format comes from the page headers,
  mostly from the segment_table field. This field uses a most peculiar
  encoding, somewhat reminiscent of Roman numerals. ...............

It is a 'most peculiar' encoding, designed to have near-constant overhead regardless of packet size. It is so ludicrous that Matroska also adopted it[13]. Atamido of #matroska estimated in IRC that approximately half of Matroska streams use the "Xiph lacing" (I assume this is a very round estimate, but it indicates that the Matroska designers do not consider it so peculiar).

  ................................................. In Roman times,
  numbers were written as a sequence of symbols, each representing a
  value, the combined value being the sum of the constituent values.

  The segment_table field lists the sizes of all packets in the
  page. Each value in the list is coded as a number of bytes equal to
  255 followed by a final byte with a smaller value. The packet size is
  simply the sum of all these bytes. Any strictly additive encoding,
  such as this, has the distinct drawback of coded length being linearly
  proportional to the encoded value. A value of 5000, a reasonable
  packet size for video of moderate bitrate, requires no less than 20
  bytes to encode.

Though correct, this does not explore why such an encoding might be desirable.
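For the curious, the lacing encoding is trivial to implement; here's a minimal sketch (function names are mine) of encoding a packet size and splitting a segment table back into packet sizes:

```python
def lace_encode(size):
    """Encode one packet size as an Ogg lacing run: as many 255 bytes
    as needed, then a terminating byte smaller than 255.  A size that
    is an exact multiple of 255 still emits a trailing 0 byte."""
    return bytes([255] * (size // 255) + [size % 255])

def unlace(segment_table):
    """Recover packet sizes from a page's segment table by summing
    runs; a byte < 255 terminates the current packet.  (A table that
    ends mid-run of 255s means the packet continues on the next page.)"""
    sizes, current = [], 0
    for b in segment_table:
        current += b
        if b < 255:
            sizes.append(current)
            current = 0
    return sizes
```

A 5000-byte packet does indeed lace to 20 bytes (nineteen 255s and a terminating 155), exactly as the article computes.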

The issue with a typical variable length encoding that extends off of the leading bits is that you burn at least a full bit of range even in the shortest length encodings. Let's look at the EBML value encoding in Matroska:

bits, big-endian
1xxx xxxx                                                                              - value 0 to  2^7-2
01xx xxxx  xxxx xxxx                                                                   - value 0 to 2^14-2
001x xxxx  xxxx xxxx  xxxx xxxx                                                        - value 0 to 2^21-2
0001 xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx                                             - value 0 to 2^28-2
0000 1xxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx                                  - value 0 to 2^35-2
0000 01xx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx                       - value 0 to 2^42-2
0000 001x  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx            - value 0 to 2^49-2
0000 0001  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx  xxxx xxxx - value 0 to 2^56-2

It's not that one leading bit is expensive; it's that it reduces the maximum value expressible in a one-byte encoding from 254 to 126 (in Matroska, 0xff is reserved, so the maximum value is 2^x-2, not 2^x-1). For example, if a leading bit of '0' signifies 'one byte of length' and a bit of '1' means 'extend to more bytes', then any length greater than 126 bytes requires two bytes of length encoding.
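The length-class boundaries in the EBML table above can be expressed as a tiny helper (my own sketch, not Matroska library code):

```python
def ebml_vint_length(value):
    """Bytes needed to store a value as an EBML variable-length
    integer: n bytes give 7*n usable payload bits, and the all-ones
    pattern is reserved, so one byte covers only 0..126."""
    for nbytes in range(1, 9):
        if value <= (1 << (7 * nbytes)) - 2:
            return nbytes
    raise ValueError("value exceeds the 8-byte EBML vint range")
```

The crossover at 126/127 is the boundary the next paragraph turns out to care about.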

This boundary turns out to be somewhat significant. Most low-rate audio codecs tend to hover right around or just above this value, and even video easily goes this low. So, using the NUT length encoding or the Matroska EBML encoding, you nearly always add an extra byte to each packet's length encoding in low rate streams. When you're coding, eg, 150 byte packets, overhead due to Ogg lacing is 0.67% per packet. Using Matroska EBML or NUT encoding, the length-encoding overhead is 1.3% per packet. This is why Matroska also uses the Xiph encoding.
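The arithmetic behind those percentages is easy to check directly (a sketch; the 2-byte EBML figure follows because 150 is above the 126 one-byte limit):

```python
# One Ogg lacing byte per full 255 bytes of payload, plus the terminator.
def ogg_lacing_bytes(size):
    return size // 255 + 1

packet = 150  # a typical low-rate audio packet, per the text above
ogg_overhead = ogg_lacing_bytes(packet) / packet  # 1 byte  -> ~0.67%
ebml_overhead = 2 / packet                        # 2 bytes -> ~1.33%
```

For packets in this size range the Ogg lacing really does halve the length-encoding cost relative to a leading-bits scheme.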

That said, transOgg [the next rev of Ogg; if the Google VP8 leak is true, we'll have some breathing room soon to start more aggressively developing it] will use a lacing that pivots off of value 252 rather than 255. In this way, we still avoid 'wasting' an entire bit of numerical range when extending, but we avoid the runs of 255 to which Mr. Rullgard objects. And, it's truly the best of both worlds, which means there's no need for multiple optional encodings.

  On top of this we have the 27-byte page header which, although paling
  in comparison to the packet size encoding, is still much larger than
  necessary. Starting at the top of the list:

No comparisons are offered against other formats. So far this all has implied that Ogg's overhead is ludicrously high compared to other containers. That's not the case; Ogg is among the lower-overhead containers, yet guaranteed to be inherently streamable. A streamable structure burns bytes as illustrated below, but Ogg overhead today still hovers at around 0.6-0.7% for high-rate video and can always capture from any point in under 128kB (usually around 4kB in practice). For low rate anything, no other container currently matches its efficiency. This is one reason the length encoding is still the way it is; even with length-encoding runs of 255, it is an insignificant enough thing that no one sane had previously cared.

      * The version field could be disposed of, a single-bit marker
        being adequate to separate this first version from hypothetical
        future versions. One of the unused positions in the flags field
        could be used for this purpose

Disposing of the version field may be a reasonable suggestion (eg, the upcoming transOgg does omit the version field), though no justification or pro/con is explored. The idea of a bit marker is similarly not justified or explained. This indicates that Mr. Rullgard is unaware what the field was actually for.

The version field had originally been intended to allow multiple Ogg page types tuned for different payloads to coexist in the same stream. The Ogg container format froze much earlier than the Ogg codecs did, and as the 2000s wore on it became clear we would only ever use one page version (version zero). Again, the contribution to overhead was negligible and it was left as is rather than break spec and require every adopter to upgrade (a difficult thing when an implementation you paid for is in hardware).

Moving forward, using a versioned capture pattern is perhaps more sensible and this is the transOgg approach. Discussion of versioned capture will be part of the transOgg docs.

      * A 64-bit granule_position is completely overkill. 32 bits would
        be more than enough for the vast majority of use cases. In
        extreme cases, a one-bit flag could be used to signal an
        extended timestamp field.

Presupposing that the granule position is intended only to be a timestamp (which is not the case), 64 bits is hardly overkill as practical use has demonstrated regularly. Similarly, using 64 bits rather than 32 eliminates a conditionally triggered mechanism. Though variable length and optional fields are not evil, there's no reason to use them indiscriminately either. At some point, every unnecessary mechanism just contributes to bug count.

However, the granule position is not simply a timestamp. It is a synthetic value that encodes DTS, PTS and distance to first-needed reference. The suggestion that it should be reduced from 64 bits ignores a substantial portion of the Ogg design. Muxing, seeking, and verification are all designed on top of the granule position construct. Completely missing this design aspect demonstrates incomplete understanding.

For comparison purposes, Matroska Cluster timecodes are explicitly declared in EBML. To signal the presence of a timecode, two additional bytes must be used. In this manner, Matroska timecodes are six bytes even when storing only 32-bit values; to store a 64-bit timecode, 80 bits must be used. I do not consider Matroska's encoding overhead to be unreasonable.

      * 32-bit elementary stream number? Are they anticipating files
        with four billion elementary streams? An eight-bit field, if not
        smaller, would seem more appropriate here.

The stream ID is intended to be used like a weak hash. If stream ID numbers collide in a muxing or concatenation operation, altering the stream ID number requires renumbering every page in the stream (this would be the case in any other container as well), and also requires the checksum on every page be recomputed. Having a large pseudo-random ID space makes such collisions vanishingly unlikely, eliminating the need for continuous recalculation of page headers at every muxing step.

Recall that the Ogg design treats the pages of a stream like a deck of cards[14]; one multiplexes two streams by shuffling two decks together with no other changes, making muxing and demuxing a nearly trivial operation that can be performed on-the-fly with nearly zero CPU on live streaming servers. The large stream ID is part of this design.
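The value of a large pseudo-random ID space is just the birthday problem in reverse; a quick estimate (my own sketch, standard approximation, not anything from the Ogg code) shows why 32 bits works where 8 bits would not:

```python
import math

def collision_probability(n_streams, id_bits=32):
    """Standard birthday-problem estimate of any two independently
    chosen random stream serial numbers colliding when streams are
    shuffled together."""
    pairs = n_streams * (n_streams - 1) / 2
    return 1.0 - math.exp(-pairs / 2 ** id_bits)
```

With 32-bit serials, even muxing a thousand streams gives a collision chance around one in ten thousand; with the suggested 8-bit field, a mere twenty streams would collide more often than not, forcing exactly the renumber-and-rechecksum pass the design avoids.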

      * The 32-bit page_sequence_number is inexplicable. The intent is
        to allow detection of page loss due to transmission errors. ISO
        MPEG-TS uses a 4-bit counter per 188-byte packet for this
        purpose, and that format is used where packet loss actually
        happens, unlike any use of Ogg to date.

A 32 bit sequence allows direct UDP unicast/multicast with Ogg handling reordering and reassembly. The biggest reason for such a large number is that the Ogg granule position can't be relied upon for sequencing/ordering, especially when using low MTUs in which Ogg packets could span UDP packets (and thus Ogg pages). The 32 bit sequence also allows keyed encryption without continuous rekeying, and prevents moderate-length stream interruptions from causing permanent loss of keying/capture when the sequence number rolls over. The sequence number is also used, as you state, for gap detection in other cases.

In transOgg, we're exploring the use of an extended granule position that replaces the sequence field both for gap detection and UDP ordering. It's not yet clear that this will be enough.

      * A mandatory 32-bit checksum is nothing but a waste of space when
        using a reliable storage/transmission medium. Again, a flag
        could be used to signal the presence of an optional checksum
        field.

The checksum is part of the capture mechanism. I will note that the NUT container (contributed to by Mr. Rullgard) uses a 64 bit capture pattern. Ogg uses a 32 bit capture + 32 bit checksum for a total of 64 bits. The captures have equivalent behavior, but Ogg also gets error detection out of it.

Very occasional corruption does happen both in network transmission and local file storage. I have personally had files corrupt due to decayed spinning media. It is incorrect to claim that it never happens.
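For reference, the page checksum is an ordinary CRC-32, just with fixed parameters chosen by the spec. A minimal bit-at-a-time sketch (slow but illustrative; a real implementation would be table-driven, as libogg's is):

```python
def ogg_page_crc(data):
    """The CRC-32 variant the Ogg spec mandates for pages: generator
    polynomial 0x04c11db7, initial register 0, no bit reflection, no
    final XOR.  The page is checksummed with its own checksum field
    temporarily set to zero."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04c11db7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc
```

Because the CRC covers the whole page including the capture pattern, a false "OggS" sync inside payload data fails the checksum and is rejected, which is how the checksum doubles as part of the capture mechanism.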

  With the changes suggested above, the page header would shrink from 27
  bytes to 12 bytes in size.

It would also gut the container functionality. Assuming 4kB pages (which is approximately what would be used in practice for audio with low-rate video), the loss of functionality gains back 0.35% overhead. For high-rate video (as page size climbs) the 'advantage' from adopting these suggestions eventually drops to 0.02%. This is cutting off your nose to spite your face.
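The savings figures are simple division; a quick sketch (assuming the page sizes named above) shows how little is actually reclaimed:

```python
def header_savings(page_bytes, old_header=27, new_header=12):
    """Fraction of a page reclaimed by shrinking the header from 27
    bytes to the proposed 12."""
    return (old_header - new_header) / page_bytes

small_pages = header_savings(4096)   # ~4 kB pages: roughly 0.37%
large_pages = header_savings(65536)  # as pages grow: roughly 0.02%
```

(The exact percentage for "4 kB pages" depends on the precise page size; the ~0.35% figure above is the same ballpark.)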

For comparison purposes, an Ogg page is the conceptual equivalent of a Matroska Cluster + the SimpleBlocks inside. A minimal Matroska Cluster (containing only a single SimpleBlock, only one frame, no checksum, 64 bit presentation timestamp, no references, no optional/auxiliary fields or features, no data) is 23 bytes. An Ogg page header is always 27 bytes, but it also provides sequencing, CRC, strong capture, gap detection, DTS, and codec delay.

The actual overheads seen depend on the relative size (and thus header frequency) of Matroska Blocks and Clusters compared to Ogg pages, so header size alone means little.

  We thus see that in an Ogg file, the packet size fields alone
  contribute an overhead of 1/255 or approximately 0.4%. This is a hard
  lower bound on the overhead, not attainable even in theory. In reality
  the overhead tends to be closer to 1%.

Basically correct except that the practical overhead of Ogg files, using libogg 1.2 as the muxer, is typically 0.6%-0.7% across the board.

  Contrast this with the ISO MP4 file format, which can easily achieve
  an overhead of less than 0.05% with a 1 Mbps elementary stream.

MP4 overhead climbs to Ogg levels[15] when an MP4 file is remuxed such that it can be streamed (played via progressive download), otherwise the file must be downloaded completely before playback can begin. It is also not possible to stream live in MP4 at all; the bitstream format simply does not have the feature. Corrupt the index on an ultra-low overhead MP4 muxing, and you stand to lose the whole file. In summary, such tight muxing has significant practical tradeoffs.

It is also odd to compare Ogg to a file format that is missing features that render it unusable in the situations for which Ogg was designed and is currently being used (eg, live streaming). Comparing to Matroska is more reasonable. Matroska can stream, both live and in progressive download, though the file may need to be muxed with streaming in mind. In this situation, Matroska and Ogg overheads are roughly comparable. The 'winner' depends on bitrate and mux latency.

transOgg, which will use the new lacing described above, retains Ogg's 'always streamable' design and currently reduces theoretical minimum overhead to 0.035%. Like with other low-overhead containers, this number is achievable but probably not in any truly useful case. When comparing apples to apples, most of the containers in wide use today have similar overhead numbers even when the theoretical minimums vary widely.

  Latency

  In many applications end-to-end latency is an important
  factor. Examples include video conferencing, telephony, live sports
  events, interactive gaming, etc. With the codec layer contributing as
  little as 10 milliseconds of latency, the amount imposed by the
  container becomes an important factor.

It is jarring to complain about high overhead, then immediately demand low-latency performance. The same container typically is not used in both low-overhead and low-latency applications as overhead and latency are a nearly direct tradeoff. Low latency containers (such as MPEG-TS, or if you think about it as a container, RTP) are all fantastically high overhead. It is not absurd for an RTP stream, for example, to exceed overhead figures of 25%. It is inescapable.

Ogg is not optimal for low and ultra-low latency applications, though it can still be used effectively just as can any of the other low-overhead containers (except MP4; it can't stream live at all). The overhead figures will be relatively high for all of the containers, and Ogg is no exception though it will not have the highest overhead. Despite this, Ogg is the only container discussed, as if to imply this 'problem' is unique to Ogg.

  Latency in an Ogg-based system is introduced at both the sender and
  the receiver. Since the page header depends on the entire contents of
  the page (packet sizes and checksum), a full page of packets must be
  buffered by the sender before a single bit can be transmitted. ....

In a low-latency application, it is likely that no container, Ogg included, would be buffering more than a single packet. Thus, pages would be transmitted containing a single packet. As all containers achieve low overhead by bundling packets into shared structures and spreading Page/Cluster/What-have-you overhead across all the packets in the unit, this results in much higher overhead for all containers. Again, Ogg is not an exception.

  .............................................................. This
  sets a lower bound for the sending latency at the duration of a page.

  On the receiving side, playback cannot commence until packets from all
  elementary streams are available. Hence, with two streams (audio and
  video) interleaved at the page level, playback is delayed by at least
  one page duration (two if checksums are verified).

As presented, this makes no sense. How does interleave increase latency except by conflating fixed bandwidth and latency? In addition, checksumming does not double the latency 'with two streams'. Audio and video are wholly independent. Packets are delivered all-at-once.

  Taking both send and receive latencies into account, the minimum
  end-to-end latency for Ogg is thus twice the duration of a page,
  triple if strict checksum verification is required. .................

Again, this appears to make no sense. The latency is exactly equal to encoder latency + decoder latency + physical duration of a single packet (how long it took to capture) + transmission latency. Checksum has nothing at all to do with it.

  ................................................... If page durations
  are variable, the maximum value must be used in order to avoid buffer
  underflows.

  Minimum latency is clearly achieved by minimising the page duration,
  which in turn implies sending only one packet per page. This is where
  the size of the page header becomes important. The header for a
  single-packet page is 27 + packet_size/255 bytes in size. For a 1 Mbps
  video stream at 25 fps this gives an overhead of approximately
  1%. With a typical audio packet size of 400 bytes, the overhead
  becomes a staggering 7%. The average overhead for a multiplex of these
  two streams is 1.4%.

These 'staggering' figures are representative of other containers as well. As mentioned earlier, it's not unusual for RTP stream headers to make up 25% of the data transmitted, though it would be lower in this example. 1.4% overhead for single-packet latencies is ~ nothing, especially when using Ogg in its worst possible case.
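The single-packet-per-page figures quoted above can be reproduced in a few lines (a sketch; the blended 1.4% average additionally depends on the audio bitrate, which the article does not state):

```python
def single_packet_page_overhead(packet_bytes):
    """Relative overhead of a one-packet Ogg page: the fixed 27-byte
    header plus the lacing bytes for the single packet."""
    lacing = packet_bytes // 255 + 1
    return (27 + lacing) / packet_bytes

video = single_packet_page_overhead(5000)  # 1 Mbps at 25 fps -> ~0.9%
audio = single_packet_page_overhead(400)   # ~7.25%, the "staggering 7%"
```

Note that the same one-packet-per-unit exercise inflates every container's overhead, since all of them amortize their Page/Cluster headers across multiple packets in normal use.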

  As it stands, the Ogg format is clearly not a good choice for a
  low-latency application. The key to low latency is small packets and
  fine-grained interleaving of streams, and although Ogg can provide
  both of these, by sending a single packet per page, the price in
  overhead is simply too high.

"Clearly" based on what criteria? Typical Ogg overhead in an audio/video stream is ~ 0.6-0.7%. With older muxers that figure was closer to 1.1%. In the supposedly pathological scenario outlined above, chosen to prove the latency point and make Ogg look bad, the figure balloons to a portly 1.4%.

MPEG-TS, the container used to store audio and video on Blu-Ray, starts out at 2.1% overhead [unachievable theoretical minimum] and climbs steeply from there. If 1.4% is simply too high a price, I can only imagine what a complete technical failure Blu-Ray must be.
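The 2.1% floor follows directly from MPEG-TS framing: fixed 188-byte transport packets, each with a 4-byte header, counted before any PES headers, adaptation fields, tables, or padding:

```python
# MPEG-TS transport packets are a fixed 188 bytes with a 4-byte header;
# 4/188 is therefore an unreachable lower bound on container overhead.
ts_floor = 4 / 188
print(f"{ts_floor:.1%}")  # prints 2.1%
```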

  2. ISO MPEG-PS has an overhead of 9 bytes on most packets (a 5-byte
  3. timestamp is added a few times per second), and Microsoft's ASF has a
  4. 12-byte packet header. My suggestions for compacting the Ogg page
  5. header would bring it in line with these formats.

What happened to using MP4 and Matroska for comparison? Possibly they're not mentioned because MP4 cannot perform low-latency streaming at all (actually impossible in the format) and Matroska's numbers are similar to Ogg. Since the goal is to make Ogg look bad, we're now comparing against MPEG-PS and ASF, which are offered for comparison nowhere else in the article.

The byte-overheads offered for MPEG-PS and ASF are difficult to verify against their specifications[16][17], as there are many conditional/optional fields in both formats depending on intended use and the codecs to be contained. ASF does not have a '12-byte header'; the header is variable depending on the stream options and codecs in use. MPEG-PS in particular defines pages upon pages of customizations for each use case and codec/stream type it contains.

So, let's measure an MPEG-PS file in its most common habitat: the DVD. This is local-storage and not a low-latency scenario, so it allows a more efficient encoding with far lower overhead than the low-latency single-frame case. On the first three commercial DVDs I've checked, the MPEG-PS overhead is over 2% despite the fact that the codecs are providing their own framing, something the Ogg container is wholly responsible for in the Ogg case. In other words, MPEG-PS is performing far worse than Ogg in an easier case. 9 supposed bytes of overhead on most packets isn't telling anywhere near the whole story.

Next, let's look at ASF. Remuxing each DVD (audio and video) into ASF format produces an ASF file with approximately 1.5% overhead. Checking several professionally produced 1Mbps ASF files (as opposed to trying to mux them myself using ffmpeg) yields a figure between 0.7% and 0.8% overhead, just a little higher than an Ogg also muxed for local playback.

The lesson here is that the Ogg high-overhead outcry is a complete red herring.

  2. Random access
  4. Any general-purpose container format needs to allow random access for
  5. direct seeking to any given position in the file. Despite this goal
  6. being explicitly mentioned in the Ogg specification, the format only
  7. allows the most crude of random access methods.

The primary random access method used in Ogg is an interpolated bisection search[18], the same as used in Matroska[19] and NUT[20].

  2. While many container formats include an index allowing a time to be
  3. directly translated into an offset into the file, Ogg has nothing of
  4. this kind, ...........................................................

There is no index specified as part of the container low-level transport mechanism, as Ogg abstracts transport and metadata into two layers. The index is part of the stream metadata and strictly optional in all cases, as the index only noticeably improves seek performance in narrow interactive cases, such as HTTP range requests over a satellite or WWAN link.

  1. .......... the stated rationale for the omission being that this would
  2. require a two-pass multiplexing, the second pass creating the
  3. index. This is obviously not true; the index could simply be written
  4. at the end of the file. Those objecting that this index would be
  5. unavailable in a streaming scenario are forgetting that seeking is
  6. impossible there regardless.

It is absolutely true that I resisted having an index of any sort in Ogg. Front- or end-positioning the index is a secondary concern, and borne more of the fact that there's a non-public argument in the background between Xiph and other groups unwilling to support an end-positioned index. Putting it at the beginning breaks the one-pass design stance.

That aside, my primary reasons for resisting an index are more indirect and pragmatic:

  • An index is only marginally useful in Ogg for the complexity added; it adds no new functionality and seldom improves performance noticeably. Why add extra complexity if it gets you nothing?
  • 'Optional' indexes encourage lazy implementations that can seek only when indexes are present, or that implement indexless seeking only by building an internal index after reading the entire file beginning to end.

Matroska, for example, supports indexless seeking using the same basic algorithm/mechanisms as Ogg, and has also always embraced an optional index. Although indexless seeking in Matroska is mandatory and the index optional, more Matroska implementations appear to support the index than the mandatory indexless method. When designing Ogg, which predates Matroska, I worried that exactly this would be the outcome of specifying an optional index, and so avoided one. The Matroska result suggests I might have been right. Unfortunately, there are some new use cases that finally make an index needed[21].

  2. The method for seeking suggested by the Ogg documentation is to
  3. perform a binary search on the file, after each file-level seek
  4. operation scanning for a page header, extracting the timestamp, and
  5. comparing it to the desired position. ..........................

A binary search is discussed in the spec for ease of comprehension; implementation documents suggest an interpolated bisection search. So far, this is the same as Matroska and NUT.
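A sketch of that interpolated bisection, assuming a hypothetical `time_at(offset)` helper that scans forward from a byte offset to the next page header and returns its timestamp:

```python
def interpolated_seek(time_at, lo, hi, lo_t, hi_t, target):
    """Narrow [lo, hi) to the byte offset of the last probed page whose
    timestamp is <= target. The interpolation assumes a locally constant
    bitrate; clamping keeps it from degenerating at the interval edges."""
    while hi - lo > 1:
        frac = (target - lo_t) / (hi_t - lo_t) if hi_t > lo_t else 0.5
        frac = min(max(frac, 0.05), 0.95)  # fall back toward plain bisection
        mid = min(max(lo + int(frac * (hi - lo)), lo + 1), hi - 1)
        t = time_at(mid)
        if t <= target:
            lo, lo_t = mid, t
        else:
            hi, hi_t = mid, t
    return lo
```

On anything resembling constant-bitrate data the interpolation lands almost immediately; the clamp bounds the worst case at ordinary bisection.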

  1. ..................................... When the elementary stream
  2. encoding allows only certain packets as random access points (video
  3. key frames), a second search will have to be performed to locate the
  4. entry point closest to the desired time. ..........................

By way of clarification, in the event that the result of the first search does not land at a sync point, that first result does contain the location of the sync point. Typically only one additional seek is required to find it. This differs from Matroska in that the distance to the preceding syncpoint in Matroska is not declared [is there an undocumented declaration? Or is it just assumed that Matroska clusters should always be big enough to hold a keyframe? Documentation needed!]

  1. ........................................ In a large file (sizes
  2. upwards of 10 GB are common), 50 seeks might be required to find the
  3. correct position.

Demonstrably false. All you need to do is add a line that prints 'seek!' to any popular player software and perform some scrubbing/searching to see that '50 seeks might be required' is between 45 and 49 seeks too high, and that's for exact positioning, not scrubbing.

The Vorbis source distribution includes an example program called 'seeking_example' that does a stress-test of 5000 seeks of different kinds within an Ogg file. Testing here with SVN r17178, 5000 seeks within a 10GB Ogg file constructed by concatenating 22 short Ogg videos of varying bitrates together results in 17459 actual seek system calls. This yields a result of just under 3.5 real seeks per Ogg seek request when doing exact positioning within an Ogg file. Most actual seeking within an Ogg file would be more appropriately implemented by scrubbing with a single physical seek. This is the way mplayer seeks in Ogg, or the way seeking is often done on a DVD.

  2. A typical hard drive has an average seek time of roughly 10 ms, giving
  3. a total time for the seek operation of around 500 ms, an annoyingly
  4. long time. On a slow medium, such as an optical disc or files served
  5. over a network, the times are orders of magnitude longer.

Leaving aside for a moment that this entire argument so far has been refuted by measuring actual performance, latencies are longer still over WWAN, satellite, or seeking over HTTP range requests. Yet this seek system still works acceptably even in these ultra-high latency cases. The actual number of physical seeks is low, far lower than the unverified hand-wave guesstimation of 50. That said, when latency gets this high, an index finally becomes obviously useful enough to be worthwhile. It's the HTTP-over-satellite case that finally convinced me that an index is a legitimate need.

  2. A factor further complicating the seeking process is the possibility
  3. of header emulation within the elementary stream data. To safeguard
  4. against this, one has to read the entire page and verify the
  5. checksum. If the storage medium cannot provide data much faster than
  6. during normal playback, this provides yet another substantial delay
  7. towards finishing the seeking operation. This too applies to both
  8. network delivery and optical discs.

This ignores the fact that on all modern media, latency is almost entirely in the seek. A seek plus a small read (a few bytes to a few kB) is no faster than a seek plus a big read (a few kBytes to a few pages). This is true even of HTTP requests. Mr. Rullgard's argument is convincing only until one realizes that the complaint is not supported by actual measurement.

  2. Although optical disc usage is perhaps in decline today, one should
  3. bear in mind that the Ogg format was designed at a time when CDs and
  4. DVDs were rapidly gaining ground, and network-based storage is most
  5. certainly on the rise.

This is a bit random. I'm not sure what it's trying to say... "Ogg used to be awful" but "Ogg is only kind of awful right now" but "Ogg will become really awful again, so watch out."

  2. The final nail in the coffin of seeking is the codec-dependent
  3. timestamp format. At each step in the seeking process, the timestamp
  4. parsing specified by the codec mapping corresponding the current page
  5. must be invoked. If the mapping is not known, the best one can do is
  6. skip pages until one with a known mapping is found. This delays the
  7. seeking and complicates the implementation, both bad things.

This conclusion does not stand up.

If one chooses to ignore the granule position mapping specified in the header and calculate only using a software codec mapping (this is indeed the original design as I suggested it), it is true that a missing codec renders that logical stream undecodable, and that the timestamping for just that logical stream is lost as well. This does not break the seeking in any way, but it does mean that one can't make timing decisions based on the undecodable pages.

It turns out that this affects measured timing almost unnoticeably when, for example, the primary audio or video codec is entirely missing. Again, this is because when latencies are high, the latency is in the seek, not the read. When a non-primary codec is missing (e.g., a subtitle codec), the timing difference can't be measured.

Not to mention, a missing primary codec is not the typical mode of operation. Most users don't continue to watch a DVD if the video is missing.

  2. Timestamps
  4. A problem old as multimedia itself is that of synchronising multiple
  5. elementary streams (e.g. audio and video) during playback; badly
  6. synchronised A/V is highly unpleasant to view. By the time Ogg was
  7. invented, solutions to this problem were long since explored and
  8. well-understood. The key to proper synchronisation lies in tagging
  9. elementary stream packets with timestamps, packets carrying the same
  10. timestamp intended for simultaneous presentation. The concept is as
  11. simple as it seems, so it is astonishing to see the amount of
  12. complexity with which the Ogg designers managed to imbue it. So
  13. bizarre is it, that I have devoted an entire article to the topic, and
  14. will not cover it further here. 

As such, I will address that writing later as well. The results will be similar to the wholesale dismantling of the present article.

The summary, though, is that Ogg encapsulates in DTS order, and encodes PTS, DTS and reference distance. NUT encapsulates in DTS order and encodes PTS and DTS. Matroska encapsulates in DTS order and encodes only PTS. Everything further is implementation details. It's not particularly complicated, but we'll get to that particular set of Mr. Rullgard's objections later.

  2. Complexity
  4. Video and audio decoding are time-consuming tasks, so containers
  5. should be designed to minimise extra processing required. With the
  6. data volumes involved, even an act as simple as copying a packet of
  7. compressed data can have a significant impact. Once again, however,
  8. Ogg lets us down. Despite the brevity of the specification, the format
  9. is remarkably complicated to parse properly.

I will suggest that those who are willing to grant without scrutiny the assertion that Ogg is "remarkably complicated to parse properly" go take a look at the published specifications for a few other containers:

Note that Mr. Rullgard contributes to (contributed to?) the NUT design.

  2. The unusual and inefficient encoding of the packet sizes limits the
  3. page size to somewhat less than 64 kB. .........................

This is backwards. The limited page size allows the specific encoding, not the other way around. Even in transOgg, which uses a different encoding that could trivially allow much larger pages, the page size is still limited to approximately 64kB. The limited size is arbitrary and intentional in order to deliver on capture guarantees.

  1. ...................................... To still allow individual
  2. packets larger than this limit, it was decided to allow packets
  3. spanning multiple pages, a decision with unfortunate implications. .

Again, the cause and effect is backwards. Packets don't span pages because of limitations of encoding, they span pages so that there's guaranteed structure in the stream that doesn't require an unbounded search to detect.

Page spanning becomes necessary when any single stream in the multiplex reaches relatively high bitrates. In order to ground this particular point in some actual numbers (using the libogg 1.2 muxer as a reference), 30fps video packets would begin spanning at about 15Mbps.
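The quoted threshold can be sanity-checked from the page-size cap alone; a sketch, assuming one packet per video frame and average-sized frames:

```python
MAX_PAGE_BODY = 255 * 255  # at most 255 lacing values of at most 255 bytes each

def spanning_bitrate(fps):
    """Bitrate (bits/s) above which an average-sized frame no longer
    fits in a single page, forcing packets to span pages."""
    return MAX_PAGE_BODY * 8 * fps
```

65025 × 8 × 30 ≈ 15.6 Mbps, consistent with the "about 15 Mbps" figure.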

  1. .................................................................. A
  2. page-spanning packet as it arrives in the Ogg stream will be
  3. discontiguous in memory, a situation most decoders are unable to
  4. handle, and reassembly, i.e. copying, is required.

No. An implementation can choose between, at the very least, iovecs or a contiguous buffer assembled via an extra copy. A copy is not necessary and iovecs are not exotic. Zero-copy implementations of Ogg can be seen in Tremor[22] and the internally-used libogg2[23].

  2. The knowledgeable reader may at this point remark that the MPEG-TS
  3. format also splits packets into pieces requiring reassembly before
  4. decoding. There is, however, a significant difference there. MPEG-TS
  5. was designed for hardware demultiplexing feeding directly into
  6. hardware decoders. In such an implementation the fragmentation is not
  7. a problem. Rather, the fine-grained interleaving is a feature allowing
  8. smaller on-chip buffers.

Why is the MPEG-TS rationale granted as reasonable and the Ogg rationale excluded without any explanation?

  2. Buffering is also an area in which Ogg suffers. To keep the overhead
  3. down, pages must be made as large as practically possible, and page
  4. size translates directly into demultiplexer buffer size. Playback of a
  5. file with two elementary streams thus requires 128 kB of buffer
  6. space. .............................................................

This proposes poor muxing behavior. It is not necessary to make pages "as large as practically possible" to "keep the overhead down". Overhead is reduced only slightly by moving from a sensible muxing behavior to the proposed absurdity above. The only time page sizes should approach the maximum is when the compressed frames are themselves approaching the maximum, implying high-bitrate streams. In this case, the amount of working memory required for decode typically eclipses buffering.

This is no different from Matroska or MPEG or any other container. A muxing strategy trades off latency and buffering against overhead in all containers.

  1. ...... On a modern PC this is perhaps nothing to be concerned about,
  2. but in a small embedded system, e.g. a portable media player, it can
  3. be relevant.

It can indeed be relevant, and any competent engineer has numerous tools and techniques at his fingertips to implement a solution, as would be required for the other containers as well. Nothing here is specific to Ogg.

  2. In addition to the above, a number of other issues, some of them
  3. minor, others more severe, make Ogg processing a painful experience. A
  4. selection follows:
  6.     * 32-bit random elementary stream identifiers mean a simple
  7.       table-lookup cannot be used. Instead the list of streams must be
  8.       searched for a match. While trivial to do in software, it is
  9.       still annoying, and a hardware demultiplexer would be
  10.       significantly more complicated than with a smaller identifier.

Mr. Rullgard objects to a feature that exists for a stated reason, not because he thinks the reason is invalid, but because he finds the feature annoying. I doubt any changes made to Ogg, no matter how extensive, could avoid that fate.

  2.     * Semantically ambiguous streams are possible. For example, the
  3.       continuation flag (bit 1) may conflict with continuation (or
  4.       lack thereof) implied by the segment table on the preceding
  5.       page. Such invalid files have been spotted in the wild.

It is possible to generate invalid Ogg streams, just like it is possible to generate invalid examples of every other container.

  2.     * Concatenating independent Ogg streams forms a valid
  3.       stream. While finding a use case for this strange feature is
  4.       difficult, an implementation must of course be prepared to
  5.       encounter such streams. Detecting and dealing with these adds
  6.       pointless complexity.

Concatenating streams together into new valid streams is also a feature of Matroska[24], to which Mr. Rullgard earlier refers as a good general purpose format.

There's actually plenty to say about chained (concatenated) streams, how best to spec and implement them, and whether they are in fact worth the complexity. However, nothing insightful is added to that discussion here, merely the naked opinion that it is 'pointless'.

  2.     * Unusual terminology: inventing new terms for well-known concepts
  3.       is confusing for the developer trying to understand the format
  4.       in relation to others. A few examples:
  6.       Ogg name                  Usual name

When Xiph started out in the early nineties, MPEG was hardly dominant. To complain today that we did not internally adopt MPEG terminology nearly 20 years ago is looking back with 20/20 hindsight. Had RealNetworks remained the 600-lb gorilla they were 10 years ago, would the complaint instead be that we aren't using Real's terminology?

  1.       logical bitstream         elementary stream

They don't mean the same thing, as it's ambiguous in MPEG usage whether an elementary stream is framed or unframed. In Ogg usage, the 'logical bitstream' refers to unframed data belonging to a given codec in an elementary or multiplexed stream. An 'elementary stream' is a framed stream containing one 'logical bitstream'.

  1.       grouping                  multiplexing

This usage was redacted and replaced with multiplexing.

  1.       lacing value              packet size (approximately)

These are not the same thing. The packet size is the combination of potentially several lacing values.

  1.       segment                   imaginary element serving no real purpose

A segment is the portion of a packet that appears on a given page. In most cases a segment and packet are the same thing. When packets span pages, a packet consists of more than one segment, each one on a separate page.
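The lacing/segment/packet relationship is mechanical, and a short sketch makes the distinction concrete:

```python
def lacing_values(n):
    """Encode one packet's size as lacing values: floor(n/255) values
    of 255 followed by a terminating value of n % 255."""
    return [255] * (n // 255) + [n % 255]

def packet_sizes(segment_table):
    """Recover packet sizes from a page's segment table. Each lacing
    value describes one segment; a value < 255 ends a packet, and a
    final value of 255 means the packet continues on the next page."""
    sizes, current = [], 0
    for v in segment_table:
        current += v
        if v < 255:
            sizes.append(current)
            current = 0
    return sizes
```

A 400-byte packet laces as [255, 145]; a 510-byte packet as [255, 255, 0]. Since a page holds at most 255 lacing values, per-page packet data tops out at 255 × 255 = 65025 bytes, the ~64 kB limit discussed above.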

  1.       granule position          timestamp

A granule position is not a timestamp. It is a synthetic value that encodes DTS, PTS and reference distance. This difference is central to multiple Ogg mechanisms.
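Theora's mapping illustrates why a granule position is not a plain timestamp. A sketch: the shift is a per-stream parameter carried in the stream header; the value 6 below is purely an example.

```python
GRANULE_SHIFT = 6  # example value; the real shift comes from the stream header

def granulepos(keyframe_no, frames_since_key):
    # Upper bits: frame number of the most recent keyframe.
    # Lower bits: frames decoded since that keyframe -- this is the
    # reference distance a seek needs in order to find a sync point.
    return (keyframe_no << GRANULE_SHIFT) | frames_since_key

def split(gp):
    key = gp >> GRANULE_SHIFT
    delta = gp & ((1 << GRANULE_SHIFT) - 1)
    return key, delta  # presentation frame number is key + delta
```

A single value thus answers "what time is it?", "in what order was this decoded?", and "how far back is the nearest sync point?" at once.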

  2. Final words
  4. We have found the Ogg format to be a dubious choice in just about
  5. every situation. Why then do certain organisations and individuals
  6. persist in promoting it with such ferocity?

At no point is Ogg compared against all containers in any given use case. Mr. Rullgard performs no methodical compare-and-contrast. He constructs narrow comparisons to show there is at least one container that theoretically performs better in a given situation. These assertions are exaggerated and contradicted by actual testing.

If anything, the discussion shows Ogg to be a good generalist, occasionally topped in one case or another, but performing acceptably to very well in the situations offered. It exists, it works, and it's supported by nice people. That should be easy to understand.

  2. When challenged, three types of reaction are characteristic of the Ogg
  3. campaigners.
  5. On occasion, these people will assume an apologetic tone, explaining
  6. how Ogg was only ever designed for simple audio-only streams (ignoring

"These people"? Ahem. Staying on point:

A few ffmpeg and Matroska developers do claim that Ogg was designed only for Vorbis[25][26] but this isn't true. I designed Ogg for any codec type from the beginning, including discontinuous-time codecs like subtitles and overlays.

There had been earlier containers (from 1993-1998) used in the Ogg project that were codec-specific and were not named 'Ogg'. They were framings built into the various early codecs Xiph had worked on at that time, e.g. 'Squish' and 'Stormbringer', just as mp3's framing system is built into and used only in mp3.

The modern Ogg container design dates from approximately 1998, the earliest Xiph.Org CVS entries are from 1999[27], and formal documentation happened in 2000/2001[28] during the early Vorbis releases. At that time, Xiph was working on two codecs, Vorbis and Tarkin[29]. Most people don't know about Tarkin; it was a research video codec just like Vorbis was a research audio codec. Unlike Vorbis, Tarkin was not a successful approach. Both Vorbis and Tarkin went into the Ogg container[30]. Tarkin never saw release, and it was some time until Xiph had another suitable video format to use in Ogg alongside Vorbis. For many years, most of the world only saw Ogg paired with Vorbis.

The abandoned Tarkin codec can still be found in Xiph.Org SVN[31]. Unfortunately, public archives for the tarkin-dev list only go back to 2002, at which point nearly all the activity surrounding Tarkin had already passed[32].

  1. it is as bad for these as for anything), and this is no doubt
  2. true. Why then, I ask again, do they continue to tout Ogg as the
  3. one-size-fits-all solution they already admitted it is not?

What does this have to do with supposed flaws in the Ogg container?

  2. More commonly, the Ogg proponents will respond with hand-waving
  3. arguments best summarised as Ogg isn't bad, it's just different. My
  4. reply to this assertion is twofold:

I have in fact said this. It's also true. Ogg has a number of arbitrary differences Mr. Rullgard dislikes. Given ample opportunity, he has not demonstrated in a logical fashion that his objection to these differences has technical merit; he has only demonstrated that he doesn't like them, for possibly ill-considered reasons.

  2.     * Being too different is bad. We live in a world where multimedia
  3.       files come in many varieties, and a decent media player will
  4.       need to handle the majority of them. Fortunately, most
  5.       multimedia file formats share some basic traits, and they can
  6.       easily be processed in the same general framework, the specifics
  7.       being taken care of at the input stage. A format deviating too
  8.       far from the standard model becomes problematic.

This point that "too different is bad" can have merit. However, the conclusion that Ogg is "too different" might carry more weight if it were not asserted by an individual on record as set against Ogg (and Xiph). Coupled with the fact that several multimedia frameworks support Ogg without drama, the conclusion is far from proven.

  2.     * Ogg is bad. When every angle of examination reveals serious
  3.       flaws, bad is the only fitting description.

Mr. Rullgard is advised to wave his arms harder; a very different conclusion is still visible to the reader.

  2. The third reaction bypasses all technical analysis: Ogg is
  3. patent-free, a claim I am not qualified to directly discuss. Assuming
  4. it is true, it still does not alter the fact that Ogg is a bad
  5. format. Being free from patents does not magically make Ogg a good
  6. choice as file format. If all the standard formats are indeed covered
  7. by patents, the only proper solution is to design a new, good format
  8. which is not, this time hopefully avoiding the old mistakes.

Mercifully, we're at the end of the three closing thoughts which consisted of:

  1. Different is bad. Ogg is bad.
  2. Ogg is bad.
  3. Even if it's patent free, Ogg is bad.

Rather than saying "no it isn't" a third time, I invite the reader to nip off a bit early.

Comments are back at liveJournal.






Ogg page header definition found in RFC 3533, page 8, section 6


Matroska documentation (and developers) use the term 'mapping' sporadically, but it has the same meaning as in the Ogg context. The list of Codec IDs at the above page also contains details of the codec encapsulation identical to Ogg codec mapping.


The granule position parameters are declared in the secondary Skeleton header packet







In section 'Lacing > Xiph Lacing'


Section 'Simple multiplexing' states:

Ogg multiplexes streams by interleaving pages from multiple elementary streams into a multiplexed stream in time order. The multiplexed pages are not altered. Muxing an Ogg AV stream out of separate audio, video and data streams is akin to shuffling several decks of cards together into a single deck; the cards themselves remain unchanged. Demultiplexing is similarly simple (as the cards are marked).

The file offered here by mp4 advocates was the subject of some overhead debate; it was determined eventually that the correct overhead for the mp4 file was 0.5% after initial claims that it was approximately half that. 0.5% is lower but comparable to the current average Ogg overhead of between 0.6% and 0.7%. Greg Maxwell also remuxed the Ogg file for comparison purposes to show that 0.5% was achievable for the Ogg file as well, though it really made no sense to bother.


ISO/MPEG standard containing MPEG-PS stream specification


Microsoft Advanced Streaming Format specification


Section "Seeking"

[19] Confirmed in #matroska, however no related documentation appears to exist.



Note that as yet, the proposed Ogg Index spec linked here is still a draft. We have indeed just recently adopted an index.




Note that the toplevel 'EBML' and 'Segment' elements are both marked 'Multi', meaning that any number may appear in a valid Matroska file.


document states, "Ogg was designed to stream audio, specifically Vorbis. Ogg was not designed to handle video, or any other type of audio."


this very document states, "On occasion, these people will assume an apologetic tone, explaining how Ogg was only ever designed for simple audio-only streams"


Ogg container code was already functional when we set up the current CVS repository (now SVN) at Xiph.Org; the first Ogg implementation predates this initial commit. The Ogg container and everything else was originally in a single monolithic 'vorbis' module, as can be seen in the first link from 1999. The Tarkin source module (see [6] below) also originally included its own duplicate implementation of the Ogg container copied from the Vorbis module. Ogg got its own CVS entry when the monolithic Vorbis module was split up in 2000 (second link).





'tarkin' was the initial experimental Tarkin codec. 'w3d' was a second research version that continued Tarkin experimentation. Neither approach was successful.


The last change to the Tarkin sourcebase, March 2002.