The first Ghost update gave an introduction to and overview of the codec structure under research for Ghost. First and foremost among its structural novelties, Ghost is to be a hybrid time/frequency codec that splits and separately encodes tonal and atonal content. Though not a new idea, using hybrid encoding in a practical general-purpose audio codec is somewhat novel (for several good reasons). Part of the purpose of Ghost is to reevaluate hybrid coding to see if any new discoveries make the hybrid approach more advantageous.
In this update, we look at details of our frequency-domain analysis and representation.
The FFT (Fast Fourier Transform) and related transforms that add smooth overlapping blocks (such as the MDCT) all map a time-domain signal into the frequency domain, that is, into a summed series of a fixed set of sinusoids. These transforms are 'critically sampled', meaning that N time-domain input samples yield N output sinusoidal amplitude values; the number of values in equals the number of values out. They are also (at least) bi-orthogonal; the process is lossless and can be reversed exactly.
The FFT is a remarkable transform and offers such a level of mathematical convenience and power that it is pressed into service as a way of optimizing or approximating countless other transforms. It is a big enough hammer that at some point, every signal processing problem starts to look like just another nail (or one of your thumbs).
Because the audio we hear is, in a sense, a series of summed sinusoids, the FFT (or more commonly today, the closely related MDCT) also seems ideal as a transform space for near-ideal representation of audio signals, usually offering substantial energy compaction over the time domain. The frequency domain is close enough to how audio 'really works' that it has been the default coding domain for audio for over twenty years. Though very close, it is not a perfect fit.
Fixed-basis discrete-time Fourier transforms like the FFT and MDCT have some limitations when used for audio. Not all sounds are composed of a small or neat series of sinusoids; many are not periodic at all. Time->Frequency transforms are not 'causal'; quantization and coding error in the frequency domain spreads both backward and forward in time, causing pre- and post-echo artifacts. Any sounds that are not a fairly simple sum of long-term steady-state sinusoids become a wideband spray of energy in the frequency domain, resulting in poor energy compaction and thus low coding efficiency.
Even neat, perfect sinusoids can run into representational trouble in the frequency domain. Sinusoids that are modulated in amplitude and/or frequency, or sinusoids that fall between bins (do not match one of the basis frequencies exactly) also become spread out across the entire frequency spectrum. Most of this energy spray is low energy, but collected in its entirety it is not negligible. Discarding this 'long tail' of leaked frequency-domain energy is responsible for the artifacts we hear in transform codecs.
What can we gain by relaxing the constraints imposed by a classic fixed-basis transform like the FFT and MDCT?
The first obvious constraint we can relax is the fixed, evenly spaced frequency basis, allowing representation of any frequency, not just sinusoids spaced a fixed distance apart. In short, we turn the FFT's fixed set of evenly spaced basis sinusoids into an arbitrary collection of sinusoids whose frequencies, amplitudes, and phases are all free parameters.
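As a rough sketch (the notation here is our own), the fixed-basis synthesis of an N-sample real block is a sum over the bin frequencies,

$$x[n] \;=\; \sum_{k=0}^{N/2} A_k \cos\!\left(\frac{2\pi k n}{N} + \phi_k\right),$$

while the relaxed model allows any number of sinusoids at arbitrary real frequencies,

$$x[n] \;\approx\; \sum_{j} A_j \cos\!\left(\frac{2\pi f_j n}{N} + \phi_j\right), \qquad f_j \in \mathbb{R}.$$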
This obviously loses the critical sampling properties of the FFT in that a fixed number of input samples can turn into any number of output sinusoids. We also (in practice) lose bi-orthogonality, but we're not interested in exactly representing the time signal; we merely want a better fit to specific sinusoidal features of our choosing.
The generalization also eliminates the possibility of a directly calculated closed-form solution. We must use a general purpose linear system solver; such a solver calculates an approximation to a solution, inspects the error, and then uses the previous approximation and error to construct another, better approximation. This process repeats (iterates) until the approximation either achieves the desired level of precision or fails to converge on a stable answer.
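As a concrete illustration of that iterate-inspect-refine loop, here is a small sketch that fits a single unconstrained sinusoid to a block of samples. It is a simplified stand-in using SciPy's general nonlinear least-squares routine rather than Ghost's actual estimator; the function and variable names are our own.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_sinusoid(x, f0):
    """Fit x[n] ~= A*cos(2*pi*f*n/N + phi), starting from a frequency guess f0 (in bins)."""
    N = len(x)
    n = np.arange(N)

    def residual(p):
        A, phi, f = p
        return A * np.cos(2 * np.pi * f * n / N + phi) - x

    # Start from unit amplitude, zero phase, and the caller's frequency guess;
    # the solver iterates, inspecting the residual each pass, until it converges.
    return least_squares(residual, x0=[1.0, 0.0, f0]).x

# Example: a sinusoid that falls between FFT bins.
N = 256
n = np.arange(N)
x = 0.7 * np.cos(2 * np.pi * 37.3 * n / N + 1.1)
print(fit_sinusoid(x, f0=37.0))   # should land near A=0.7, phi=1.1, f=37.3
```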
We can add rate-of-change of the amplitude (essentially slow amplitude modulation) as a new axis of freedom to the sinusoidal fit.
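One plausible form of such a basis function, written in notation of our own choosing, is

$$s[n] \;=\; \left(A + \Delta A\,\frac{n}{N}\right)\cos\!\left(\frac{2\pi \omega n}{N} + \phi\right),\qquad n = 0,\dots,N-1,$$

where ΔA is the total amplitude change across the block.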
(The above modifies the sinusoidal basis function to ramp up from zero amplitude, ramp down to zero amplitude, or anything in-between. When ΔA = 0 and ω is restricted to integer values, it's the regular discrete Fourier transform.)
This is not new information to the analysis process, as we can determine (and normally do determine) the same approximate amplitude change using successive FFTs. However, adding amplitude modulation as a fit parameter allows tighter modeling of a signal than computing an amplitude delta after the fact.
Strong periodic signals are often but not always fixed-frequency sinusoids. Audio also contains linear-frequency chirps, where the frequency changes linearly over time (these are also called 'quadratic' chirps because their phase is a quadratic function of time). Log-frequency chirps (with logarithmic frequency change over time) are an even more common occurrence in both music and audio in general. For convenience we can approximate log chirps with successive linear chirps.
Extension of the basis functions to cover arbitrary linear frequency chirps is straightforward:
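Again in our own notation, one such chirp basis function could be written as

$$s[n] \;=\; \left(A + \Delta A\,\frac{n}{N}\right)\cos\!\left(\frac{2\pi}{N}\left(\omega\, n + \frac{\Delta\omega\, n^2}{2N}\right) + \phi\right),$$

so that the instantaneous frequency sweeps linearly from ω to ω + Δω (in bins) across the block.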
We can generalize the discrete Fourier transform, which returns N amplitude values for N fixed, evenly-spaced sinusoids spanning from DC to the Nyquist frequency, with arbitrary chirps that have five real degrees of freedom: Amplitude, Phase, Frequency, Amplitude', and Frequency', where prime indicates first derivative. We could go further and add second derivatives or other tangentially related qualities. For now, we make the arbitrary decision that this 'feels like' a good set, and will be the basis of further research. Five degrees of freedom, none of them inherently quantized, already describe a large potential expansion of input information. Subject to subsequent research, it is more likely that five degrees is past the ideal number rather than short of it.
The FFT presents us with effectively three parameters per sinusoid: an amplitude, a phase (the two packed together as a single complex value), and an implicit, fixed basis frequency.
The MDCT is similar, but the output is a vector of real amplitudes rather than complex amplitudes. The total number of output values remains the same. Despite the amplitudes all being real values, they exhibit left/right symmetries/anti-symmetries that behave similarly to complex pairs.
For both the FFT and MDCT, these values are a unique solution. For any given input signal, the basis frequencies and resulting amplitude outputs at each basis frequency are a fixed mapping. Note again that the frequencies are often implicit and the amplitudes simply listed in order for each basis frequency.
The chirp domain is simultaneously massively over-complete and underdetermined. It is over-complete in the sense that there are far more candidate fit parameters than data points, but underdetermined in the sense that we're only interested in 'significant' fits, where deciding what constitutes 'significant' is a problem left to the researcher.
An FFT is able to provide a good-enough estimate of several parameters of a chirp fit. Specifically, the average amplitude, approximate center phase, and approximate center frequency of a chirp can be estimated directly from an FFT. We can estimate amplitude modulation from the results of two successive FFTs. The sticking point is estimating chirp rate (frequency modulation). Once the chirp rate exceeds approximately 0.25-1.0 bins per FFT, which we expect it to do regularly in natural audio and music, estimation using successive FFTs becomes noisy and error-prone.
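The sketch below illustrates how a single windowed FFT can supply those initial estimates. It is our own illustration, not Ghost's estimator; the Hann window and parabolic peak interpolation are arbitrary but common choices, and the chirp rate is simply initialized to zero.

```python
import numpy as np

def initial_estimate(x):
    """Rough single-peak estimates from one windowed FFT: amplitude, phase,
    frequency (in fractional bins), and a chirp rate initialized to zero."""
    N = len(x)
    w = np.hanning(N)
    X = np.fft.rfft(x * w)
    mag = np.abs(X)
    k = int(np.argmax(mag[1:-1])) + 1            # strongest interior bin
    # Refine the frequency by parabolic interpolation of the log magnitude.
    a, b, c = np.log(mag[k - 1:k + 2])
    delta = 0.5 * (a - c) / (a - 2 * b + c)
    freq = k + delta                             # fractional bins
    amp = 2.0 * mag[k] / w.sum()                 # rough amplitude (exact only on-bin)
    phase = float(np.angle(X[k]))                # approximate phase
    return amp, phase, freq, 0.0                 # chirp rate: start at zero
```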
At this point, we notice that our sinusoidal estimation code exhibits a wider range of convergence when used with a good window than noted in our original 2007 paper. When coupled with a nonlinear iteration that recenters both frequency and chirp rate, both convergence and output precision improve further. In particular, the convergence range of the chirp rate parameter can be widened arbitrarily, which is an especially lucky break. It becomes perfectly practical to fit chirps using an initial chirp rate estimate of zero, letting the linear solver find the answer without additional hinting. In-depth details of sinusoidal estimator behavior including window effects will be the subject of the next update.
The sinusoidal estimation has an additional advantage over a brute-force DCFT-style approach; because it fits in order of estimated energy, it tends to reduce the interference from sidelobe leakage. This reduction of sidelobe pollution, coupled with the fact that most chirps evolve from stable frequencies (or move slowly) appears to result in more than adequate performance from simply using an initial fit estimate of zero for the chirp rate.
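A toy sketch of fitting in order of estimated energy follows; it reuses the fit_sinusoid routine sketched earlier (passed in as a callable) and is our own construction, not Ghost's analysis loop. The strongest remaining component is estimated from the residual's FFT peak, refined, synthesized, and subtracted before the next fit, so its sidelobes no longer pollute the weaker fits.

```python
import numpy as np

def greedy_extract(x, count, fit_sinusoid):
    """Fit `count` sinusoids in descending order of estimated energy."""
    N = len(x)
    n = np.arange(N)
    residual = np.asarray(x, dtype=float).copy()
    components = []
    for _ in range(count):
        # Crude initial frequency guess: the residual's strongest FFT bin.
        spectrum = np.abs(np.fft.rfft(residual * np.hanning(N)))
        f0 = float(np.argmax(spectrum[1:]) + 1)
        A, phi, f = fit_sinusoid(residual, f0)
        components.append((A, phi, f))
        # Subtract the fitted component before fitting the next one.
        residual -= A * np.cos(2 * np.pi * f * n / N + phi)
    return components, residual
```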
Exceptionally tight chirp fitting potentially allows better energy compaction into fewer parameters, more than offsetting the fact that the parameter expansion means we'll need to code more data per sinusoid (it's a lovely theory, in any case). This is a benefit for analysis, allowing us to account for more of the signal energy, even if we later decide to discard or degrade the additional precision when coding rather than coding it at full depth.
Tight chirp fits grant another benefit; the mechanism by which we fit chirps also becomes the mechanism by which we track chirps from frame to frame. We can extrapolate chirps forward (and backward) seamlessly following sinusoidal energy as it moves between frames and FFT bins. Forward extrapolation of a chirp from a preceding frame into the new frame becomes the initial estimate to the chirp fit estimation algorithm. This eliminates the need to search for the chirp all over again in a new frame, and then try to match it to a preexisting chirp from earlier.
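A sketch of that forward extrapolation (in our own notation, with frequency in radians per sample and the slopes per sample): advancing a chirp's parameters across a frame of L samples yields the initial estimate for refitting the chirp in the next frame.

```python
def extrapolate_chirp(A, phi, w, dA, dw, L):
    """Extrapolate chirp parameters (amplitude, phase, frequency and their
    slopes) forward by L samples; the slopes carry over unchanged as guesses."""
    A_next   = A + dA * L
    w_next   = w + dw * L
    phi_next = phi + w * L + 0.5 * dw * L * L   # integral of w + dw*t over the frame
    return A_next, phi_next, w_next, dA, dw
```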
The practical tradeoff with going to a generalized chirp encoding is the expansion in the number and precision of parameters to encode. The hope is that better energy compaction into fewer chirps offsets the fact that each individual chirp is potentially more expensive to encode. That said, there are more and less expensive ways to encode chirps, some of which can reduce the additional expense while still potentially maintaining the advantages of the tighter fit. Only testing will tell.
Vorbis and other present-day monolithic transform codecs like AAC use the MDCT for their frequency domain transform. Such a codec will encode the logical equivalent of every MDCT output value in order to avoid coding frequency offsets, though coding will be optimized to assume many of the amplitude values will quantize to zero. Each frame's MDCT is stand-alone, meaning that when we eventually decode the frame, none of the data needed to reverse the MDCT transform relies on data from any other frames (though the output is overlapped with neighboring frames). No values are predicted across frames.
The amplitudes produced as output of an MDCT can be thought of as a specific, restricted set of Gabor atoms. Each amplitude and implicit basis frequency, along with our window function, describes a single Gabor atom. All the atoms are constrained to begin and end on a frame boundary, only atoms corresponding to the basis frequencies are possible, and the modulation parameters are always zero.
Viewing MDCT output as a highly-restricted set of Gabor atoms makes it clear that our proposed chirp fitting can be as simple as a relaxation of the constraints on the possible set of Gabor atoms we can choose to use. We still restrict atoms to beginning/ending on frame boundaries, but we no longer restrict the center frequency of the atom to a fixed basis set, nor do we clamp the modulation terms to zero.
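One way to write such a relaxed atom, in our own notation (with w[n] the frame window and primes denoting per-sample rates of change), is

$$g[n] \;=\; w[n]\,\bigl(A + A'\,n\bigr)\cos\!\bigl(\phi + \omega\, n + \tfrac{1}{2}\,\omega'\, n^{2}\bigr).$$

The MDCT's atoms are then the special case in which A' and ω' are clamped to zero and ω is restricted to the basis frequencies.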
This generalization allows us to code fewer sinusoids more precisely. The obvious tradeoff is that rather than each atom being described by a single explicit parameter (its amplitude, in an ordered list), each atom is now described by five explicit parameters.
As with the MDCT, a stand-alone frame requires that we do not predict or preserve any chirp parameter data across frames. We code into each frame all the data needed to reconstruct the time-domain signal.
We're not necessarily stuck coding five full-range parameters for every chirp. From frame to frame, most chirps predict a matching chirp in the following frame rather well, a trick much harder to achieve with the MDCT. If we remove the requirement that frames be strictly stand-alone (using no parameter information carried over from another frame), we can use the chirp parameters from a preceding frame to predict the chirp parameters of the current frame, much like inter-frame coding in video codecs.
At the beginning of a new frame, we know the phase, amplitude and instantaneous frequency of a previous frame's chirp at the frame boundary (as well as across the portion of the previous frame that overlaps with the current frame). If we're content to eliminate some axes of freedom when fitting a current-frame chirp, we can reduce the number of fit parameters by pinning the current-frame chirp to the preceding-frame chirp at some point in time. For purposes of illustration, let's assume we're fixing the current chirp to the previous chirp at the starting boundary of the current frame.
Up to now, we've defined the phase, amplitude and frequency parameters at the center of the fit window. We now alter the chirp fit to interpret phase, amplitude and frequency parameters at the left boundary of the chirp rather than the center, then fit the current frame's chirp with these three parameters fixed to the values taken from the preceding chirp at the frame boundary.
As the left edge of the chirp is fixed, the things that can change across the frame are the amplitude and instantaneous frequency. Thus, we've rendered three of our five parameters for the new frame's chirps implicit, leaving only the need to fit the amplitude and frequency modulation across the current frame.
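In our notation, the current frame's chirp might then be written as

$$s[n] \;=\; \Bigl(A_{0} + \Delta A\,\tfrac{n}{N}\Bigr)\cos\!\Bigl(\phi_{0} + \omega_{0}\, n + \tfrac{1}{2}\,\tfrac{\Delta\omega}{N}\, n^{2}\Bigr),\qquad n = 0,\dots,N-1,$$

where A₀, φ₀ and ω₀ are taken from the preceding chirp evaluated at the frame boundary, leaving only ΔA and Δω to be fitted.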
This two-parameter chirp arrangement still synthesizes chirps the same way as the five parameter chirps, using overlapped Gabor atoms. It merely borrows some chirp parameters from the preceding frame.
Rather than constructing and overlap-adding Gabor atoms, we can synthesize chirps directly from free-running oscillators, realizing a potential factor-of-two savings in synthesis cost. Analysis could be performed exactly as it is in the two-parameter chirp arrangement with overlapping frames above, but the natively synthesized frames would be half the size and non-overlapping.
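A minimal sketch of what such a free-running oscillator might look like (our own construction, not Ghost's synthesis code): the oscillator keeps a running phase, and each frame hands it target amplitude and frequency values that are reached by linear ramps across the frame.

```python
import numpy as np

class ChirpOscillator:
    def __init__(self, amp=0.0, freq=0.0, phase=0.0):
        # freq is in radians per sample; phase accumulates continuously.
        self.amp, self.freq, self.phase = amp, freq, phase

    def run_frame(self, target_amp, target_freq, length):
        """Synthesize one non-overlapped frame, ramping amplitude and
        frequency linearly from their current values to the targets."""
        out = np.empty(length)
        for i in range(length):
            t = i / length
            a = self.amp * (1 - t) + target_amp * t
            w = self.freq * (1 - t) + target_freq * t
            out[i] = a * np.cos(self.phase)
            self.phase += w          # per-sample phase accumulation
        self.amp, self.freq = target_amp, target_freq
        return out
```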
Continuous chirp synthesis is not exactly equivalent to the overlap-add schemes above as we're no longer adding together overlapped chirp atoms with potentially disagreeing parameters.
In the case of independently constructed Gabor atoms, the process of sewing together frames is as straightforward as it is in the MDCT (which is itself just a special case of Gabor atoms). Just as with the MDCT, we need only use a power-complementary window, then overlap and add.
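For reference, the power-complementary condition on a length-N window with 50% overlap is the familiar Princen-Bradley constraint

$$w^{2}[n] + w^{2}\!\left[n + \tfrac{N}{2}\right] = 1,\qquad 0 \le n < \tfrac{N}{2},$$

satisfied, for example, by the sine window $w[n] = \sin\!\bigl(\tfrac{\pi}{N}(n + \tfrac{1}{2})\bigr)$.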
Once we're constructing chirps from frame to frame as if they're the output of free-running oscillators, things become more complicated. We no longer have a smooth overlap/add window, and so we no longer have implicitly smooth transitions between sets of chirp parameters. Is piecewise linear chirp construction (discontinuous in the first derivative) sufficiently smooth for high fidelity reproduction? Or will additional transitional smoothing be necessary to move discontinuities into higher derivatives or eliminate them altogether?
Conceptually, there's no difficulty in constructing arbitrarily smooth transitioning between parameter updates. That said, smoothing increases computational complexity. In addition, smoothing not reflected in the original chirp fitting algorithm results in synthesized chirps that are not equivalent to the fitted parameters.