The first Ghost update was an introduction and overview of the underlying codec structure under research for Ghost. The codec's central novelty is that Ghost is to be a hybrid time/frequency codec that splits and separately encodes tonal and atonal content. This is not a new idea, though it has not yet been successfully deployed in a practical, general purpose codec for several good reasons. Part of the purpose of Ghost is to reevaluate hybrid coding to see if any new discoveries make the hybrid approach practical or advantageous.
The first demo already showed rudimentary sinusoidal estimation and fitting in action. Demo two demonstrated that our sinusoidal estimator was practically capable of tracking an isolated log-chirp. Now we get started on the hard part: using the sinusoidal estimator as part of a larger system that detects and tracks a large number of simultaneous sinusoids/chirps in real audio, analyzes the set for structure, evaluates for perceptual importance, and then splits the audio into fundamentally frequency-domain data and fundamentally time-domain data.
The first step in splitting out our sinusoids is detecting them in the first place. Conceptually this is relatively straightforward, but we're constrained by practical performance issues. Below are a few simple techniques explored so far to find obvious tones as well as to ferret out tones hidden in noise and other spectral structures.
This is a relatively naive technique that:
1. computes a windowed FFT of each input frame,
2. computes a critical band energy curve (a spreading function, similar to a masking curve) from the spectrum,
3. locates the local maxima (peaks) in the spectrum, and
4. selects as candidate sinusoids the peaks that exceed the energy curve by a fixed threshold.
Though I labeled this technique 'naive', it has solid empirical grounding as a good baseline for usable performance, and as such it's the starting point many other techniques build upon. It can easily be augmented further with perceptual weighting, tonality estimation, harmonic analysis and so on. For example, we may try to look for and 'fill in' weak harmonics that were not picked up in step 4.
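To make the baseline concrete, here is a minimal NumPy sketch of steps 1 through 4. The moving-average 'spreading' curve, the window choice, and the default threshold are illustrative stand-ins, not Ghost's actual critical band curve computation.

```python
import numpy as np

def naive_seed_peaks(frame, fs, thresh_db=4.0, spread_bins=15):
    """Steps 1-4 of the naive technique: windowed FFT, smoothed energy
    curve, peak location, thresholded selection.  The moving-average
    'spreading' below is only a stand-in for a real critical band curve."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    mag_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

    # Step 2: crude energy curve -- smooth the log spectrum so it rides
    # over local detail rather than following individual peaks.
    curve_db = np.convolve(mag_db, np.ones(spread_bins) / spread_bins, mode='same')

    # Step 3: local maxima of the magnitude spectrum.
    peaks = np.where((mag_db[1:-1] > mag_db[:-2]) &
                     (mag_db[1:-1] > mag_db[2:]))[0] + 1

    # Step 4: keep only peaks that clear the curve by the threshold.
    chosen = [p for p in peaks if mag_db[p] - curve_db[p] > thresh_db]
    return [(p * fs / n, mag_db[p]) for p in chosen]   # (Hz, dB) candidates
```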
Empirical evidence from past codec development (e.g., Speex and Vorbis) suggests that the critical band energy curve computed in step 2 above benefits from further modification to be more useful as a masking curve. In strongly tonal areas the energy curve rides well above the noise floor, while in atonal areas of the spectrum it tends to sit well under the audible noise. As such, noisy portions of the spectrum may trigger a substantial number of single-frame false positives as spurious peaks exceed the fixed sinusoid detection threshold.
Adding a 'tonality estimation' may be useful to bias sinusoid selection toward or against noisy areas. Though it's trivial to achieve the desired amount of noise rejection via such a weighting, quantifying what the desired level actually is requires listening tests; this is the next step of research and has not yet been performed. The curve computation used by Ghost is not equivalent to the curve used in Vorbis, and as such, empirical data can't be transferred directly.
Proper 'Phase Coherence' analysis performs a least-squares amplitude/phase fit for a given frequency at fixed intervals in time, then computes the 'coherence' as the inverse of the variance of the fitted phase (relative to its expected advance) over a given time period. The thought is that sinusoids may be more obviously visible via streaks of coherent phase (low phase variance), even when noise hides low-amplitude but steady tones.
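As a rough sketch of that idea (the analysis frequency, segment length, and inverse-variance statistic below are illustrative choices, not a specific published formulation):

```python
import numpy as np

def phase_coherence_at(x, f, fs, seg=256, n_seg=16):
    """Least-squares amplitude/phase fit of frequency f on successive
    segments of x; coherence is the inverse variance of the fitted phase."""
    phases = []
    for k in range(n_seg):
        s = x[k * seg:(k + 1) * seg]
        # Absolute time axis, so a steady tone fits the same phase in every segment.
        t = (np.arange(seg) + k * seg) / fs
        A = np.column_stack([np.cos(2 * np.pi * f * t),
                             np.sin(2 * np.pi * f * t)])
        (a, b), *_ = np.linalg.lstsq(A, s, rcond=None)
        phases.append(np.arctan2(-b, a))
    phases = np.unwrap(np.array(phases))
    return 1.0 / (np.var(phases) + 1e-12)   # large value = coherent tone
```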
Phase coherence computation has the same bootstrapping problem we're trying to solve: What frequencies to select for fitting? The 'Phase Coherence Spectrum' uses a fixed spacing, but this is not really appropriate here; we'd be duplicating some of what our sinusoidal estimator does, but in a more limited way for a high computational cost. What we want is to extract some of the same information, but at as low a cost as possible so that the expensive sinusoidal estimation converges as quickly as possible.
Thus we modify the phase coherence technique to track the relative phase variance of the individual bins of an FFT across time, using the bins with low phase variance to choose candidate sinusoids. What we end up with is akin to a bank of resonators that integrate coherent sinusoidal energy over a period of time.
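A minimal sketch of this per-bin variant follows; the FFT size, hop, and the use of the mean resultant length of the hop-compensated phase increments as the 'coherence' statistic are assumptions for illustration, not Ghost's actual parameters.

```python
import numpy as np

def bin_phase_coherence(x, n_fft=1024, hop=256, n_frames=32):
    """Per-bin phase variance across successive FFT frames.  Returns a
    value in [0, 1] per bin; values near 1 mark bins whose phase advances
    like a steady sinusoid and are candidates for seeding."""
    window = np.hanning(n_fft)
    # x must be at least (n_frames - 1) * hop + n_fft samples long.
    frames = np.array([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                       for i in range(n_frames)])
    phases = np.angle(frames)                          # (n_frames, n_bins)

    # Phase advance expected over one hop for a tone at each bin centre.
    expected = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft

    # Deviation of observed frame-to-frame phase increments from expectation;
    # the mean resultant length of those deviations is the coherence.
    dphi = np.diff(phases, axis=0) - expected
    return np.abs(np.mean(np.exp(1j * dphi), axis=0))
```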
Unfortunately, multiple practical problems render phase coherence a poor metric for sinusoid selection.
In short, phase coherence hasn't yet proven able to find useful sinusoidal information that isn't already apparent from simple energy thresholding. Looking at phase variance over long periods has shown no demonstrable advantage over simply feeding a longer time window to the FFT.
This does bring up interesting questions relating to perception and efficient coding of reverberation; I'm unaware of preexisting research on the subject, at least in the context of auditory modeling.
Cepstral analysis techniques (there are many variants) essentially look for strong harmonic structures in a frequency spectrum by performing a further Fourier transform (some variants use the forward transform, some the inverse) on a log-amplitude version of the frequency spectrum. A regular harmonic sequence is a repeating pattern of amplitude spikes; this isn't a periodic signal, but it behaves enough like one that the second transform consolidates the harmonics into a single, strong peak in the 'cepstral' domain.
The cepstrum is especially valuable for analyzing voice, where it excels at finding the fundamental and the basic formant structure of vowels and voiced consonants. Unfortunately, the cepstrum is only good at picking out a single harmonic structure; multi-voice sounds substantially impair its sensitivity (it can detect both voices, but with far less margin against noise), and pure sinusoids are completely invisible to the technique.
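For reference, here is a minimal real-cepstrum sketch for fundamental detection; the 60–500 Hz search range is an assumed vocal range, not anything dictated by Ghost.

```python
import numpy as np

def cepstral_fundamental(frame, fs, fmin=60.0, fmax=500.0):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum.  A regular
    harmonic series collapses into a peak at quefrency 1/f0 samples."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    cepstrum = np.fft.irfft(np.log(spectrum))

    # Only search quefrencies corresponding to plausible fundamentals.
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / q, cepstrum[q]        # (estimated f0 in Hz, peak strength)
```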
As expected, cepstral analysis excels at finding the fundamentals in voice samples. It's not clear that it's more useful than naive thresholding elsewhere.
Short tone bursts aren't perceived as tones, but rather as clicks or bands of noise with tonal coloration. The transition from click to tone is gradual and frequency dependent.
This suggests that acceptable sinusoidal seeding and tracking will need to be performed hierarchically, with higher frequencies using faster detection/tracking, and lower frequencies using slower tracking.
Though conceptually unrelated, multi-scale analysis should also be tested using slow, long-window FFTs to better see faint sinusoids hidden in noise. It's possible that sinusoidal sensation may not be a binary quantity, but that long tone bursts would be audible at much lower thresholds than short tone bursts (I expect some research has been done on this, I just need to find it). Do these tones contribute to sinusoidal sensation, or only to narrowband noise color?
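Purely as a sketch of what such a hierarchical, multi-scale analysis might look like (the window lengths and band edges below are arbitrary assumptions):

```python
import numpy as np

# Example scales: longer windows (slower tracking, finer frequency resolution)
# for low bands, shorter windows for high bands.
SCALES = [(4096, 0.0, 500.0),
          (1024, 500.0, 4000.0),
          (256, 4000.0, 24000.0)]    # (window length, band low Hz, band high Hz)

def multiscale_spectra(x, fs=48000):
    """One spectrum per scale; a hierarchical detector would only seed
    sinusoids from the band assigned to each scale."""
    out = []
    for n_fft, lo, hi in SCALES:
        frame = x[:n_fft] * np.hanning(n_fft)     # x must be >= 4096 samples
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
        band = (freqs >= lo) & (freqs < hi)
        out.append((freqs[band], mag[band]))
    return out
```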
As explained in demo 2, the mechanism by which we fit chirps also becomes the mechanism by which we track chirps from frame to frame. We can extrapolate chirps forward (and backward) seamlessly following sinusoidal energy as it moves between frames and FFT bins. Forward extrapolation of a chirp from a preceding frame into the new frame becomes the initial estimate to the chirp fit estimation algorithm. This eliminates the need to search for the chirp all over again in a new frame, and then try to match it to a preexisting chirp from earlier.
Naturally, the only question is whether this technique can actually be made to work in practice on real audio.
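A toy sketch of the forward extrapolation step, assuming for illustration a linear-in-frequency chirp described by amplitude, frequency, chirp rate, and phase (Ghost's actual chirp parameterization may differ):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Chirp:
    amplitude: float
    frequency: float     # Hz at the frame centre
    rate: float          # chirp rate, Hz per second
    phase: float         # radians at the frame centre

def extrapolate(c: Chirp, dt: float) -> Chirp:
    """Project a chirp forward by dt seconds; the result seeds the next
    frame's fit, which then refines it rather than searching from scratch."""
    f_new = c.frequency + c.rate * dt
    # Phase advances by the integral of instantaneous frequency over dt.
    ph_new = (c.phase + 2 * np.pi * (c.frequency + 0.5 * c.rate * dt) * dt) % (2 * np.pi)
    return Chirp(c.amplitude, f_new, c.rate, ph_new)
```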
At the left is an initial test run using compmono.wav. The audio track is the result synthesized from the sinusoids chosen and tracked. The video portion is a visualization of the tracking data. The top half of the pane is a rolling spectrogram with overlaid cyan lines showing the continuously tracked frequencies of each chirp. The lower half of the pane is the instantaneous spectrum synchronized to the audio. The green line is the FFT of the original audio, and each white 'lollipop' is a chirp that's being tracked.
This test was performed using a naive spreading function without further analysis (plotted as the white curve in the lower half of the telemetry video to the left). Any peak more than 4dB above the spreading function is seeded into the chirp tracking list. A seeded peak that tracks above the spreading function level for 100ms is preserved and output. A chirp is tracked until it drops below the spreading curve, at which point it is weeded from the list.
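Roughly, the seed/confirm/weed lifecycle used in this test looks like the following sketch; the data layout and names are hypothetical, not Ghost's internals.

```python
SEED_THRESH_DB = 4.0      # seed peaks this far above the spreading curve
CONFIRM_MS = 100.0        # keep a track only after it survives this long

def update_tracks(tracks, mag_db, curve_db, peak_bins, frame_ms):
    """tracks: list of dicts {'bin', 'age_ms', 'confirmed'};
    mag_db/curve_db: per-bin magnitude and spreading curve for this frame."""
    survivors = []
    for t in tracks:
        if mag_db[t['bin']] >= curve_db[t['bin']]:
            t['age_ms'] += frame_ms
            t['confirmed'] = t['confirmed'] or t['age_ms'] >= CONFIRM_MS
            survivors.append(t)
        # else: the chirp dropped below the spreading curve and is weeded.

    tracked = {t['bin'] for t in survivors}
    for b in peak_bins:
        if b not in tracked and mag_db[b] - curve_db[b] > SEED_THRESH_DB:
            survivors.append({'bin': b, 'age_ms': 0.0, 'confirmed': False})
    return survivors      # only confirmed tracks are synthesized and output
```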
For completeness, below are the original compmono track as well as the sinusoidal and 'residue' audio remaining after the sinusoidal content is subtracted.
In summary, a promising first attempt.
Another test, another test sample. Deeply layered electronic music (synthetic), crisp but subtle textures, a few very strong pure tones.
[BTW, go read Homestuck!]
Voice is one of my primary motivations for a parametric harmonic sinusoidal model; voice is very strongly and regularly harmonic and so it should model and code well. However, the hard threshold based only on a spreading function falls rather flat here. Although the estimator tracks the vocal harmonics nicely where it's seeded, the seeding is hitting correctly at best 50% of the time.
The sinusoidal seeding/fitting algorithm and the tonal/atonal audio splitting are supposed to produce two results:
1. a set of tracked sinusoids/chirps representing the fundamentally frequency-domain (tonal) content, and
2. a 'residue' signal, left over after the synthesized sinusoids are subtracted, representing the fundamentally time-domain (atonal) content.
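A toy sketch of that split, assuming whole-signal, constant-amplitude chirps given as (amplitude, frequency, chirp rate, phase) tuples; real tracked chirps vary frame to frame:

```python
import numpy as np

def split_tonal_atonal(x, chirps, fs):
    """Synthesize the tracked chirps and subtract them from the input,
    leaving the atonal 'residue'."""
    t = np.arange(len(x)) / fs
    tonal = np.zeros(len(x))
    for amp, f0, rate, phase in chirps:              # (amp, Hz, Hz/s, rad)
        tonal += amp * np.cos(2 * np.pi * (f0 * t + 0.5 * rate * t * t) + phase)
    return tonal, x - tonal                          # (sinusoidal part, residue)
```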
The obvious drawback of using freeform sinusoidal estimation over a fixed-basis transform is increased complexity, and it's not yet clear that the increased complexity delivers sufficient benefit. Furthermore, although this initial experiment produced approximately the hoped-for results, other secondary concerns remain: