The first Ghost update was an introduction and overview of the underlying codec structure under research for Ghost. The codec's central novelty is that Ghost is to be a hybrid time/frequency codec that splits and separately encodes tonal and atonal content. This is not a new idea, though it has not yet been successfully deployed in a practical, general purpose codec for several good reasons. Part of the purpose of Ghost is to reevaluate hybrid coding to see if any new discoveries make the hybrid approach practical or advantageous.
The first demo already showed rudimentary sinusoidal estimation and fitting in action. Demo two demonstrated that our sinusoidal estimator was practically capable of tracking an isolated log-chirp. Now we get started on the hard part: using the sinusoidal estimator as part of a larger system that detects and tracks a large number of simultaneous sinusoids/chirps in real audio, analyzes the set for structure, evaluates for perceptual importance, and then splits the audio into fundamentally frequency-domain data and fundamentally time-domain data.
The first step in splitting out our sinusoids is detecting them in the first place. Conceptually this is relatively straightforward, but we're constrained by practical performance issues. Below are a few simple techniques explored so far to find obvious tones as well as to ferret out tones hidden in noise and other spectral structures.
This is a relatively naive technique that:
1. computes a windowed FFT of each input frame,
2. computes a critical band energy curve (a spreading function, similar to a masking curve) from the spectrum,
3. locates the local maxima (peaks) in the spectrum, and
4. selects as candidate sinusoids the peaks that exceed the energy curve by a fixed threshold.
Though I labeled this technique 'naive', it has solid empirical grounding as a good baseline for usable performance, and as such it's the starting point many other techniques build upon. It can easily be augmented further with perceptual weighting, tonality estimation, harmonic analysis and so on. For example, we may try to look for and 'fill in' weak harmonics that were not picked up in step 4.
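To make the baseline concrete, here is a minimal NumPy sketch of steps 1 through 4. The moving-average 'spreading' curve, the window choice, and the default threshold are illustrative stand-ins, not Ghost's actual critical band curve computation.

```python
import numpy as np

def naive_seed_peaks(frame, fs, thresh_db=4.0, spread_bins=15):
    """Steps 1-4 of the naive technique: windowed FFT, smoothed energy
    curve, peak location, thresholded selection.  The moving-average
    'spreading' below is only a stand-in for a real critical band curve."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    mag_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

    # Step 2: crude energy curve -- smooth the log spectrum so it rides
    # over local detail rather than following individual peaks.
    curve_db = np.convolve(mag_db, np.ones(spread_bins) / spread_bins, mode='same')

    # Step 3: local maxima of the magnitude spectrum.
    peaks = np.where((mag_db[1:-1] > mag_db[:-2]) &
                     (mag_db[1:-1] > mag_db[2:]))[0] + 1

    # Step 4: keep only peaks that clear the curve by the threshold.
    chosen = [p for p in peaks if mag_db[p] - curve_db[p] > thresh_db]
    return [(p * fs / n, mag_db[p]) for p in chosen]   # (Hz, dB) candidates
```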
Empirical evidence from past codec development (e.g., Speex and Vorbis) suggests that the critical band energy curve computed in step 2 above benefits from further modification to be more useful as a masking curve. In strongly tonal areas the energy curve rides well above the noise floor, while in atonal areas of the spectrum it tends to sit well under the audible noise. As such, noisy portions of the spectrum may trigger a substantial number of single-frame false positives as spurious peaks exceed the fixed sinusoid detection threshold.
Adding a 'tonality estimation' may be useful to bias sinusoid selection toward or against noisy areas. Though it's trivial to achieve the desired amount of noise rejection via such a weighting, quantifying what the desired level actually is requires listening tests; this is the next step of research and has not yet been performed. The curve computation used by Ghost is not equivalent to the curve used in Vorbis, and as such, empirical data can't be transferred directly.
Proper 'Phase Coherence' analysis performs a least-squares amplitude/phase fit for a given frequency at fixed intervals in time, then computes the 'coherence' as the inverse of the variance of the fitted phase (relative to its expected advance) over a given time period. The thought is that sinusoids may be more obviously visible via streaks of coherent phase (low phase variance), even when noise hides low-amplitude but steady tones.
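As a rough sketch of that idea (the analysis frequency, segment length, and inverse-variance statistic below are illustrative choices, not a specific published formulation):

```python
import numpy as np

def phase_coherence_at(x, f, fs, seg=256, n_seg=16):
    """Least-squares amplitude/phase fit of frequency f on successive
    segments of x; coherence is the inverse variance of the fitted phase."""
    phases = []
    for k in range(n_seg):
        s = x[k * seg:(k + 1) * seg]
        # Absolute time axis, so a steady tone fits the same phase in every segment.
        t = (np.arange(seg) + k * seg) / fs
        A = np.column_stack([np.cos(2 * np.pi * f * t),
                             np.sin(2 * np.pi * f * t)])
        (a, b), *_ = np.linalg.lstsq(A, s, rcond=None)
        phases.append(np.arctan2(-b, a))
    phases = np.unwrap(np.array(phases))
    return 1.0 / (np.var(phases) + 1e-12)   # large value = coherent tone
```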
Phase coherence computation has the same bootstrapping problem we're trying to solve: What frequencies to select for fitting? The 'Phase Coherence Spectrum' uses a fixed spacing, but this is not really appropriate here; we'd be duplicating some of what our sinusoidal estimator does, but in a more limited way for a high computational cost. What we want is to extract some of the same information, but at as low a cost as possible so that the expensive sinusoidal estimation converges as quickly as possible.
Thus we modify the phase coherence technique to track the relative phase variance of the individual bins of an FFT across time, using the bins with low phase variance to choose candidate sinusoids. What we end up with is akin to a bank of resonators that integrate coherent sinusoidal energy over a period of time.
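A minimal sketch of this per-bin variant follows; the FFT size, hop, and the use of the mean resultant length of the hop-compensated phase increments as the 'coherence' statistic are assumptions for illustration, not Ghost's actual parameters.

```python
import numpy as np

def bin_phase_coherence(x, n_fft=1024, hop=256, n_frames=32):
    """Per-bin phase variance across successive FFT frames.  Returns a
    value in [0, 1] per bin; values near 1 mark bins whose phase advances
    like a steady sinusoid and are candidates for seeding."""
    window = np.hanning(n_fft)
    # x must be at least (n_frames - 1) * hop + n_fft samples long.
    frames = np.array([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                       for i in range(n_frames)])
    phases = np.angle(frames)                          # (n_frames, n_bins)

    # Phase advance expected over one hop for a tone at each bin centre.
    expected = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft

    # Deviation of observed frame-to-frame phase increments from expectation;
    # the mean resultant length of those deviations is the coherence.
    dphi = np.diff(phases, axis=0) - expected
    return np.abs(np.mean(np.exp(1j * dphi), axis=0))
```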
Unfortunately, multiple practical problems render phase coherence a poor metric for sinusoid selection.
In short, phase coherence hasn't yet proven able to find useful sinusoidal information that isn't already apparent from simple energy thresholding. Looking at phase variance over long periods has shown no demonstrable advantage over simply feeding a longer time window to the FFT.
This does bring up interesting questions relating to perception and efficient coding of reverberation; I'm unaware of preexisting research on the subject, at least in the context of auditory modeling.
Cepstral analysis techniques (there are many variants) essentially look for strong harmonic structures in a frequency spectrum by performing a further Fourier transform (some variants use the forward transform, some the inverse) on a log-amplitude version of the frequency spectrum. A regular harmonic sequence is a repeating pattern of amplitude spikes; this isn't a periodic signal, but it behaves enough like one that the second transform consolidates the harmonics into a single, strong peak in the 'cepstral' domain.
The cepstrum is especially valuable for analyzing voice, where it excels at finding the fundamental and the basic formant structure of vowels and voiced consonants. Unfortunately, the cepstrum is only good at picking out a single harmonic structure; multi-voice sounds substantially impair its sensitivity (it can detect both voices, but with far less margin against noise), and pure sinusoids are completely invisible to the technique.
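For reference, here is a minimal real-cepstrum sketch for fundamental detection; the 60–500 Hz search range is an assumed vocal range, not anything dictated by Ghost.

```python
import numpy as np

def cepstral_fundamental(frame, fs, fmin=60.0, fmax=500.0):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum.  A regular
    harmonic series collapses into a peak at quefrency 1/f0 samples."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    cepstrum = np.fft.irfft(np.log(spectrum))

    # Only search quefrencies corresponding to plausible fundamentals.
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / q, cepstrum[q]        # (estimated f0 in Hz, peak strength)
```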
As expected, cepstral analysis excels at finding the fundamentals in voice samples. It's not clear that it's more useful than naive thresholding elsewhere.
Short tone bursts aren't perceived as tones, but rather as clicks or bands of noise with tonal coloration. The transition from click to tone is gradual and frequency dependent.
This suggests that acceptable sinusoidal seeding and tracking will need to be performed hierarchically, with higher frequencies using faster detection/tracking, and lower frequencies using slower tracking.
Though conceptually unrelated, multi-scale analysis should also be tested using slow, long-window FFTs to better see faint sinusoids hidden in noise. It's possible that sinusoidal sensation may not be a binary quantity, but that long tone bursts would be audible at much lower thresholds than short tone bursts (I expect some research has been done on this, I just need to find it). Do these tones contribute to sinusoidal sensation, or only to narrowband noise color?
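Purely as a sketch of what such a hierarchical, multi-scale analysis might look like (the window lengths and band edges below are arbitrary assumptions):

```python
import numpy as np

# Example scales: longer windows (slower tracking, finer frequency resolution)
# for low bands, shorter windows for high bands.
SCALES = [(4096, 0.0, 500.0),
          (1024, 500.0, 4000.0),
          (256, 4000.0, 24000.0)]    # (window length, band low Hz, band high Hz)

def multiscale_spectra(x, fs=48000):
    """One spectrum per scale; a hierarchical detector would only seed
    sinusoids from the band assigned to each scale."""
    out = []
    for n_fft, lo, hi in SCALES:
        frame = x[:n_fft] * np.hanning(n_fft)     # x must be >= 4096 samples
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
        band = (freqs >= lo) & (freqs < hi)
        out.append((freqs[band], mag[band]))
    return out
```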
As explained in demo 2, the mechanism by which we fit chirps also becomes the mechanism by which we track chirps from frame to frame. We can extrapolate chirps forward (and backward) seamlessly following sinusoidal energy as it moves between frames and FFT bins. Forward extrapolation of a chirp from a preceding frame into the new frame becomes the initial estimate to the chirp fit estimation algorithm. This eliminates the need to search for the chirp all over again in a new frame, and then try to match it to a preexisting chirp from earlier.
Naturally, the only question is whether this technique can actually be made to work in practice on real audio.
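A toy sketch of the forward extrapolation step, assuming for illustration a linear-in-frequency chirp described by amplitude, frequency, chirp rate, and phase (Ghost's actual chirp parameterization may differ):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Chirp:
    amplitude: float
    frequency: float     # Hz at the frame centre
    rate: float          # chirp rate, Hz per second
    phase: float         # radians at the frame centre

def extrapolate(c: Chirp, dt: float) -> Chirp:
    """Project a chirp forward by dt seconds; the result seeds the next
    frame's fit, which then refines it rather than searching from scratch."""
    f_new = c.frequency + c.rate * dt
    # Phase advances by the integral of instantaneous frequency over dt.
    ph_new = (c.phase + 2 * np.pi * (c.frequency + 0.5 * c.rate * dt) * dt) % (2 * np.pi)
    return Chirp(c.amplitude, f_new, c.rate, ph_new)
```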
At the left is an initial test run using compmono.wav. The audio track is the result synthesized from the sinusoids chosen and tracked. The video portion is a visualization of the tracking data. The top half of the pane is a rolling spectrogram with overlaid cyan lines showing the continuously tracked frequencies of each chirp. The lower half of the pane is the instantaneous spectrum synchronized to the audio. The green line is the FFT of the original audio, and each white 'lollipop' is a chirp that's being tracked.
This test was performed using a naive spreading function without further analysis (plotted as the white curve in the lower half of the telemetry video to the left). Any peak more than 4dB above the spreading function is seeded into the chirp tracking list. A seeded peak that tracks above the spreading function level for 100ms is preserved and output. A chirp is tracked until it drops below the spreading curve, at which point it is weeded from the list.
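Roughly, the seed/confirm/weed lifecycle used in this test looks like the following sketch; the data layout and names are hypothetical, not Ghost's internals.

```python
SEED_THRESH_DB = 4.0      # seed peaks this far above the spreading curve
CONFIRM_MS = 100.0        # keep a track only after it survives this long

def update_tracks(tracks, mag_db, curve_db, peak_bins, frame_ms):
    """tracks: list of dicts {'bin', 'age_ms', 'confirmed'};
    mag_db/curve_db: per-bin magnitude and spreading curve for this frame."""
    survivors = []
    for t in tracks:
        if mag_db[t['bin']] >= curve_db[t['bin']]:
            t['age_ms'] += frame_ms
            t['confirmed'] = t['confirmed'] or t['age_ms'] >= CONFIRM_MS
            survivors.append(t)
        # else: the chirp dropped below the spreading curve and is weeded.

    tracked = {t['bin'] for t in survivors}
    for b in peak_bins:
        if b not in tracked and mag_db[b] - curve_db[b] > SEED_THRESH_DB:
            survivors.append({'bin': b, 'age_ms': 0.0, 'confirmed': False})
    return survivors      # only confirmed tracks are synthesized and output
```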
For completeness, below are the original compmono track as well as the sinusoidal and 'residue' audio remaining after the sinusoidal content is subtracted.
In summary, a promising first attempt.
Another test, another test sample. Deeply layered electronic music (synthetic), crisp but subtle textures, a few very strong pure tones.
[BTW, go read Homestuck!]
Voice is one of my primary motivations for a parametric harmonic sinusoidal model; voice is very strongly and regularly harmonic and so it should model and code well. However, the hard threshold based only on a spreading function falls rather flat here. Although the estimator tracks the vocal harmonics nicely where it's seeded, the seeding is hitting correctly at best 50% of the time.
The sinusoidal seeding/fitting algorithm and the tonal/atonal audio splitting are supposed to produce two results:
1. a set of tracked sinusoids/chirps representing the fundamentally frequency-domain (tonal) content, and
2. a 'residue' signal, left over after the synthesized sinusoids are subtracted, representing the fundamentally time-domain (atonal) content.
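A toy sketch of that split, assuming whole-signal, constant-amplitude chirps given as (amplitude, frequency, chirp rate, phase) tuples; real tracked chirps vary frame to frame:

```python
import numpy as np

def split_tonal_atonal(x, chirps, fs):
    """Synthesize the tracked chirps and subtract them from the input,
    leaving the atonal 'residue'."""
    t = np.arange(len(x)) / fs
    tonal = np.zeros(len(x))
    for amp, f0, rate, phase in chirps:              # (amp, Hz, Hz/s, rad)
        tonal += amp * np.cos(2 * np.pi * (f0 * t + 0.5 * rate * t * t) + phase)
    return tonal, x - tonal                          # (sinusoidal part, residue)
```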
The obvious drawback of using freeform sinusoidal estimation over a fixed-basis transform is increased complexity, and it's not yet clear that the increased complexity delivers sufficient benefit. Furthermore, although this initial experiment produced approximately the hoped-for results, other secondary concerns remain: