This is an overview of the LPCNet algorithm. The left part of the network (yellow) is computed once per frame and its result is held constant throughout the frame for the sample rate network on the right (blue). The compute prediction block predicts the sample at time t based on previous samples and on the linear prediction coefficients.
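As a rough illustration of the compute prediction block described above, the prediction at time t can be written as a weighted sum of the previous samples, p_t = a_1 s_{t-1} + ... + a_M s_{t-M}. The sketch below is not the LPCNet implementation. The function names, the order M (taken from the length of the coefficient list), and the `excitation` input that stands in for the output of the sample rate network are all assumptions made for illustration.

```python
def predict_sample(history, lpc):
    """Linear prediction: p_t = sum over k of lpc[k-1] * s_{t-k}.

    history: previous samples, most recent last.
    lpc: coefficients a_1..a_M (hypothetical layout; order M = len(lpc)).
    """
    order = len(lpc)
    if len(history) < order:
        raise ValueError("need at least `order` previous samples")
    # history[-k] is the sample s_{t-k}
    return sum(lpc[k - 1] * history[-k] for k in range(1, order + 1))


def synthesize_frame(history, lpc, excitation):
    """Generate one frame of samples with lpc held constant for the frame.

    excitation is a placeholder for what the sample rate network would
    contribute at each step (an assumption; the network is not modeled).
    """
    out = list(history)
    for e in excitation:
        p = predict_sample(out, lpc)  # prediction from previous samples
        out.append(p + e)             # new sample = prediction + excitation
    return out[len(history):]


frame = synthesize_frame([1.0, 2.0], [0.5, 0.25], [0.0, 0.0])
print(frame)
```

Holding `lpc` fixed inside `synthesize_frame` mirrors the split in the figure: the coefficients change once per frame, while the prediction is recomputed for every sample.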
Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS. This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.
MUSHRA Test Results
We conducted a subjective listening test with a MUSHRA-derived methodology, where 8 utterances (2 male and 2 female speakers) were each evaluated by 100 participants. The results below show that the quality of LPCNet significantly exceeds that of WaveRNN at equal complexity. Alternatively, they show that the same quality is possible at a significantly reduced complexity.
Subjective quality (MUSHRA) results as a function of the number of units in the main GRU.
Hear For Yourself
Here are two of the samples that were used in the listening test above.
Comparing the speech synthesis quality of LPCNet with that of WaveRNN+.
This demo will work best with a browser that supports Ogg/Opus in HTML5 (Firefox, Chrome and Opera do), but if Opus support is missing the file will be played as FLAC, WAV, or high bitrate MP3.
N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, Efficient Neural Audio Synthesis, 2018.