Optimising Neural Speech Codecs for 200bps Communication using Reinforcement Learning

Anonymous Authors
ClariCodec Training Framework

Overview of the two-stage training framework of ClariCodec. In Stage 1, the full codec is trained end-to-end with a combination of L1 mel reconstruction loss, adversarial loss, and feature matching loss to ensure high-fidelity speech reconstruction. In Stage 2, all modules except the encoder are frozen, and the encoder is fine-tuned with an RL objective whose reward signal is derived from a pretrained ASR model, explicitly optimising for speech intelligibility. An L1 mel reconstruction loss is retained to prevent perceptual degradation during RL optimisation.
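
The Stage 2 objective in the caption can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes a REINFORCE-style policy-gradient surrogate, a reward equal to negative WER, a batch-mean baseline, and a hypothetical weighting parameter `mel_weight`. None of these details are given in the text above.

```python
def stage2_loss(log_probs, wers, mel_l1, mel_weight=1.0):
    """Hypothetical Stage 2 loss for a batch of sampled utterances.

    log_probs: summed log-probability of the sampled tokens, per utterance
    wers:      WER of the ASR transcript of the decoded audio, per utterance
    mel_l1:    L1 mel reconstruction loss, per utterance
    """
    n = len(wers)
    rewards = [-w for w in wers]        # assumption: reward = -WER
    baseline = sum(rewards) / n         # assumption: batch-mean baseline
    # REINFORCE surrogate: minimising it raises the log-probability of
    # samples whose reward is above the baseline. In an autodiff framework
    # the advantage (r - baseline) would be treated as a constant.
    policy_term = -sum((r - baseline) * lp
                       for r, lp in zip(rewards, log_probs)) / n
    mel_term = sum(mel_l1) / len(mel_l1)
    return policy_term + mel_weight * mel_term
```

When every utterance in the batch has the same WER, the policy term vanishes and only the mel term remains, which is one way to see how the reconstruction loss anchors the encoder while the reward carries no signal.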

Abstract

In bandwidth-constrained communication, such as over satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.
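
To make the two ingredients of the abstract concrete, here is a minimal sketch of a stochastic quantiser and a WER computation. The abstract states only that quantisation is treated as a stochastic policy and that rewards are WER-driven; the distance-based softmax, the `temperature` parameter, and both function names are assumptions made for illustration.

```python
import math
import random

def quantise_stochastic(latent, codebook, temperature=1.0, rng=random):
    """Sample a codebook index from a softmax over negative squared distances.

    Assumption: this distance-based policy is illustrative only.
    Returns (index, log-probability of the sampled index).
    """
    logits = [-sum((x - c) ** 2 for x, c in zip(latent, code)) / temperature
              for code in codebook]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    weights = [math.exp(lp) for lp in log_probs]
    index = rng.choices(range(len(codebook)), weights=weights)[0]
    return index, log_probs[index]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

The log-probability returned by the quantiser is the quantity a policy-gradient method would weight by the reward. As a check on the reported figures, the test-clean improvement is (3.68 - 3.20) / 3.68 ≈ 0.13, matching the stated 13% relative reduction.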

Main Results Samples

Table 1. Speech reconstruction samples for the main results. Samples 1-3 are clean; Samples 4-6 are noisy.
System | Ground Truth | ClariCodec w/o RL | ClariCodec | Encodec | StableCodec-700 | FlexiCodec | SAC | WavTokenizer | Socodec | StableCodec-400 | SemantiCodec | LSCodec
Bitrate (bps) | - | 200 | 200 | 750 | 700 | 640 | 525 | 480 | 466 | 400 | 312.5 | 250
sample1
sample2
sample3
sample4
sample5
sample6