Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

Anonymous Authors

Overview of the two-stage training framework of ClariCodec. In Stage 1, the full codec is trained end-to-end with a combination of L1 mel reconstruction loss, adversarial loss, and feature matching loss to ensure high-fidelity speech reconstruction. In Stage 2, all modules except the encoder are frozen, and the encoder is fine-tuned with an RL objective whose reward is derived from a pretrained ASR model, explicitly optimising for speech intelligibility. An L1 mel reconstruction loss is also applied in Stage 2 to prevent perceptual degradation during RL optimisation.
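As a rough illustration of the Stage 2 objective described above, the sketch below computes a word error rate from an ASR transcript and combines a REINFORCE-style term with an L1 mel regulariser. The function names, the baseline, the choice of reward, and the weight `mel_weight` are assumptions made for illustration; the exact reward shaping and loss weighting used by ClariCodec are not stated here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def stage2_loss(log_prob, reward, baseline, mel_l1, mel_weight=1.0):
    """Hypothetical scalar Stage 2 loss: policy-gradient surrogate plus L1 mel term.

    log_prob   : log-probability of the sampled codes under the encoder policy
    reward     : intelligibility reward, e.g. 1 - WER (an assumed choice)
    baseline   : variance-reduction baseline (an assumption, not from the paper)
    mel_l1     : L1 mel reconstruction loss of the decoded speech
    mel_weight : assumed weight balancing the two terms
    """
    advantage = reward - baseline
    return -advantage * log_prob + mel_weight * mel_l1


# Example: one substituted word out of three gives a WER of 1/3.
error_rate = wer("the cat sat", "the bat sat")
loss = stage2_loss(log_prob=-2.0, reward=1.0 - error_rate, baseline=0.5,
                   mel_l1=0.1, mel_weight=2.0)
```

In a real training loop, `log_prob` and `mel_l1` would be differentiable tensors and only the encoder parameters would receive gradients, in line with the frozen-module setup described above.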

Abstract

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL) based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 8.93% on test-other, corresponding to a 23% relative reduction on test-clean, while preserving perceptual quality.
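To make the idea of quantisation as a stochastic policy concrete, the following minimal sketch treats the choice of a codebook entry as a sample from a categorical distribution and returns the log-probability that a policy-gradient update would need. The distance-based softmax, the `temperature` parameter, and the toy codebook are assumptions for illustration only; the abstract does not specify how ClariCodec parameterises its policy.

```python
import math
import random


def quantiser_policy(latent, codebook, temperature=1.0):
    """Categorical distribution over codebook entries.

    Assumed parameterisation: softmax of negative squared Euclidean
    distances between the latent vector and each codebook entry.
    """
    logits = [
        -sum((x - c) ** 2 for x, c in zip(latent, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_code(probs, rng):
    """Sample a codebook index and return it with its log-probability."""
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, math.log(probs[index])


# Toy example with a hypothetical three-entry, two-dimensional codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
probs = quantiser_policy([0.9, 1.1], codebook, temperature=0.5)
index, log_prob = sample_code(probs, random.Random(0))
```

Deterministic nearest-neighbour quantisation corresponds to always picking the most probable entry; sampling instead yields a log-probability for each chosen code, which is what lets a non-differentiable reward such as WER drive the encoder update.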

Main Results Samples

**Table 1.** Speech reconstruction samples for the main results. Samples 1-3 are clean; Samples 4-6 are noisy.
	Ground Truth	ClariCodec w/o RL	ClariCodec	Encodec	StableCodec-700	FlexiCodec	SAC	WavTokenizer	Socodec	StableCodec-400	SemantiCodec
Bitrate (bps)	-	300	300	750	700	640	525	480	466	400	312.5
sample1
sample2
sample3
sample4
sample5
sample6