rupakbose/Diffusion-ASR

Automatic speech recognition using Diffusion language models with uncertainty quantification

Traditional ASR models such as Zipformer, Conformer, ESPnet, Whisper, and wav2vec are sequence-to-sequence models. They are autoregressive: they predict one token at a time, conditioned on the tokens generated so far. This creates an inference-speed bottleneck when transcribing long sentences.

Large Language Diffusion Models (LLaDA) [18 Oct 2025] is a diffusion model that treats language generation as probabilistic inference. It is a masked language model that uses an iterative denoising process to generate tokens in parallel, in contrast to sequence-to-sequence models. A transformer is pretrained for masked-token prediction.

Once pretrained, it goes through supervised fine-tuning (SFT) on a dataset of prompt-response pairs. The response tokens are masked according to a probability distribution, and the loss is computed only on the masked predictions. During inference, we start from a completely masked response and iteratively denoise it, unmasking predicted tokens over a number of sampling steps.
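The iterative unmasking described above can be sketched as follows. This is a minimal NumPy illustration, not the repository's code: `toy_model` stands in for the trained transformer, the vocabulary size and `MASK_ID` are hypothetical, and the schedule (unmask the most confident masked positions each step) is one common choice.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id; real vocabulary ids are >= 0

def toy_model(ids):
    # Stand-in for the trained transformer: deterministic toy logits over a vocab of 5.
    rng = np.random.default_rng(abs(int(ids.sum())))
    return rng.standard_normal((len(ids), 5))

def denoise(length, steps):
    ids = np.full(length, MASK_ID)  # start from a fully masked response
    for step in range(steps):
        logits = toy_model(ids)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        masked = ids == MASK_ID
        if not masked.any():
            break
        # unmask roughly an even share of the remaining masked positions,
        # picking the most confident ones first
        k = max(1, int(masked.sum()) // (steps - step))
        order = np.argsort(np.where(masked, -conf, np.inf))
        ids[order[:k]] = pred[order[:k]]
    return ids
```

After `steps` iterations every position has been committed to a vocabulary token, which is why parallel decoding avoids the token-by-token bottleneck of autoregressive models.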

In this project, we adapt LLaDA's supervised fine-tuning protocol for ASR and generate transcriptions for a reference speaker using the denoising process.

This project has three parts:

We utilise audio features in place of prompts and transcripts in place of responses, and perform masked modelling. The architecture is as follows:

Contrastive pre-training for speaker identification

Initially, we train a transformer model to produce speaker-identity embeddings. We use contrastive learning to pull audio features of the same speaker closer together in Euclidean space and push features of different speakers apart with a margin of 1. We use a Siamese network to obtain anchor, positive, and negative features, as below.

contrastive
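The objective above amounts to a triplet loss with Euclidean distance and margin 1. A minimal NumPy sketch (the repository trains this with a PyTorch Siamese network; the function name here is hypothetical):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull same-speaker embeddings together, push different-speaker
    # embeddings at least `margin` further away than the positives.
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

The loss is zero once every negative is at least `margin` further from the anchor than its positive, so training stops pushing pairs that are already well separated.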

SFT mask prediction from noisy audio

A noisy input is generated by taking the signal of the reference speaker, sampling several other speakers and background sounds, and mixing them at different signal-to-noise ratios (-20 dB to 5 dB). The conditional unmasking algorithm takes this noisy input together with the reference speaker's feature as a condition, so that the network decodes and transcribes only what that speaker is saying. This is done as below.

conditional
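The mixing step can be sketched as follows: scale the interference so the mixture hits a target SNR relative to the reference speaker, with the target drawn uniformly from the stated range. This is a NumPy sketch under those assumptions, not the repository's data-loading code.

```python
import numpy as np

def mix_at_snr(reference, interference, snr_db):
    # Scale the interference so the mixture has the target SNR (in dB)
    # relative to the reference speaker's signal power.
    p_ref = np.mean(reference ** 2)
    p_int = np.mean(interference ** 2)
    scale = np.sqrt(p_ref / (p_int * 10.0 ** (snr_db / 10.0)))
    return reference + scale * interference

def sample_noisy_input(reference, interferers, rng):
    # Mix each sampled interferer at an SNR drawn from [-20 dB, 5 dB].
    mixture = reference.copy()
    for interference in interferers:
        mixture = mix_at_snr(mixture, interference, rng.uniform(-20.0, 5.0))
    return mixture
```

At 0 dB the scaled interference carries exactly the reference's power; at -20 dB it carries 100x more, which is what makes the speaker-conditioning necessary.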

Uncertainty quantification in generated transcription

Because the input audio features are a mixture of signals and noise, the output will also be noisy. Traditionally the output is assumed to be $X$, but in this case we assume it to be $\tilde{X} = X + \epsilon$, where $\epsilon$ is normally distributed. To achieve this, we run a parameterised transformer that predicts $\mu$ and $\sigma$. We perform variational inference by drawing a sample from $N(\mu, \sigma)$ and computing the logits. At inference, $\mu$ gives the predicted logits and $\sigma$ the associated uncertainty. This is shown below. => See inference with uncertainty heatmap here <=

uncertainty
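The sampling step is the standard reparameterisation trick: draw $\epsilon \sim N(0, 1)$ and form $\mu + \sigma \epsilon$ so gradients flow through $\mu$ and $\sigma$. A minimal sketch, assuming the network outputs $\log \sigma$ for numerical stability (the parameterisation here is an assumption, not confirmed from the repository):

```python
import numpy as np

def sample_logits(mu, log_sigma, rng):
    # Reparameterised draw from N(mu, sigma), used during training for
    # variational inference; at inference mu is the point prediction and
    # exp(log_sigma) the per-token uncertainty.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps
```

As `log_sigma` goes to large negative values the sample collapses onto `mu`, recovering the deterministic prediction.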

Additionally, we adapt the scaled loss in Algorithm 2 to optimise our network and show that transcription is possible using the diffusion protocol.

Algo
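A sketch of the scaled loss: cross entropy over the masked response tokens only, scaled by $1/t$ where $t$ is the sampled masking ratio, as in LLaDA's Algorithm 2. The normalisation by the masked-token count is this sketch's assumption; check the repository's `train.py` for the exact adaptation.

```python
import numpy as np

def scaled_masked_loss(logits, targets, mask, t):
    # Cross entropy on masked response tokens only, scaled by 1/t
    # (t = the sampled masking ratio), adapted from LLaDA's Algorithm 2.
    z = logits - logits.max(-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    n_masked = max(int(mask.sum()), 1)
    return float((mask * nll).sum() / (t * n_masked))
```

The $1/t$ factor upweights batches where few tokens were masked, so every masking ratio contributes comparably to the gradient.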

Pipeline and Contributions

  1. The audio file is processed by Facebook's wav2vec to give audio_features of shape [bs, h, 768].
  2. These audio_features are then processed for padding by randomly sampling window_size indices < h and sorting them to get src [bs, window_size, 768]. Other methods are possible, but we did this to keep the sampled features temporally coherent while dropping the rest.
  3. The transcription is tokenized and padded to transcription_length to get input_ids.
  4. The input_ids are masked using a Bernoulli distribution to get masked_ids. The src and masked_ids are the inputs to the transformer model.
  5. The loss is computed between the predicted masked tokens and the ground truth.
  6. Gradient clipping is applied for stability.
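Steps 2 and 4 of the pipeline can be sketched as below. This is a NumPy illustration of the sampling and masking logic; the function name and the placeholder `mask_id` are hypothetical, and the real pipeline operates on PyTorch tensors.

```python
import numpy as np

def prepare_batch(audio_features, input_ids, window_size, mask_prob, mask_id, rng):
    # Step 2: sample sorted frame indices -> a temporally coherent window.
    # Step 4: Bernoulli-mask the transcript ids for diffusion training.
    bs, h, d = audio_features.shape
    idx = np.sort(rng.choice(h, size=window_size, replace=False))
    src = audio_features[:, idx, :]                  # [bs, window_size, d]
    mask = rng.random(input_ids.shape) < mask_prob   # Bernoulli mask
    masked_ids = np.where(mask, mask_id, input_ids)
    return src, masked_ids, mask
```

Sorting the sampled indices preserves temporal order, so dropped frames shorten the sequence without scrambling it.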

Implementation Highlights

| Component | Status | Description |
| --- | --- | --- |
| Transformer Model | Self | Full implementation of attention and positional encoding. |
| Masking Strategy | Self | Bernoulli-based masking for diffusion denoising. |
| Scaled Loss | Algorithm 2 | Adapted from LLaDA for supervised fine-tuning (SFT). |

Diffusion ASR Training (Docker Compose)

This project contains a self-contained Docker environment for training an ASR (Automatic Speech Recognition) diffusion model. All dependencies, code, and logs live inside the container, so no host mounting is required.

You can run training interactively and save metrics/loss plots as SVG images.


Prerequisites

  • Linux with NVIDIA GPU + Docker + NVIDIA Container Toolkit
  • docker & docker-compose installed
  • CUDA 12.4 compatible GPU drivers

Build Docker Image

docker compose build

This copies the mini dataset and its processed .pt files into the /app folder.

Run Interactive Container

docker compose run --rm asr

You will get a bash prompt inside the container at /app. All code, logs, and checkpoints are inside the container.

Once inside the interactive container shell, to run training,

root@xxxxxx:/app# python3 train.py

To run inference, set the audio path, the number of denoising steps, and the checkpoint path in inference.py, then run

root@xxxxxx:/app# python3 inference.py

Training loss and metric curves

Training loss (masked cross entropy)
Character Error Rate (CER)
Word Error Rate (WER)

ASR inference using 3000 denoising steps with uncertainty heatmap

ASR inference using 3000 denoising steps

ASR inference using 100 denoising steps
