Traditional ASR models such as Zipformer, Conformer, Whisper, and wav2vec (and toolkits like ESPnet) are sequence-to-sequence models. They are autoregressive: they predict the next token in the sequence, which creates an inference-speed bottleneck when transcribing long utterances.
Large Language Diffusion Models (LLaDA) [18 Oct 2025] is a diffusion model that frames language generation as probabilistic inference. It is a masked language model that uses an iterative denoising process to generate tokens in parallel, in contrast to sequence-to-sequence models. It is pretrained for masked-token prediction with a transformer backbone.
Once pretrained, it undergoes supervised fine-tuning (SFT) on a dataset of prompt and response pairs. The responses are masked according to a probability distribution, and the loss is computed only on the masked predictions. At inference time, we start from a fully masked response and iteratively denoise the predicted tokens over a number of sampling steps.
In this project, we adapt LLaDA's supervised fine-tuning protocol for ASR and generate transcriptions for a reference speaker via the denoising process.
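The iterative unmasking at inference time can be sketched as below. This is a minimal illustration, not LLaDA's exact sampler: the `model` callable, the `MASK_ID` value, and the "reveal the most confident tokens each step" schedule are all assumptions for the sketch.

```python
import torch

MASK_ID = 103  # hypothetical [MASK] token id, outside the real vocabulary

@torch.no_grad()
def denoise(model, prompt, resp_len, steps=8):
    """Start from a fully masked response; each step, commit the most
    confident predictions among the still-masked positions."""
    resp = torch.full((1, resp_len), MASK_ID, dtype=torch.long)
    per_step = max(1, resp_len // steps)            # tokens revealed per step
    for _ in range(steps):
        logits = model(prompt, resp)                # [1, resp_len, vocab]
        conf, preds = logits.softmax(-1).max(-1)    # per-position confidence
        # only consider positions that are still masked
        conf = conf.masked_fill(resp != MASK_ID, -1.0)
        k = min(per_step, int((resp == MASK_ID).sum()))
        idx = conf.topk(k, dim=-1).indices
        resp[0, idx[0]] = preds[0, idx[0]]          # unmask the top-k tokens
    return resp
```

With enough steps, every position gets unmasked; fewer steps trade accuracy for speed, which is the parallel-decoding advantage over autoregressive models.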
This project has three parts:
- Contrastive pretraining for speaker identification
- SFT for conditional transcription from noisy audio
- Uncertainty quantification in generated transcription
We utilise audio features in place of prompts and transcripts in place of responses, and perform masked modelling. The architecture is as follows:
First, we train a transformer model to produce speaker-identity embeddings. We use contrastive learning to pull audio features of the same speaker closer together in Euclidean space and push features of different speakers apart with a margin of 1. We use a Siamese network to obtain anchor, positive, and negative features, as below.
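The margin-based objective above can be sketched as a standard triplet loss over Euclidean distances (equivalent to PyTorch's `torch.nn.TripletMarginLoss` with `margin=1.0`); the function and tensor names are illustrative, not the project's actual code:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull same-speaker embeddings together, push different speakers
    apart until they are separated by at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)   # same speaker
    d_neg = F.pairwise_distance(anchor, negative)   # different speaker
    return F.relu(d_pos - d_neg + margin).mean()
```

The loss is zero once every negative is at least `margin` farther from the anchor than its positive, so already-separated speakers stop contributing gradients.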
A noisy input is generated by sampling a signal from the reference speaker, sampling several other speakers and background sounds, and mixing them at different signal-to-noise ratios (-20 dB to 5 dB). The conditional unmasking algorithm takes this noisy input along with a speaker feature as a condition, so the network decodes and transcribes only what the reference speaker is saying. This is done as below.
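Mixing at a target SNR can be sketched as follows. This is a generic formulation (scale the summed interferers so the target-to-noise power ratio matches the requested dB value), not necessarily the exact routine used in this project:

```python
import torch

def mix_at_snr(target, interferers, snr_db):
    """Mix `target` with the sum of `interferers`, scaling the interference
    so that 10*log10(P_target / P_noise) == snr_db."""
    noise = sum(interferers)
    p_t = target.pow(2).mean()
    p_n = noise.pow(2).mean().clamp_min(1e-10)
    # solve 10*log10(p_t / (g^2 * p_n)) = snr_db for the gain g
    gain = torch.sqrt(p_t / (p_n * 10 ** (snr_db / 10)))
    return target + gain * noise
```

At -20 dB the interference is a hundred times more powerful than the reference speaker, which is what makes the speaker condition necessary for decoding.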
Given that the input audio features are a mixture of signals and noise, the output will also be noisy. Traditionally, the outputs are assumed to be
Additionally, we adapt the scaled loss in Algorithm 2 to optimise our network and show that transcription is possible using the diffusion protocol.
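A sketch of that scaled loss, in the spirit of LLaDA's Algorithm 2: cross-entropy is computed on masked positions only and divided by the sampled masking ratio `t`. The tensor shapes and names are assumptions for illustration, not the project's actual implementation:

```python
import torch
import torch.nn.functional as F

def sft_diffusion_loss(logits, targets, mask, t):
    """logits: [bs, L, vocab]; targets: [bs, L]; mask: [bs, L] (1 where
    masked); t: [bs] masking ratio sampled uniformly per example."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # average cross-entropy over the masked positions of each example
    per_ex = (ce * mask).sum(-1) / mask.sum(-1).clamp_min(1.0)
    # 1/t scaling: lightly-masked examples are upweighted per masked token
    return (per_ex / t).mean()
```

Unmasked positions contribute nothing, matching the SFT recipe where only masked response tokens are supervised.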
- The audio file is processed by Facebook's wav2vec to give `audio_features [bs, h, 768]`.
- These audio features are then processed for padding by randomly sampling `window_size` indices < `h` and sorting them to get `src [bs, window_size, 768]`. There are other ways to do this, but we chose it to keep the retained features temporally coherent while dropping the rest.
- The transcription is tokenized and padded to `transcription_length` to get `input_ids`.
- The `input_ids` are masked using a Bernoulli distribution to get `masked_ids`.
- The `src` and `masked_ids` are the inputs to the transformer model. The loss is computed between the predicted masked tokens and the ground truth.
- Gradient clipping is applied for stability.
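The sampling and masking steps above can be sketched as below; `MASK_ID`, the shapes, and the function names are illustrative assumptions, not the project's actual code:

```python
import torch

MASK_ID = 103  # hypothetical [MASK] token id

def sample_window(audio_features, window_size):
    """Randomly pick `window_size` frame indices and sort them: frames are
    dropped, but the survivors stay in temporal order."""
    bs, h, d = audio_features.shape
    idx = torch.randperm(h)[:window_size].sort().values
    return audio_features[:, idx, :]                # src: [bs, window_size, d]

def bernoulli_mask(input_ids, p=0.5):
    """Mask each token independently with probability p."""
    mask = torch.bernoulli(torch.full(input_ids.shape, p, dtype=torch.float)).bool()
    return input_ids.masked_fill(mask, MASK_ID), mask
```

Sorting the sampled indices is what preserves temporal coherence: the model still sees frames in their original order, just subsampled.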
| Component | Source | Description |
|---|---|---|
| Transformer Model | Self | Full implementation of Attention and Positional Encoding. |
| Masking Strategy | Self | Bernoulli-based masking for diffusion denoising. |
| Scaled Loss | Algorithm 2 | Adapted from LLaDA for Supervised Fine-Tuning (SFT). |
This project contains a self-contained Docker environment for training an ASR (Automatic Speech Recognition) diffusion model. All dependencies, code, and logs live inside the container, so no host mounting is required.
You can run training interactively and save metrics/loss plots as SVG images.
- Linux with an NVIDIA GPU, Docker, and the NVIDIA Container Toolkit
- `docker` and `docker-compose` installed
- CUDA 12.4 compatible GPU drivers
Run `docker compose build`. This will copy the mini dataset and its processed `.pt` files to the `/app` folder.
Run `docker compose run --rm asr`. You will get a bash prompt inside the container at `/app`. All code, logs, and checkpoints are inside the container.
Once inside the interactive container shell, to run training,
run `python3 train.py` from the `/app` prompt. To run inference, change the audio path, steps, and the checkpoint in `inference.py`, then run `python3 inference.py`.