From base model to deployed reasoning agent - every LLM post-training technique, implemented and benchmarked.
A Python package implementing the complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). Includes evaluation benchmarks via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, designed to run on free Colab GPUs with QLoRA and Unsloth for memory-efficient training on Qwen2.5-3B.
```mermaid
graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]
```
```bash
# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final
```

For GPU training, install with the `train` and `unsloth` extras:

```bash
pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
```

Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.
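The `-c configs/sft.yaml` flag points at a YAML file that is parsed into a Pydantic config class from `alignrl.config`. A minimal sketch of that pattern, assuming hypothetical field names and defaults (the real `BaseTrainConfig` schema may differ):

```python
from pydantic import BaseModel, Field


class BaseTrainConfig(BaseModel):
    """Sketch of a typed training config; field names are illustrative."""

    base_model: str = "Qwen/Qwen2.5-3B"
    learning_rate: float = Field(2e-4, gt=0)
    max_steps: int = Field(500, gt=0)

    @classmethod
    def from_yaml(cls, path: str) -> "BaseTrainConfig":
        import yaml  # PyYAML; imported lazily, mirroring the package's convention

        with open(path) as f:
            raw = yaml.safe_load(f) or {}
        # Validation runs here, at construction time - a bad learning rate or
        # step count fails fast, long before any GPU time is spent.
        return cls(**raw)
```

A config with `learning_rate: -1` raises a `ValidationError` immediately on load, which is the point of validating at construction time rather than mid-training.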
All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.
| Benchmark | Metric | Base | SFT | GRPO | DPO |
|---|---|---|---|---|---|
| GSM8K | exact_match | 0.31 | 0.45 | **0.62** | 0.43 |
| MATH | exact_match | 0.12 | 0.18 | **0.29** | 0.17 |
| ARC-Challenge | acc_norm | 0.48 | 0.54 | 0.52 | **0.55** |
Key takeaways:
- GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
- DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
- SFT is a strong baseline - consistent improvement across all benchmarks before any RL
| Module | Purpose | Key Class |
|---|---|---|
| `alignrl.sft` | Supervised Fine-Tuning with QLoRA | `SFTRunner` |
| `alignrl.grpo` | RL with verifiable math rewards | `GRPORunner` |
| `alignrl.dpo` | Direct Preference Optimization | `DPORunner` |
| `alignrl.eval` | Benchmark evaluation harness | `EvalRunner` |
| `alignrl.inference` | Multi-backend model serving | `ModelServer` |
| `alignrl.rewards` | Math reward verifiers for GRPO | `math_verify_reward` |
| `alignrl.demo` | Gradio comparison UI | `create_demo` |
| `alignrl.cli` | CLI entry point (train, eval, serve) | `main` |
| `alignrl.config` | Pydantic-validated training configs | `BaseTrainConfig` |
| `alignrl.types` | Shared protocols and result types | `Trainer`, `TrainResult`, `EvalResult` |
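The `alignrl.rewards` module is what makes GRPO's rewards "verifiable": a reward is computed by checking the model's final answer against a known-correct one, rather than by a learned reward model. A sketch of what such a verifier might look like; the function name `math_verify_reward` comes from the table above, but the answer-extraction logic here is illustrative, not the package's actual implementation:

```python
import re


def math_verify_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches gold_answer, else 0.0.

    Looks for a GSM8K-style '#### <answer>' marker first, then falls back
    to the last number in the completion. Illustrative sketch only.
    """
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match:
        predicted = match.group(1)
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if not numbers:
            return 0.0
        predicted = numbers[-1]
    # Normalize thousands separators and compare numerically where possible.
    try:
        return 1.0 if float(predicted.replace(",", "")) == float(
            gold_answer.replace(",", "")
        ) else 0.0
    except ValueError:
        return 1.0 if predicted.strip() == gold_answer.strip() else 0.0
```

Because the reward is a pure function of (completion, gold answer), GRPO can score a whole group of sampled completions per prompt and compute relative advantages without any reward model.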
The codebase follows a few core design decisions:
- **Pydantic configs** - Every training stage uses a typed config class inheriting from `BaseTrainConfig`, loadable from YAML files. Validation happens at construction time, not at training time.
- **Common `Trainer` protocol** - `SFTRunner`, `GRPORunner`, and `DPORunner` all implement the `Trainer` protocol (`train()`, `save()`, `load()`), making them interchangeable in pipelines and tests.
- **Lazy imports** - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
- **Unsloth for speed** - All training uses Unsloth's `FastLanguageModel` with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16 GB).
- **Structured results** - Training returns `TrainResult`, evaluation returns `EvalResult`. Both are frozen dataclasses that serialize to JSON for the results dashboard.
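The protocol-plus-frozen-dataclass pattern described above can be sketched in a few lines. The method names (`train`, `save`, `load`) come from the source; the `TrainResult` fields and the `DummyRunner` stand-in are illustrative assumptions, not the package's exact schema:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class TrainResult:
    # Field names here are illustrative, not alignrl's actual schema.
    stage: str
    output_dir: str
    final_loss: float


@runtime_checkable
class Trainer(Protocol):
    """Structural protocol that SFTRunner, GRPORunner, and DPORunner satisfy."""

    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


class DummyRunner:
    """Stand-in runner showing how any Trainer-shaped class slots in for tests."""

    def train(self) -> TrainResult:
        return TrainResult(stage="sft", output_dir="./out", final_loss=0.42)

    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


def run_stage(runner: Trainer) -> TrainResult:
    # Accepts anything implementing the protocol - no inheritance required.
    return runner.train()
```

Because `Trainer` is a structural (duck-typed) protocol, test doubles like `DummyRunner` satisfy it without importing torch or transformers, which is what keeps the test suite fast.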
```
alignrl/
  configs/          # YAML configs for each training stage
  docs/             # GitHub Pages results dashboard
  notebooks/        # Colab-ready Jupyter notebooks
  results/          # Benchmark JSON (consumed by dashboard)
  src/alignrl/      # Package source
  tests/            # 49 unit tests (pytest)
  pyproject.toml    # Hatchling build, optional dependency groups
```
| Category | Tools |
|---|---|
| Training | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness |
| Inference | vLLM, MLX-LM, Unsloth |
| Demo | Gradio |
| Config | Pydantic, PyYAML |
| Quality | Ruff, mypy, pytest |