From base model to deployed reasoning agent - every LLM post-training technique, implemented and benchmarked.
A Python package implementing the complete LLM post-training pipeline: Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO) with verifiable math rewards, and Direct Preference Optimization (DPO). Includes evaluation benchmarks via lm-evaluation-harness and multi-backend inference serving (Unsloth, vLLM, MLX). Built for learning and demonstration, designed to run on free Colab GPUs with QLoRA and Unsloth for memory-efficient training on Qwen2.5-3B.
```mermaid
graph LR
    A[Qwen2.5-3B<br/>Base Model] --> B[SFT<br/>Instruction Following]
    B --> C[GRPO<br/>Math Reasoning via RL]
    B --> D[DPO<br/>Preference Alignment]
    C --> E[Evaluation<br/>GSM8K, MATH, ARC]
    D --> E
    E --> F[Inference<br/>Unsloth / vLLM / MLX]
```
```bash
# Install
pip install git+https://github.com/sacredvoid/alignrl.git

# Train (SFT as an example)
alignrl train sft -c configs/sft.yaml

# Evaluate
alignrl eval --adapter ./outputs/sft/final --stage sft

# Launch comparison demo
alignrl serve --stages base sft=./outputs/sft/final grpo=./outputs/grpo/final
```

For GPU training, install with the `train` and `unsloth` extras:

```bash
pip install "alignrl[train,unsloth] @ git+https://github.com/sacredvoid/alignrl.git"
```

Each notebook is self-contained and runs end-to-end on a free Colab T4 GPU.
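The `-c configs/sft.yaml` flag points at a YAML file that is parsed into a Pydantic config class from `alignrl.config`. A minimal sketch of that pattern, assuming hypothetical field names and defaults (the real `BaseTrainConfig` schema may differ):

```python
from pydantic import BaseModel, Field


class BaseTrainConfig(BaseModel):
    """Sketch of a typed training config; field names are illustrative."""

    base_model: str = "Qwen/Qwen2.5-3B"
    learning_rate: float = Field(2e-4, gt=0)
    max_steps: int = Field(500, gt=0)

    @classmethod
    def from_yaml(cls, path: str) -> "BaseTrainConfig":
        import yaml  # PyYAML; imported lazily, mirroring the package's convention

        with open(path) as f:
            raw = yaml.safe_load(f) or {}
        # Validation runs here, at construction time - a bad learning rate or
        # step count fails fast, long before any GPU time is spent.
        return cls(**raw)
```

A config with `learning_rate: -1` raises a `ValidationError` immediately on load, which is the point of validating at construction time rather than mid-training.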
All evaluations run on Qwen2.5-3B with QLoRA adapters. Best score per benchmark in bold.
| Benchmark | Metric | Base | SFT | GRPO | DPO |
|---|---|---|---|---|---|
| GSM8K | exact_match | 0.31 | 0.45 | **0.62** | 0.43 |
| MATH | exact_match | 0.12 | 0.18 | **0.29** | 0.17 |
| ARC-Challenge | acc_norm | 0.48 | 0.54 | 0.52 | **0.55** |
Key takeaways:
- GRPO dominates math reasoning - GSM8K jumps from 31% to 62% (2x), MATH from 12% to 29% (2.4x)
- DPO edges out on general reasoning - ARC-Challenge best at 55%, suggesting preference alignment improves broad task quality
- SFT is a strong baseline - consistent improvement across all benchmarks before any RL
| Module | Purpose | Key Class |
|---|---|---|
| `alignrl.sft` | Supervised Fine-Tuning with QLoRA | `SFTRunner` |
| `alignrl.grpo` | RL with verifiable math rewards | `GRPORunner` |
| `alignrl.dpo` | Direct Preference Optimization | `DPORunner` |
| `alignrl.eval` | Benchmark evaluation harness | `EvalRunner` |
| `alignrl.inference` | Multi-backend model serving | `ModelServer` |
| `alignrl.rewards` | Math reward verifiers for GRPO | `math_verify_reward` |
| `alignrl.demo` | Gradio comparison UI | `create_demo` |
| `alignrl.cli` | CLI entry point (train, eval, serve) | `main` |
| `alignrl.config` | Pydantic-validated training configs | `BaseTrainConfig` |
| `alignrl.types` | Shared protocols and result types | `Trainer`, `TrainResult`, `EvalResult` |
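The `alignrl.rewards` module is what makes GRPO's rewards "verifiable": a reward is computed by checking the model's final answer against a known-correct one, rather than by a learned reward model. A sketch of what such a verifier might look like; the function name `math_verify_reward` comes from the table above, but the answer-extraction logic here is illustrative, not the package's actual implementation:

```python
import re


def math_verify_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches gold_answer, else 0.0.

    Looks for a GSM8K-style '#### <answer>' marker first, then falls back
    to the last number in the completion. Illustrative sketch only.
    """
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match:
        predicted = match.group(1)
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if not numbers:
            return 0.0
        predicted = numbers[-1]
    # Normalize thousands separators and compare numerically where possible.
    try:
        return 1.0 if float(predicted.replace(",", "")) == float(
            gold_answer.replace(",", "")
        ) else 0.0
    except ValueError:
        return 1.0 if predicted.strip() == gold_answer.strip() else 0.0
```

Because the reward is a pure function of (completion, gold answer), GRPO can score a whole group of sampled completions per prompt and compute relative advantages without any reward model.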
The codebase follows a few core design decisions:
- **Pydantic configs** - Every training stage uses a typed config class inheriting from `BaseTrainConfig`, loadable from YAML files. Validation happens at construction time, not at training time.
- **Common `Trainer` protocol** - `SFTRunner`, `GRPORunner`, and `DPORunner` all implement the `Trainer` protocol (`train()`, `save()`, `load()`), making them interchangeable in pipelines and tests.
- **Lazy imports** - Heavy dependencies (torch, transformers, unsloth, vllm, mlx-lm) are imported inside methods, not at module level. The base package installs in seconds with just pydantic and pyyaml.
- **Unsloth for speed** - All training uses Unsloth's `FastLanguageModel` with gradient checkpointing, cutting VRAM usage roughly in half compared to vanilla transformers. Fits Qwen2.5-3B training on a free Colab T4 (16 GB).
- **Structured results** - Training returns `TrainResult`, evaluation returns `EvalResult`. Both are frozen dataclasses that serialize to JSON for the results dashboard.
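The protocol-plus-frozen-dataclass pattern described above can be sketched in a few lines. The method names (`train`, `save`, `load`) come from the source; the `TrainResult` fields and the `DummyRunner` stand-in are illustrative assumptions, not the package's exact schema:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class TrainResult:
    # Field names here are illustrative, not alignrl's actual schema.
    stage: str
    output_dir: str
    final_loss: float


@runtime_checkable
class Trainer(Protocol):
    """Structural protocol that SFTRunner, GRPORunner, and DPORunner satisfy."""

    def train(self) -> TrainResult: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


class DummyRunner:
    """Stand-in runner showing how any Trainer-shaped class slots in for tests."""

    def train(self) -> TrainResult:
        return TrainResult(stage="sft", output_dir="./out", final_loss=0.42)

    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...


def run_stage(runner: Trainer) -> TrainResult:
    # Accepts anything implementing the protocol - no inheritance required.
    return runner.train()
```

Because `Trainer` is a structural (duck-typed) protocol, test doubles like `DummyRunner` satisfy it without importing torch or transformers, which is what keeps the test suite fast.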
```
alignrl/
  configs/          # YAML configs for each training stage
  docs/             # GitHub Pages results dashboard
  notebooks/        # Colab-ready Jupyter notebooks
  results/          # Benchmark JSON (consumed by dashboard)
  src/alignrl/      # Package source
  tests/            # 49 unit tests (pytest)
  pyproject.toml    # Hatchling build, optional dependency groups
```
| Category | Tools |
|---|---|
| Training | TRL, Unsloth, PEFT, bitsandbytes |
| Evaluation | lm-evaluation-harness |
| Inference | vLLM, MLX-LM, Unsloth |
| Demo | Gradio |
| Config | Pydantic, PyYAML |
| Quality | Ruff, mypy, pytest |