-
Notifications
You must be signed in to change notification settings - Fork 29
Feature proposal: Ordered Base64 (OB64) encoding — 22-char sortable representation #54
Description
Summary
I'd like to propose adding an Ordered Base64 (OB64) encoding to python-ulid, which represents a ULID as a 22-character string that preserves lexicographic sort order — 4 characters shorter than the current Crockford Base32 representation (26 chars).
Motivation
Crockford Base32 is great, but 26 characters can be a real cost in high-density storage scenarios:
- Database columns: a
CHAR(22)vsCHAR(26)primary key saves ~15% in index size per row. At billions of rows this matters. - URLs / query strings: shorter IDs reduce payload size in APIs that embed IDs in paths or parameters.
- Log lines: if you emit ULID in every log line, shorter is better.
Standard Base64 / Base64url are the obvious candidates for a more compact encoding, but they are not lexicographically ordered — sorting Base64url strings does not sort the underlying binary values. This breaks range queries, cursor pagination, and any system that relies on ID sort order.
Design
OB64 solves this by choosing a 64-character alphabet that is in strict ascending ASCII order, so that string comparison == integer comparison.
Alphabet (64 chars, all URL-safe, strictly ascending ASCII):
- 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
(- = 45, 0–9 = 48–57, A–Z = 65–90, _ = 95, a–z = 97–122 — the only 64 printable, URL-safe ASCII characters that can be arranged in strict ascending order.)
Encoding: 128 bits packed MSB-first into 22 × 6-bit groups. The final character carries the 2 LSBs of the ULID + 4 zero-padding bits. Because the padding bits are always zero, sort order is preserved end-to-end.
Comparison table:
| Encoding | Length | Bits/char | Sorted? | URL-safe? |
|---|---|---|---|---|
| Crockford Base32 (current) | 26 chars | 5 | ✓ | ✓ |
| OB64 (proposed) | 22 chars | 6 | ✓ | ✓ |
| Standard Base64 | 24 chars + == |
6 | ✗ | ✗ |
| Base64url (RFC 4648 §5) | 22 chars | 6 | ✗ | ✓ |
Implementation sketch
ulid = ULID()
ulid.ob64 # → 22-char OB64 string, e.g. '-Oo99nS99brQCjbT_fkYTk'
ULID.from_ob64(s) # → round-trips back to the same ULIDThe implementation delegates to C-accelerated base64.b64encode + bytes.translate (alphabet remapping), so there is no pure-Python bit-manipulation loop. On Apple Silicon / Python 3.13:
- encode: ~3.9 M ops/s (vs ~8.5 M for raw
b64encode) - decode: ~2.6 M ops/s (vs ~4.4 M for raw
b64decode)
The overhead over raw stdlib is exactly the cost of the two additional C calls needed for alphabet remapping.
Questions for the maintainer
- Is this within scope for
python-ulid, or do you prefer to keep the library strictly aligned with the ULID spec (Crockford Base32 only)? - If in scope, would you prefer it as a method on
ULIDdirectly (ulid.ob64/ULID.from_ob64), or as a separate codec utility (e.g.ulid.codecs.ob64)? - Any concern about the
-character being first in the alphabet (all-zero ULID encodes as----------------------)?
I have a working implementation with tests, benchmark, and documentation ready — happy to open a PR if there's interest.