Feature proposal: Ordered Base64 (OB64) encoding — 22-char sortable representation #54

@hongqn

Description

Summary

I'd like to propose adding an Ordered Base64 (OB64) encoding to python-ulid, which represents a ULID as a 22-character string that preserves lexicographic sort order — 4 characters shorter than the current Crockford Base32 representation (26 chars).

Motivation

Crockford Base32 is great, but 26 characters can be a real cost in high-density storage scenarios:

  • Database columns: a CHAR(22) primary key is ~15% smaller than CHAR(26), and the saving is paid again in every secondary index that carries the key. At billions of rows this matters.
  • URLs / query strings: shorter IDs reduce payload size in APIs that embed IDs in paths or parameters.
  • Log lines: if you emit ULID in every log line, shorter is better.

Standard Base64 / Base64url are the obvious candidates for a more compact encoding, but they are not lexicographically ordered — sorting Base64url strings does not sort the underlying binary values. This breaks range queries, cursor pagination, and any system that relies on ID sort order.
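A quick stdlib-only illustration of the problem (the byte values here are arbitrary, chosen to trigger the inversion): Base64 assigns the 6-bit value 52 to '0' (ASCII 48) but the value 0 to 'A' (ASCII 65), so encoded strings can sort in the opposite order of the underlying bytes.

```python
import base64

# Two 16-byte values; as raw bytes, a sorts strictly before b.
a = bytes(16)                    # 0x00 0x00 ... 0x00
b = bytes([0xD0]) + bytes(15)    # 0xD0 0x00 ... 0x00

ea = base64.urlsafe_b64encode(a).decode()  # starts with 'A' (6-bit value 0)
eb = base64.urlsafe_b64encode(b).decode()  # starts with '0' (6-bit value 52)

assert a < b      # byte order
assert ea > eb    # string order — inverted!
```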

Design

OB64 solves this by choosing a 64-character alphabet that is in strict ascending ASCII order, so that byte-wise string comparison gives the same result as 128-bit integer comparison.

Alphabet (64 chars, all URL-safe, strictly ascending ASCII):

- 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z

(- = 45, 0–9 = 48–57, A–Z = 65–90, _ = 95, a–z = 97–122. These are 64 of the 66 unreserved URI characters from RFC 3986, which are already in strict ascending ASCII order; only . and ~ are left out.)
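A sanity check (stdlib only) that this alphabet really is 64 characters in strictly ascending ASCII order:

```python
import string

# The proposed OB64 alphabet: '-', digits, uppercase, '_', lowercase.
OB64_ALPHABET = "-0123456789" + string.ascii_uppercase + "_" + string.ascii_lowercase

assert len(OB64_ALPHABET) == 64
# Strictly ascending: every code point exceeds its predecessor's.
assert all(ord(x) < ord(y) for x, y in zip(OB64_ALPHABET, OB64_ALPHABET[1:]))
```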

Encoding: 128 bits packed MSB-first into 22 × 6-bit groups. The final character carries the 2 LSBs of the ULID + 4 zero-padding bits. Because the padding bits are always zero, sort order is preserved end-to-end.

Comparison table:

| Encoding                   | Length                      | Bits/char | Sorted? | URL-safe? |
|----------------------------|-----------------------------|-----------|---------|-----------|
| Crockford Base32 (current) | 26 chars                    | 5         | yes     | yes       |
| OB64 (proposed)            | 22 chars                    | 6         | yes     | yes       |
| Standard Base64            | 24 chars (incl. `==` padding) | 6       | no      | no        |
| Base64url (RFC 4648 §5)    | 22 chars                    | 6         | no      | yes       |

Implementation sketch

ulid = ULID()
ulid.ob64               # → 22-char OB64 string, e.g. '-Oo99nS99brQCjbT_fkYTk'
ULID.from_ob64(s)       # → round-trips back to the same ULID

The implementation delegates to C-accelerated base64.b64encode + bytes.translate (alphabet remapping), so there is no pure-Python bit-manipulation loop. On Apple Silicon / Python 3.13:

  • encode: ~3.9 M ops/s (vs ~8.5 M for raw b64encode)
  • decode: ~2.6 M ops/s (vs ~4.4 M for raw b64decode)

The overhead over the raw stdlib calls is essentially the cost of the extra bytes.translate remapping, plus stripping and restoring the == padding.

Questions for the maintainer

  1. Is this within scope for python-ulid, or do you prefer to keep the library strictly aligned with the ULID spec (Crockford Base32 only)?
  2. If in scope, would you prefer it as a method on ULID directly (ulid.ob64 / ULID.from_ob64), or as a separate codec utility (e.g. ulid.codecs.ob64)?
  3. Any concern about the - character being first in the alphabet (all-zero ULID encodes as ----------------------)?

I have a working implementation with tests, benchmark, and documentation ready — happy to open a PR if there's interest.
