Performance: ByteTables + targeted byte-walking for hot parse paths by cpakman · Pull Request #2070 · Shopify/liquid

cpakman · 2026-04-05T19:35:36Z

Summary

Add pre-computed byte lookup tables (ByteTables) and apply byte-walking to three hot parsing paths, replacing regex matching and StringScanner usage in the most frequently called parse methods.

+208 / -61 across 5 files (net +147 lines of production code).

Results (Ruby 4.0.2 + YJIT, theme benchmark)

| Metric | main | this branch | Δ |
| --- | --- | --- | --- |
| Parse | 6,426µs | 5,529µs | -14% |
| Render | 1,557µs | 1,516µs | -3% |
| Combined | 7,983µs | 7,045µs | -12% |
| Allocations | 60,811 | 51,620 | -15% |

What changed

1. `ByteTables` module (new, 44 lines)

Pre-computed 256-entry boolean arrays for byte classification: IDENT_START, IDENT_CONT, WORD, DIGIT, WHITESPACE. A single array index (TABLE[byte]) replaces 3-5 chained comparisons per byte check. Built once at load time, frozen.
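As a rough illustration of the table idea, the sketch below builds frozen 256-entry boolean arrays at load time. The module name `ByteTablesSketch`, the `build` helper, and the exact membership of each table are assumptions made for this example; in particular, `IDENT_CONT` including the hyphen is inferred from the review notes, not taken from the actual module.

```ruby
# Hypothetical sketch of 256-entry byte classification tables.
# Names and table contents are assumptions, not the PR's actual code.
module ByteTablesSketch
  # Build a frozen 256-entry boolean array; the block decides membership per byte.
  def self.build
    Array.new(256) { |byte| yield(byte) ? true : false }.freeze
  end

  DIGIT       = build { |b| b >= 48 && b <= 57 } # "0".."9"
  IDENT_START = build { |b| (b >= 97 && b <= 122) || (b >= 65 && b <= 90) || b == 95 } # a-z, A-Z, _
  WORD        = build { |b| IDENT_START[b] || DIGIT[b] } # ASCII \w
  IDENT_CONT  = build { |b| WORD[b] || b == 45 }         # assumed: WORD plus "-"
  # NUL, \t \n \v \f \r, and space (the PR notes that String#strip covers the null byte)
  WHITESPACE  = build { |b| b == 0 || (b >= 9 && b <= 13) || b == 32 }
end

byte = "7".getbyte(0)
ByteTablesSketch::DIGIT[byte] # => true, one array index instead of a range comparison
```

Plain arrays keep each lookup to a single index with no hashing and no method dispatch beyond `Array#[]`.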

2. `Expression.parse_number` — byte-walk instead of regex + StringScanner

Replaced INTEGER_REGEX/FLOAT_REGEX matching and the StringScanner byte loop with a single forward pass using ByteTables::DIGIT. Avoids MatchData allocation and StringScanner reset per call. Handles all edge cases: negative numbers, multi-dot floats (1.2.3.4), trailing dots (123.), and rejects trailing non-numeric bytes (1.2.3a).
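A minimal sketch of such a single forward pass follows, written to match the edge cases listed above. The method name `parse_number_sketch`, the constants, and the choice to return `false` on rejection are assumptions for illustration; the PR's actual method may be structured differently.

```ruby
# Hypothetical single-pass number parser; not the PR's actual implementation.
DIGIT_TABLE = Array.new(256) { |b| b >= 48 && b <= 57 }.freeze
DASH_BYTE = 45 # "-"
DOT_BYTE  = 46 # "."

# Returns an Integer or Float, or false when the markup is not a number.
def parse_number_sketch(markup)
  len = markup.bytesize
  pos = 0
  pos += 1 if markup.getbyte(0) == DASH_BYTE

  # At least one digit must follow the optional sign.
  first = markup.getbyte(pos)
  return false unless first && DIGIT_TABLE[first]

  first_dot = nil # byte offset of the first "."
  num_end   = nil # byte offset of the second ".", where the number is cut

  while pos < len
    byte = markup.getbyte(pos)
    if byte == DOT_BYTE
      if first_dot.nil?
        first_dot = pos
      elsif num_end.nil?
        num_end = pos
      end
    elsif !DIGIT_TABLE[byte]
      return false # trailing non-numeric byte, e.g. "1.2.3a"
    end
    pos += 1
  end

  return Integer(markup, 10) if first_dot.nil?

  num_end ||= len
  # "123." has no fractional digits, so cut before the dot.
  num_end = first_dot if num_end == first_dot + 1
  markup.byteslice(0, num_end).to_f
end

parse_number_sketch("42")      # => 42
parse_number_sketch("1.2.3.4") # => 1.2
parse_number_sketch("123.")    # => 123.0
parse_number_sketch("1.2.3a")  # => false
```

The loop keeps scanning after the second dot only to reject trailing non-numeric bytes; the numeric prefix itself is fixed once `num_end` is set.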

Also guards Expression.parse's String#strip call — only allocates when whitespace is actually present (~4,464 avoided allocations per compile).
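The strip guard can be sketched as a check of the first and last byte before calling `String#strip`. The helper name `strip_if_needed` and the `STRIP_WS` table are hypothetical; the table membership follows the whitespace set described for `String#strip` (including the null byte the review mentions).

```ruby
# Hypothetical strip guard; names are assumptions for illustration.
STRIP_WS = Array.new(256) { |b| b == 0 || (b >= 9 && b <= 13) || b == 32 }.freeze

# Returns the same String object when there is nothing to strip,
# and a stripped copy otherwise.
def strip_if_needed(markup)
  return markup if markup.empty?

  first = markup.getbyte(0)
  last  = markup.getbyte(-1)
  return markup unless STRIP_WS[first] || STRIP_WS[last]

  markup.strip
end

s = "product.title"
strip_if_needed(s).equal?(s)   # => true, no new String allocated
strip_if_needed("  total  ")   # => "total"
```

Only the two boundary bytes need checking, because `strip` never touches interior whitespace.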

3. `VariableLookup` — fast path for simple identifier chains

SIMPLE_LOOKUP_RE validates that input is a plain a.b.c chain (no brackets, no quotes — ~90% of real-world lookups). On match, byte-walks on dots instead of invoking the recursive VariableParser regex (/\[(?>[^\[\]]+|\g<0>)*\]|[\w-]+\??/). Falls through to original path for complex inputs.
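To make the two-step shape concrete (validate with `match?`, then split on dots by walking bytes), here is a small sketch. The pattern `SIMPLE_LOOKUP_SKETCH_RE` is a guess at what a "plain a.b.c chain" validator could look like based on the accept/reject cases in the test summary; it is not the PR's `SIMPLE_LOOKUP_RE`, and `split_simple_lookup` is a hypothetical helper.

```ruby
# Hypothetical fast path for simple dotted lookups; pattern and names are assumptions.
SIMPLE_LOOKUP_SKETCH_RE = /\A[\w-]+\??(?:\.[\w-]+\??)*\z/
LOOKUP_DOT = 46 # "."

# Returns the dot-separated segments, or nil when the input is not a
# simple chain and the caller should fall back to the full parser.
def split_simple_lookup(markup)
  return nil unless SIMPLE_LOOKUP_SKETCH_RE.match?(markup) # match? allocates no MatchData

  segments = []
  start = 0
  pos = 0
  len = markup.bytesize
  while pos < len
    if markup.getbyte(pos) == LOOKUP_DOT
      segments << markup.byteslice(start, pos - start)
      start = pos + 1
    end
    pos += 1
  end
  segments << markup.byteslice(start, len - start)
  segments
end

split_simple_lookup("product.title") # => ["product", "title"]
split_simple_lookup("items[0].name") # => nil (falls through to the regex path)
```

Validation stays in the regex engine because, per the review notes, `match?` followed by a byte-walk measured faster than validating byte by byte in Ruby.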

4. `BlockBody.try_parse_tag_token` — byte-walk tag tokens

Parses {%...%} tokens using getbyte/byteslice + ByteTables instead of the FullToken regex with 4 capture groups. Allocates only the 2 strings needed (tag_name, markup) vs 4+ from regex captures. Uses ByteTables::WORD (no hyphen) for tag name scanning, matching TagName = /#|\w+/ exactly. Falls back to FullToken regex on nil.
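The following sketch shows one way a tag token could be walked with `getbyte`/`byteslice`. Everything here is an assumption made for illustration: the method name, the local tables, and the `[tag_name, markup, newlines]` return shape. It returns `nil` for anything it does not recognise so that a caller can fall back to the regex.

```ruby
# Hypothetical byte-walking tag tokenizer; not the PR's actual method.
TAG_WORD = Array.new(256) { |b|
  (b >= 48 && b <= 57) || (b >= 65 && b <= 90) || (b >= 97 && b <= 122) || b == 95
}.freeze
TAG_WS = Array.new(256) { |b| (b >= 9 && b <= 13) || b == 32 }.freeze # regex \s
TAG_DASH = 45
TAG_HASH = 35
TAG_NEWLINE = 10

# Returns [tag_name, markup, newlines_skipped] or nil when the token is malformed.
def try_parse_tag_token_sketch(token)
  len = token.bytesize
  return nil if len < 4 || !token.start_with?("{%") || !token.end_with?("%}")

  pos = 2
  stop = len - 2 # exclusive end of the tag body
  pos += 1 if token.getbyte(pos) == TAG_DASH                      # "{%-"
  stop -= 1 if stop > pos && token.getbyte(stop - 1) == TAG_DASH  # "-%}"

  newlines = 0
  while pos < stop && TAG_WS[(byte = token.getbyte(pos))]
    newlines += 1 if byte == TAG_NEWLINE
    pos += 1
  end

  # Tag name is either "#" or a run of word bytes (no hyphen).
  name_start = pos
  if pos < stop && token.getbyte(pos) == TAG_HASH
    pos += 1
  else
    pos += 1 while pos < stop && TAG_WORD[token.getbyte(pos)]
  end
  return nil if pos == name_start

  tag_name = token.byteslice(name_start, pos - name_start)

  while pos < stop && TAG_WS[(byte = token.getbyte(pos))]
    newlines += 1 if byte == TAG_NEWLINE
    pos += 1
  end

  [tag_name, token.byteslice(pos, stop - pos), newlines]
end

try_parse_tag_token_sketch("{% if x %}")           # => ["if", "x ", 0]
try_parse_tag_token_sketch("{%- assign a = 1 -%}") # => ["assign", "a = 1 ", 0]
try_parse_tag_token_sketch("{% %}")                # => nil
```

A caller would use the result when it is non-nil and otherwise run the existing regex match, which is how the fallback described above keeps malformed tokens on the original code path.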

Design principles

- Every fast path has a regex fallback: if the byte-walker returns nil/false, the original regex path runs, so inputs the fast path declines are parsed exactly as before.
- No new abstractions: no Cursor class, no parallel parse pipeline. Just targeted byte-walking at three specific call sites.
- Each commit is independently revertable: one file per optimization commit.

Review process

This was developed iteratively with multi-agent code review covering:

- Correctness: Found and fixed 3 issues (hyphen in tag names via IDENT_CONT, ? suffix in tag names, multi-dot trailing alpha in parse_number)
- Security: Found and fixed null byte gap in WHITESPACE table vs String#strip
- Performance: Confirmed match? + byte-walk is near-optimal (pure byte-walk is 2.8× slower than regex for validation)
- Simplicity: Named constants, structural comments, consistent patterns across all three sites
- Rubocop: 0 offenses

ByteTables provides pre-computed 256-entry boolean lookup arrays for byte classification (IDENT_START, IDENT_CONT, WORD, DIGIT, WHITESPACE) and named constants for delimiter bytes (NEWLINE, DASH, DOT, HASH). bench_quick.rb measures parse µs, render µs, and object allocations for the theme benchmark suite.
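For readers who want to reproduce this kind of measurement, the sketch below times a block and counts object allocations using only the Ruby standard runtime. It is not `bench_quick.rb`; the helper name `measure_sketch` and its return shape are assumptions.

```ruby
# Hypothetical measurement helper; bench_quick.rb itself may work differently.
def measure_sketch(iterations: 10)
  GC.start # start from a settled heap so the run is less noisy
  allocs_before = GC.stat(:total_allocated_objects)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)

  iterations.times { yield }

  t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
  allocs = GC.stat(:total_allocated_objects) - allocs_before

  {
    micros_per_iter: (t1 - t0) / iterations.to_f,
    allocations_per_iter: allocs / iterations,
  }
end

result = measure_sketch(iterations: 100) { Array.new(3) { |i| i.to_s } }
result[:allocations_per_iter] # objects allocated per call, averaged over the run
```

With Liquid loaded, the block would wrap something like a template parse or render call; the numbers depend on the Ruby version and whether YJIT is enabled.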

parse_number: replace INTEGER_REGEX/FLOAT_REGEX matching and StringScanner loop with a single byte-walking pass using ByteTables::DIGIT. Avoids MatchData allocation and StringScanner reset on every call.

Expression.parse: only call String#strip when leading/trailing whitespace is actually present (checked via ByteTables::WHITESPACE). Avoids allocating a new String on ~4,464 calls per compile.

Skip the expensive recursive VariableParser regex for simple lookups like 'product.title' (~90% of real-world cases). SIMPLE_LOOKUP_RE validates the input is a plain a.b.c chain (no brackets, no quotes). On match, byte-walks on dots to split segments instead of invoking the regex engine. Falls through to the original VariableParser scan for complex inputs.

Add try_parse_tag_token that parses {%...%} tag tokens using getbyte/byteslice + ByteTables lookup arrays instead of the FullToken regex with 4 capture groups. Allocates only the 2 strings needed (tag_name, markup) vs 4+ from regex captures. Uses ByteTables::WORD (no hyphen) for tag name scanning, matching TagName = /#|\w+/ exactly. Falls back to FullToken regex when the fast path returns nil.

36 tests covering the three optimization sites:

Expression.parse_number (13 tests):
- Simple integers, negatives, floats, trailing dots
- Multi-dot truncation (1.2.3 → 1.2)
- Rejection of non-numeric input and trailing alpha (1.2.3a)

Expression.parse strip guard (7 tests):
- Leading, trailing, both-sides whitespace
- No-strip-needed case (no allocation)
- Null byte stripping (matches String#strip behavior)

VariableLookup.simple_lookup? (8 tests):
- Accepts: single names, dotted chains, question marks, hyphens
- Rejects: brackets, empty, leading/trailing dots, double dots

VariableLookup fast path equivalence (7 tests):
- name/lookups/command_flags match for simple and deep chains
- Bracket inputs fall through to regex path correctly

BlockBody.try_parse_tag_token (10 tests):
- Simple tags, whitespace control variants ({%-, -%}, both)
- No-markup tags, hash comments, newline counting
- Hyphenated names stop at hyphen (matching TagName = /\w+/)
- Malformed tokens return nil (fallback to FullToken regex)

cpakman force-pushed the liquid-perf-bytetables branch 2 times, most recently from 2377573 to 1f0a6c1 (April 5, 2026 19:52)

cpakman added 4 commits April 5, 2026 20:53

cpakman force-pushed the liquid-perf-bytetables branch from 1f0a6c1 to d3e3952 (April 6, 2026 03:54)

cpakman wants to merge 5 commits into main from liquid-perf-bytetables
