DNA as binary. 2 bits per base. 4x compression. 90 million bases per second.
Like ripgrep, but for genomes.
cargo install --path .# Search for a DNA pattern across chromosome 17 (81M bases in <1 second)
bw grep GGTGGCGTAGGC chr17.fa
# Search for KRAS G12V mutation hotspot
bw grep GTTGGAGCTGGT cancer-genes/fasta/KRAS.fasta
# Pack a FASTA file to 2-bit binary (4x smaller)
bw pack genome.fasta > genome.bw
# Count mutations between a reference and tumor sample
bw count reference.fasta tumor.fasta
# Show compression statistics
bw stats genome.fastaSearch any FASTA file for a DNA sequence pattern. Word-parallel matching compares 32 bases per CPU cycle.
$ bw grep AACAACGTTCTG cancer-genes/fasta/TP53.fasta
TP53 exon5 Homo sapiens:11:CTGAAAACAACGTTCTGGTAAG
1 match(es) found$ bw grep GGGAGGG cancer-genes/fasta/BRCA1.fasta
BRCA1 DNA repair associated:2847:CTGGGAGGGCTAT
BRCA1 DNA repair associated:4291:AAGGGAGGGATTT
2 match(es) foundOutput format: header:position:context -- same as ripgrep's file:line:match.
Convert any FASTA file to 2-bit packed binary. 4 bases per byte.
$ bw pack genome.fasta > genome.bwThe .bw format:
[magic: 0x4257 "BW"] [seq_length: u32] [header_length: u16] [header] [packed_bases]
Each byte holds 4 bases, MSB first: ACGT = 0b00_01_10_11.
Compare two sequences and count mutations via XOR. One CPU instruction per 32 bases.
$ bw count reference.fasta tumor.fasta
Record 1: TP53_ref vs TP53_tumor | 3 mutations in 2512 bases (0.1194%)$ bw stats cancer-genes/fasta/BRCA2.fasta
BRCA2 DNA repair associated: 11954 bases → 2989 bytes (4.0x compression, ratio 0.2500)The human genome:
| Chromosome | Bases | Bitwise | Compression |
|---|---|---|---|
| chr1 | 248,956,422 | 59 MB | 4.0x |
| chr2 | 242,193,529 | 58 MB | 4.0x |
| ... | ... | ... | 4.0x |
| Total | 3,088,286,401 | 736 MB | 3.9x |
A = 00 C = 01 G = 10 T = 11
ATGCTAGC → 00 11 10 01 11 00 10 01 → 0xE5C9 (2 bytes for 8 bases)
4 bases per byte. 32 bases per u64 word. Zero overhead.
The encoding IS the data. No wrapper, no header beyond the magic bytes. Per the zero_overhead theorem: storage cost = data content.
Pre-converted genomes available at:
forkjoin-ai/bitwise-genomes on HuggingFace
- Cancer gene panel: 20 clinically important genes (TP53, BRCA1, BRCA2, KRAS, EGFR, ...)
- hg38 human reference: full genome, 25 chromosomes, 736 MB
# Download the full human genome in Bitwise format
huggingface-cli download forkjoin-ai/bitwise-genomes hg38/
# Download just the cancer gene panel
huggingface-cli download forkjoin-ai/bitwise-genomes cancer-genes/Bitwise sequences stream natively as Aeon FlowFrames -- the self-describing 10-byte wire protocol:
stream_id = chromosome (1-25)
sequence = genomic position
flags = FORK (replication origin) | FOLD (gene boundary) | VENT (repeat)
payload = 2-bit packed bases
Wire = storage = memory. No serialization boundary. Proved by flowframe_self_describing theorem.
Compiles to WebAssembly for browser and edge deployment.
import { pack_bases, search_packed, mutation_count } from 'bitwise';
// Pack a sequence
const packed = pack_bases(new TextEncoder().encode('ATGCTAGCATGC'));
// Search (32 bases per cycle)
const needle = pack_bases(new TextEncoder().encode('TAGC'));
const matches = search_packed(packed, 12, needle, 4);
// → [4] (found at position 4)
// Count mutations between two sequences
const mutations = mutation_count(ref_packed, tumor_packed, length);Used by helix.repair for real-time DNA topology search.
| Operation | Speed | Method |
|---|---|---|
| Pack | ~500 MB/s | Sequential byte write |
| Search | 90M bases/sec | Word-parallel XOR, 32 bases per cycle |
| Mutation count | 90M bases/sec | XOR + popcount |
| Compression | 4.0x | 2 bits vs 8 bits per base |
Benchmarked on Apple M3 Pro, single core. No SIMD yet -- pure scalar.
Every operation is backed by mechanized Lean 4 theorems (402 total, zero sorry):
| Theorem | What it proves |
|---|---|
two_bit_four_per_byte |
4 bases pack into exactly 1 byte |
word_parallel_speedup |
32x speedup over character search |
xor_detects_mutations |
XOR produces nonzero exactly at mutation sites |
pack_unpack |
Encode then decode = identity (lossless) |
zero_overhead |
Binary size = data content (no wrapper) |
dna_is_folded_knot |
DNA IS a folded knot (topology = function) |
sigma_skip |
Topological filter skips non-matching regions |
noncoding_is_void |
Non-coding DNA IS evolution's rejection history |
junk_not_junk |
"Junk" DNA carries MORE information than coding |
unwinding_theorem |
Evolutionary history reconstructible from void |
Source: gnosis/lean/
- helix.repair -- DNA topology search engine (live)
- bitwise-genomes -- pre-converted genome datasets
- Aunt Sandy -- cancer genomics via Buleyean probability
- Gnosis -- formal verification (402 Lean theorems)
- Aeon FlowFrames -- self-describing wire protocol
MPL-2.0