Bitwise

DNA as binary. 2 bits per base. 4x compression. 90 million bases per second.

Like ripgrep, but for genomes.

Install

cargo install --path .

Quick Start

# Search for a DNA pattern across chromosome 17 (81M bases in <1 second)
bw grep GGTGGCGTAGGC chr17.fa

# Search for KRAS G12V mutation hotspot
bw grep GTTGGAGCTGGT cancer-genes/fasta/KRAS.fasta

# Pack a FASTA file to 2-bit binary (4x smaller)
bw pack genome.fasta > genome.bw

# Count mutations between a reference and tumor sample
bw count reference.fasta tumor.fasta

# Show compression statistics
bw stats genome.fasta

`bw grep` -- DNA Pattern Search

Search any FASTA file for a DNA sequence pattern. Word-parallel matching compares 32 bases per CPU cycle.

$ bw grep AACAACGTTCTG cancer-genes/fasta/TP53.fasta
TP53 exon5 Homo sapiens:11:CTGAAAACAACGTTCTGGTAAG
1 match(es) found

$ bw grep GGGAGGG cancer-genes/fasta/BRCA1.fasta
BRCA1 DNA repair associated:2847:CTGGGAGGGCTAT
BRCA1 DNA repair associated:4291:AAGGGAGGGATTT
2 match(es) found

Output format: header:position:context -- same as ripgrep's file:line:match.

`bw pack` -- FASTA to Binary

Convert any FASTA file to 2-bit packed binary. 4 bases per byte.

$ bw pack genome.fasta > genome.bw

The .bw format:

[magic: 0x4257 "BW"] [seq_length: u32] [header_length: u16] [header] [packed_bases]

Each byte holds 4 bases, MSB first: ACGT = 0b00_01_10_11.

`bw count` -- Mutation Detection

Compare two sequences and count mutations via XOR. One CPU instruction per 32 bases.

$ bw count reference.fasta tumor.fasta
Record 1: TP53_ref vs TP53_tumor | 3 mutations in 2512 bases (0.1194%)

`bw stats` -- Compression Statistics

$ bw stats cancer-genes/fasta/BRCA2.fasta
BRCA2 DNA repair associated: 11954 bases → 2989 bytes (4.0x compression, ratio 0.2500)

The human genome:

Chromosome	Bases	Bitwise	Compression
chr1	248,956,422	59 MB	4.0x
chr2	242,193,529	58 MB	4.0x
...	...	...	4.0x
Total	3,088,286,401	736 MB	3.9x

Encoding

A = 00    C = 01    G = 10    T = 11

ATGCTAGC → 00 11 10 01 11 00 10 01 → 0xE5C9 (2 bytes for 8 bases)

4 bases per byte. 32 bases per u64 word. Zero overhead.

The encoding IS the data. No wrapper, no header beyond the magic bytes. Per the zero_overhead theorem: storage cost = data content.

Datasets

Pre-converted genomes available at:

forkjoin-ai/bitwise-genomes on HuggingFace

Cancer gene panel: 20 clinically important genes (TP53, BRCA1, BRCA2, KRAS, EGFR, ...)
hg38 human reference: full genome, 25 chromosomes, 736 MB

# Download the full human genome in Bitwise format
huggingface-cli download forkjoin-ai/bitwise-genomes hg38/

# Download just the cancer gene panel
huggingface-cli download forkjoin-ai/bitwise-genomes cancer-genes/

Aeon FlowFrame Wire Format

Bitwise sequences stream natively as Aeon FlowFrames -- the self-describing 10-byte wire protocol:

stream_id  = chromosome (1-25)
sequence   = genomic position
flags      = FORK (replication origin) | FOLD (gene boundary) | VENT (repeat)
payload    = 2-bit packed bases

Wire = storage = memory. No serialization boundary. Proved by flowframe_self_describing theorem.

WASM

Compiles to WebAssembly for browser and edge deployment.

import { pack_bases, search_packed, mutation_count } from 'bitwise';

// Pack a sequence
const packed = pack_bases(new TextEncoder().encode('ATGCTAGCATGC'));

// Search (32 bases per cycle)
const needle = pack_bases(new TextEncoder().encode('TAGC'));
const matches = search_packed(packed, 12, needle, 4);
// → [4]  (found at position 4)

// Count mutations between two sequences
const mutations = mutation_count(ref_packed, tumor_packed, length);

Used by helix.repair for real-time DNA topology search.

Performance

Operation	Speed	Method
Pack	~500 MB/s	Sequential byte write
Search	90M bases/sec	Word-parallel XOR, 32 bases per cycle
Mutation count	90M bases/sec	XOR + popcount
Compression	4.0x	2 bits vs 8 bits per base

Benchmarked on Apple M3 Pro, single core. No SIMD yet -- pure scalar.

Formal Verification

Every operation is backed by mechanized Lean 4 theorems (402 total, zero sorry):

Theorem	What it proves
`two_bit_four_per_byte`	4 bases pack into exactly 1 byte
`word_parallel_speedup`	32x speedup over character search
`xor_detects_mutations`	XOR produces nonzero exactly at mutation sites
`pack_unpack`	Encode then decode = identity (lossless)
`zero_overhead`	Binary size = data content (no wrapper)
`dna_is_folded_knot`	DNA IS a folded knot (topology = function)
`sigma_skip`	Topological filter skips non-matching regions
`noncoding_is_void`	Non-coding DNA IS evolution's rejection history
`junk_not_junk`	"Junk" DNA carries MORE information than coding
`unwinding_theorem`	Evolutionary history reconstructible from void

Source: gnosis/lean/

License

MPL-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
datasets		datasets
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
cloudbuild-genome-convert.yaml		cloudbuild-genome-convert.yaml
cloudbuild-whole-genome.yaml		cloudbuild-whole-genome.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bitwise

Install

Quick Start

`bw grep` -- DNA Pattern Search

`bw pack` -- FASTA to Binary

`bw count` -- Mutation Detection

`bw stats` -- Compression Statistics

Encoding

Datasets

Aeon FlowFrame Wire Format

WASM

Performance

Formal Verification

Related

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Bitwise

Install

Quick Start

bw grep -- DNA Pattern Search

bw pack -- FASTA to Binary

bw count -- Mutation Detection

bw stats -- Compression Statistics

Encoding

Datasets

Aeon FlowFrame Wire Format

WASM

Performance

Formal Verification

Related

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`bw grep` -- DNA Pattern Search

`bw pack` -- FASTA to Binary

`bw count` -- Mutation Detection

`bw stats` -- Compression Statistics

Packages