Skip to content

forkjoin-ai/bitwise

Repository files navigation

Bitwise

DNA as binary. 2 bits per base. 4x compression. 90 million bases per second.

Like ripgrep, but for genomes.

Install

cargo install --path .

Quick Start

# Search for a DNA pattern across chromosome 17 (81M bases in <1 second)
bw grep GGTGGCGTAGGC chr17.fa

# Search for KRAS G12V mutation hotspot
bw grep GTTGGAGCTGGT cancer-genes/fasta/KRAS.fasta

# Pack a FASTA file to 2-bit binary (4x smaller)
bw pack genome.fasta > genome.bw

# Count mutations between a reference and tumor sample
bw count reference.fasta tumor.fasta

# Show compression statistics
bw stats genome.fasta

bw grep -- DNA Pattern Search

Search any FASTA file for a DNA sequence pattern. Word-parallel matching compares 32 bases per CPU cycle.

$ bw grep AACAACGTTCTG cancer-genes/fasta/TP53.fasta
TP53 exon5 Homo sapiens:11:CTGAAAACAACGTTCTGGTAAG
1 match(es) found
$ bw grep GGGAGGG cancer-genes/fasta/BRCA1.fasta
BRCA1 DNA repair associated:2847:CTGGGAGGGCTAT
BRCA1 DNA repair associated:4291:AAGGGAGGGATTT
2 match(es) found

Output format: header:position:context -- same as ripgrep's file:line:match.

bw pack -- FASTA to Binary

Convert any FASTA file to 2-bit packed binary. 4 bases per byte.

$ bw pack genome.fasta > genome.bw

The .bw format:

[magic: 0x4257 "BW"] [seq_length: u32] [header_length: u16] [header] [packed_bases]

Each byte holds 4 bases, MSB first: ACGT = 0b00_01_10_11.

bw count -- Mutation Detection

Compare two sequences and count mutations via XOR. One CPU instruction per 32 bases.

$ bw count reference.fasta tumor.fasta
Record 1: TP53_ref vs TP53_tumor | 3 mutations in 2512 bases (0.1194%)

bw stats -- Compression Statistics

$ bw stats cancer-genes/fasta/BRCA2.fasta
BRCA2 DNA repair associated: 11954 bases → 2989 bytes (4.0x compression, ratio 0.2500)

The human genome:

Chromosome Bases Bitwise Compression
chr1 248,956,422 59 MB 4.0x
chr2 242,193,529 58 MB 4.0x
... ... ... 4.0x
Total 3,088,286,401 736 MB 3.9x

Encoding

A = 00    C = 01    G = 10    T = 11

ATGCTAGC → 00 11 10 01 11 00 10 01 → 0xE5C9 (2 bytes for 8 bases)

4 bases per byte. 32 bases per u64 word. Zero overhead.

The encoding IS the data. No wrapper, no header beyond the magic bytes. Per the zero_overhead theorem: storage cost = data content.

Datasets

Pre-converted genomes available at:

forkjoin-ai/bitwise-genomes on HuggingFace

  • Cancer gene panel: 20 clinically important genes (TP53, BRCA1, BRCA2, KRAS, EGFR, ...)
  • hg38 human reference: full genome, 25 chromosomes, 736 MB
# Download the full human genome in Bitwise format
huggingface-cli download forkjoin-ai/bitwise-genomes hg38/

# Download just the cancer gene panel
huggingface-cli download forkjoin-ai/bitwise-genomes cancer-genes/

Aeon FlowFrame Wire Format

Bitwise sequences stream natively as Aeon FlowFrames -- the self-describing 10-byte wire protocol:

stream_id  = chromosome (1-25)
sequence   = genomic position
flags      = FORK (replication origin) | FOLD (gene boundary) | VENT (repeat)
payload    = 2-bit packed bases

Wire = storage = memory. No serialization boundary. Proved by flowframe_self_describing theorem.

WASM

Compiles to WebAssembly for browser and edge deployment.

import { pack_bases, search_packed, mutation_count } from 'bitwise';

// Pack a sequence
const packed = pack_bases(new TextEncoder().encode('ATGCTAGCATGC'));

// Search (32 bases per cycle)
const needle = pack_bases(new TextEncoder().encode('TAGC'));
const matches = search_packed(packed, 12, needle, 4);
// → [4]  (found at position 4)

// Count mutations between two sequences
const mutations = mutation_count(ref_packed, tumor_packed, length);

Used by helix.repair for real-time DNA topology search.

Performance

Operation Speed Method
Pack ~500 MB/s Sequential byte write
Search 90M bases/sec Word-parallel XOR, 32 bases per cycle
Mutation count 90M bases/sec XOR + popcount
Compression 4.0x 2 bits vs 8 bits per base

Benchmarked on Apple M3 Pro, single core. No SIMD yet -- pure scalar.

Formal Verification

Every operation is backed by mechanized Lean 4 theorems (402 total, zero sorry):

Theorem What it proves
two_bit_four_per_byte 4 bases pack into exactly 1 byte
word_parallel_speedup 32x speedup over character search
xor_detects_mutations XOR produces nonzero exactly at mutation sites
pack_unpack Encode then decode = identity (lossless)
zero_overhead Binary size = data content (no wrapper)
dna_is_folded_knot DNA IS a folded knot (topology = function)
sigma_skip Topological filter skips non-matching regions
noncoding_is_void Non-coding DNA IS evolution's rejection history
junk_not_junk "Junk" DNA carries MORE information than coding
unwinding_theorem Evolutionary history reconstructible from void

Source: gnosis/lean/

Related

License

MPL-2.0

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors