The world's only agentic web crawler with a peer-to-peer mesh.
Built using the brain of a human who knows about distributed crawling architectures.
Endpoints · Mesh · Anti-Detection · Ghost Protocol · Live Stream · MCP Tools · Quick Start · Architecture
Grub Crawler gets dirty so you don't have to. It works through every layer of protection — Cloudflare, CAPTCHAs, JavaScript walls — digging through the DOM until it finds what it came for. When the front door is locked, Ghost Protocol goes around the back: it screenshots the rendered page and lets a vision model read the pixels directly. Multi-provider? Oh yeah — it will rotate OpenAI, Anthropic, and Ollama in the same session, falling back automatically when one fails. Just raw, unfiltered content extraction that leaves every page fully exposed as clean markdown.
We integrated features from every major crawler — then added what none of them have.
| Feature | Crawl4AI | Firecrawl | Apify | Scrapy | Browserbase | Scrapfly | Grub |
|---|---|---|---|---|---|---|---|
| Self-hosted | ✅ | ✅ | ✅ Crawlee | ✅ | ❌ cloud | ❌ cloud | ✅ full |
| Anti-detect browser | stealth plugin | ❌ cloud only | Camoufox template | ❌ | custom Chromium | proprietary | ✅ Camoufox |
| Ghost Protocol | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ auto fallback |
| Per-request proxy | ✅ escalation | ✅ built-in | middleware | ✅ managed | ✅ 130M+ IPs | ✅ per-request | ✅ per-request + env |
| Stealth patches | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ opt-in |
| Agent loop | ✅ agentic | ✅ /agent | ✅ AI Agent | ❌ spiders | ✅ Stagehand | ❌ | ✅ bounded SM |
| Live browser stream | ✅ WebSocket | ✅ Live View | ❌ | ❌ | ✅ iFrame + CDP | ✅ CDP | ✅ WS + MJPEG |
| Markdown output | ✅ Fit Markdown | ✅ core | ✅ RAG Browser | ❌ | ✅ via MCP | ✅ built-in | ✅ core |
| MCP tools | ✅ community | ✅ official | ✅ official | ✅ official | ✅ official | ❌ | ✅ 15 tools |
| Multi-provider LLM | ✅ all LLMs | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ OpenAI/Anthropic/Ollama |
| Policy enforcement | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ domain gates + redaction |
| Replayable traces | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ full JSON trace |
| Prompt injection defense | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ quarantine + visible-text diff |
| License | Apache 2.0 | AGPL-3.0 | MIT (Crawlee) | BSD | MIT (Stagehand) | Proprietary | Proprietary |
| Pricing | Free | Free–$333/mo | Free–$999/mo | Free | Free–$99/mo | Usage-based | Self-hosted |
| Mesh P2P | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ agents talking to agents |
Only Grub Crawler has Ghost Protocol — automatic vision-based fallback that screenshots blocked pages and extracts content via LLM when every other tool just fails. Prevention (Camoufox + proxy + stealth) handles 95% of blocks. Ghost Protocol handles the rest.
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/crawl` | Single URL crawl (HTML + markdown) | Live |
| POST | `/api/markdown` | Single or multi-URL markdown extraction | Live |
| POST | `/api/batch` | Batch crawl with job tracking | Live |
| POST | `/api/raw` | Raw HTML extraction (no markdown) | Live |
| GET | `/view` | Browser-rendered HTML viewer | Live |
| GET | `/download` | File download (PDFs, etc.) through crawler | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/agent/run` | Submit task to autonomous agent loop | Live |
| GET | `/api/agent/status/{run_id}` | Check agent run status / load trace | Live |
| POST | `/api/agent/ghost` | Ghost Protocol: screenshot + vision extract | Live |
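A hedged sketch of the submit-then-poll pattern these two endpoints imply. The request fields come from the Quick Start example below; the `run_id` and `state` response fields are assumptions about the response shape, not documented contract.

```python
# Sketch: submit an agent task, then poll its status until it finishes.
# Assumes a node on localhost:6792; response field names are assumptions.
import json
import time
import urllib.request

BASE_URL = "http://localhost:6792"

def make_agent_task(task: str, allowed_domains: list[str], max_steps: int = 10) -> dict:
    """Request body for POST /api/agent/run (fields from the Quick Start example)."""
    return {"task": task, "max_steps": max_steps, "allowed_domains": allowed_domains}

def _post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_and_wait(task: str, domains: list[str], poll_s: float = 2.0) -> dict:
    run = _post("/api/agent/run", make_agent_task(task, domains))
    run_id = run["run_id"]  # assumed field name
    while True:
        with urllib.request.urlopen(f"{BASE_URL}/api/agent/status/{run_id}") as resp:
            status = json.load(resp)
        if status.get("state") in ("RESPOND", "STOP"):  # assumed terminal states
            return status
        time.sleep(poll_s)
```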
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/jobs/create` | Generic job submission | Live |
| POST | `/api/jobs/crawl` | Submit single URL crawl job | Live |
| POST | `/api/jobs/batch-crawl` | Submit batch crawl job | Live |
| POST | `/api/jobs/markdown` | Submit markdown-only job | Live |
| POST | `/api/jobs/process-job` | Cloud Tasks worker endpoint | Live |
| POST | `/api/wraith` | AI-driven crawl workflow | Placeholder |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/cache/search` | Fuzzy search cached content | Live |
| GET | `/api/cache/list` | List cached document metadata | Live |
| GET | `/api/cache/doc/{doc_id}` | Fetch one cached document | Live |
| POST | `/api/cache/upsert` | Upsert cache entries | Live |
| POST | `/api/cache/prune` | Prune cache entries by TTL/domain | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/api/sessions/{session_id}/files` | List session files | Live |
| GET | `/api/sessions/{session_id}/file` | Get specific file | Live |
| GET | `/api/sessions/{session_id}/status` | Session progress status | Live |
| GET | `/api/sessions/{session_id}/results` | All crawl results | Live |
| GET | `/api/sessions/{session_id}/screenshots` | List screenshots | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| WS | `/stream/{session_id}` | WebSocket viewport stream | Live |
| GET | `/stream/{session_id}/mjpeg` | MJPEG fallback stream | Live |
| GET | `/stream/{session_id}/status` | Stream session status | Live |
| GET | `/stream/pool/status` | Browser pool status | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/mesh/join` | Peer join + gossip discovery | Live |
| POST | `/mesh/heartbeat` | Peer heartbeat with load metrics | Live |
| POST | `/mesh/execute` | Cross-node tool execution (1-hop max) | Live |
| POST | `/mesh/leave` | Peer departure notification | Live |
| GET | `/mesh/peers` | List known peers + health status | Live |
| GET | `/mesh/status` | This node's mesh status + load | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/health` | Health check + tool count + mesh info | Live |
| GET | `/tools` | List registered AHP tools | Live |
| GET | `/site` | Embedded landing page | Live |
| GET | `/{tool_name}` | Execute AHP tool (catch-all) | Live |
The MCP bridge exposes all capabilities to any MCP-compatible host:
| Tool | Description | Status |
|---|---|---|
| `crawl_url` | Single URL markdown extraction with JS injection | Live |
| `crawl_batch` | Batch processing up to 50 URLs with collation | Live |
| `raw_html` | Raw HTML fetch without conversion | Live |
| `download_file` | Download files (PDFs, etc.) through crawler | Live |
| `crawl_validate` | Content quality assessment | Live |
| `crawl_search` | Fuzzy search local crawl cache | Live |
| `crawl_cache_list` | List local cached files | Live |
| `crawl_remote_search` | Search remote crawler cache | Live |
| `crawl_remote_cache_list` | List remote cache entries | Live |
| `crawl_remote_cache_doc` | Fetch remote cached document | Live |
| `agent_run` | Submit task to autonomous agent (Mode B) | Live |
| `agent_status` | Check agent run status | Live |
| `ghost_extract` | Ghost Protocol: screenshot + vision AI extraction | Live |
| `mesh_peers` | List mesh peers and their health/load status | Live |
| `mesh_status` | Get this node's mesh status and load metrics | Live |
| `set_auth_token` | Save auth token to .wraithenv | Live |
| `crawl_status` | Report configuration and connection | Live |
| File | Purpose | Status |
|---|---|---|
| `types.py` | `RunState` enum, `StopReason`, `ToolCall`, `ToolResult`, `AssistantAction`, `RunConfig`, `RunContext`, `StepTrace`, `RunResult` | Done |
| `errors.py` | Typed errors: `validation_error`, `policy_denied`, `tool_timeout`, `tool_unavailable`, `execution_error`, `provider_error`, `stop_condition` | Done |
| `dispatcher.py` | Tool validation, timeout enforcement (30s), retry (1x), typed error normalization | Done |
| `engine.py` | Bounded loop: plan -> execute -> observe -> stop. EventBus integration. Returns `(RunResult, RunSummary)` | Done |
| `ghost.py` | Ghost Protocol: block detection, screenshot capture, vision extraction, auto-trigger | Done |
| File | Purpose | Status |
|---|---|---|
| `base.py` | `LLMAdapter` ABC, `FallbackAdapter` (rotate on failure), factory functions | Done |
| `openai_adapter.py` | OpenAI tool_calls mapping, GPT-4o vision | Done |
| `anthropic_adapter.py` | Anthropic tool_use/tool_result blocks, Claude Sonnet vision | Done |
| `ollama_adapter.py` | Ollama HTTP `/api/chat`, llava vision | Done |
| File | Purpose | Status |
|---|---|---|
| `domain.py` | Domain allowlist, RFC-1918/loopback/link-local deny | Done |
| `gate.py` | Pre-tool and pre-fetch policy checks with `PolicyVerdict` | Done |
| `redaction.py` | Secret pattern redaction (API keys, JWTs, private keys) | Done |
| File | Purpose | Status |
|---|---|---|
| `events.py` | EventBus + 7 typed events: run_start, step_start, tool_dispatch, tool_result, policy_denied, step_end, run_end | Done |
| `trace.py` | TraceCollector, RunSummary JSON serialization, `persist_trace()` / `load_trace()` via storage | Done |
| File | Purpose | Status |
|---|---|---|
| `agent_routes.py` | POST `/api/agent/run`, GET `/api/agent/status/{run_id}`. 503 when disabled | Done |
| `routes.py` | Core crawl/markdown/batch/cache REST endpoints | Done |
| `job_routes.py` | Job CRUD, session status, Cloud Tasks worker | Done |
| `jobs.py` | `JobType` enum (incl. AGENT_RUN), `JobManager`, `JobProcessor` | Done |
| `models.py` | All Pydantic models incl. `AgentRunRequest`/`Response` | Done |
| File | Purpose | Status |
|---|---|---|
| `stealth.py` | playwright-stealth patches, tracker domain blocking | Done |
| `proxy.py` | Per-request proxy resolution with env fallback | Done |
| File | Purpose | Status |
|---|---|---|
| `models.py` | Wire protocol models: NodeInfo, NodeLoad, MeshToolRequest/Response, PeerState | Done |
| `auth.py` | HMAC-SHA256 token signing/verification with 60s TTL | Done |
| `client.py` | httpx async client for join, heartbeat, leave, execute_tool | Done |
| `coordinator.py` | Lifecycle, peer table, heartbeat loop with seed retry | Done |
| `routes.py` | `/mesh/*` endpoints — join, heartbeat, execute, leave, peers, status | Done |
| `router.py` | Load scoring + target selection (pure logic, no I/O) | Done |
| `dispatcher.py` | MeshDispatcher wrapping local Dispatcher for transparent routing | Done |
| File | Purpose | Status |
|---|---|---|
| `config.py` | All env vars incl. agent + provider + ghost + proxy + stealth config | Done |
| `storage.py` | User-partitioned storage (local filesystem / GCS) | Done |
| `crawler.py` | Playwright crawling engine with proxy support | Done |
| `markdown.py` | HTML to markdown conversion | Done |
| `browser.py` | Browser automation — Chromium + Camoufox engines | Done |
| `browser_pool.py` | Persistent browser pool with lease/return pattern | Done |
| `stream.py` | CDP screencast → WebSocket/MJPEG relay + interactive commands | Done |
```
INIT -> PLAN -> EXECUTE_TOOL -> OBSERVE -> PLAN -> ... -> RESPOND -> STOP
          |
          +-- policy_denied ----------------------------> STOP
          +-- max_steps / max_wall_time / max_failures --> STOP
          +-- no_op_loop (3x empty) ---------------------> STOP
          +-- blocked (ghost trigger) -------------------> GHOST -> OBSERVE
```
Stop conditions enforced every iteration:

- `max_steps` (default: 12)
- `max_wall_time` (default: 90s)
- `max_failures` (default: 3)
- `no_op_loop` (3 consecutive empty responses)
- `policy_denied` (blocked tool/domain)
- `completed` (agent responds with text)
Three layers of anti-detection that stack together. Prevention stops blocks before they happen. Ghost Protocol handles them after.
Pluggable anti-detect browser with C++-level fingerprint spoofing. No manual user-agent tricks — Camoufox generates realistic fingerprints per context at the browser level, including canvas, WebGL, fonts, and navigator properties.
```bash
# Switch engine (default: chromium)
BROWSER_ENGINE=camoufox
```

Route crawl traffic through residential, datacenter, or custom proxy pools. Per-request override with env-based defaults. Full Playwright-compatible proxy config.
```bash
# Env-based default
PROXY_SERVER=http://proxy.example.com:10001
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```

```bash
# Or per-request
curl -X POST http://localhost:6792/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "proxy": {
        "server": "http://proxy.example.com:10001",
        "username": "your_username",
        "password": "your_password"
      }
    }
  }'
```

Opt-in playwright-stealth patches for Chromium (skipped for Camoufox, where stealth is built in). Blocks 20+ tracking/analytics domains (Google Analytics, DataDome, PerimeterX, etc.) to reduce fingerprint surface.

```bash
STEALTH_ENABLED=true
BLOCK_TRACKING_DOMAINS=true
```

When a crawl result signals an anti-bot block (Cloudflare challenge, CAPTCHA, empty SPA shell), the agent can switch to cloak mode:
- Take a full-page screenshot via Playwright
- Send the image to a vision-capable LLM (Claude Sonnet or GPT-4o)
- Extract content from the rendered pixels
- Return extracted text with `render_mode: "ghost"` in the trace
This bypasses DOM-based anti-bot detection entirely.
Requires `AGENT_GHOST_ENABLED=true`. Auto-triggers on detected blocks when `AGENT_GHOST_AUTO_TRIGGER=true`.
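As a rough illustration of what "detecting a block" can mean, here is the kind of heuristic such a detector might use. The real logic lives in `ghost.py`; the status codes, marker strings, and threshold below are assumptions, not the project's actual rule set.

```python
# Hedged sketch of a block-detection heuristic that could precede a Ghost
# Protocol trigger. Markers and thresholds are illustrative assumptions.
BLOCK_MARKERS = (
    "checking your browser",   # Cloudflare interstitial text
    "captcha",
    "access denied",
)

def looks_blocked(status_code: int, html: str, min_chars: int = 500) -> bool:
    """True if the response smells like an anti-bot wall or an empty SPA shell."""
    if status_code in (403, 429, 503):
        return True
    text = html.lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return True
    # "Empty SPA shell": a tiny document with almost no content
    return len(text.strip()) < min_chars
```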
Agents talking to agents. Every Grub instance is both a worker and a coordinator. Local node offloads to cloud, cloud delegates to local. Tool calls cross the wire transparently.
```
Node A (local)                       Node B (cloud)
┌────────────────┐                   ┌────────────────┐
│  AgentEngine   │                   │  AgentEngine   │
│       ↓        │                   │       ↓        │
│ MeshDispatcher │ ───── HTTP ─────→ │ MeshDispatcher │
│       ↓        │                   │       ↓        │
│   Dispatcher   │                   │   Dispatcher   │
│       ↓        │                   │       ↓        │
│  ToolRegistry  │                   │  ToolRegistry  │
└────────────────┘                   └────────────────┘
        ↕ ──────── heartbeat (15s) ────────── ↕
```
How it works:
- Discovery — nodes join via seed peer list, then gossip (1-hop) to learn about others
- Heartbeat — every 15s, nodes exchange load metrics. 3 missed = unhealthy. 2 min = removed
- Routing — MeshDispatcher scores all nodes by load, locality, and affinity, then routes tool calls to the best node
- 1-hop max — Node A → B only, never A → B → C. Prevents routing loops
- Local fallback — if remote execution fails, falls back to local Dispatcher
- HMAC auth — all mesh traffic is signed with a shared secret (SHA-256, 60s TTL)
```bash
# Docker Compose (recommended)
./deploy.sh mesh              # Linux/Mac
./deploy.ps1 -Target mesh     # Windows

# Verify
curl http://localhost:6792/mesh/peers   # Node A sees Node B
curl http://localhost:6793/mesh/peers   # Node B sees Node A
```

```bash
# Deploy to Cloud Run with mesh
./deploy.sh cloudrun latest --mesh-peer http://your-local-ip:6792 --mesh-secret mysecret

# Start local node
MESH_ENABLED=true MESH_SECRET=mysecret MESH_PEERS=https://your-cloud-run-url \
MESH_ADVERTISE_URL=http://your-local-ip:6792 \
uvicorn app.main:app --port 6792
```

```bash
# Node A
MESH_ENABLED=true MESH_NODE_NAME=local MESH_SECRET=test123 \
MESH_ADVERTISE_URL=http://localhost:6792 \
uvicorn app.main:app --port 6792

# Node B
MESH_ENABLED=true MESH_NODE_NAME=cloud MESH_SECRET=test123 \
MESH_PEERS=http://localhost:6792 \
MESH_ADVERTISE_URL=http://localhost:8081 \
uvicorn app.main:app --port 8081
```

When mesh is disabled (`MESH_ENABLED=false`, the default), Grub operates as a normal single-node crawler with zero mesh overhead.
Watch the crawler work in real-time. A persistent pool of warm Chromium instances streams viewport frames over WebSocket or MJPEG.
WebSocket — connect and send interactive commands:

```javascript
const ws = new WebSocket("ws://localhost:6792/stream/my-session?url=https://example.com");
ws.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "frame") document.getElementById("viewport").src = "data:image/jpeg;base64," + msg.data;
};

// Navigate, click, scroll, type — all over the same socket
ws.send(JSON.stringify({ action: "navigate", url: "https://example.com/pricing" }));
ws.send(JSON.stringify({ action: "click", selector: "#signup-btn" }));
ws.send(JSON.stringify({ action: "scroll", direction: "down" }));
```

MJPEG — drop it in an `<img>` tag for instant video:

```html
<img src="http://localhost:6792/stream/my-session/mjpeg?url=https://example.com" />
```

Requires `BROWSER_STREAM_ENABLED=true`. Each Chromium instance uses ~150-300MB RAM.
```bash
git clone <repo>
cd grub-crawl
cp .env.example .env
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 6792
```

```bash
# Add to .env
AGENT_ENABLED=true
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
AGENT_PROVIDER=anthropic
```

```bash
curl -X POST http://localhost:6792/api/agent/run \
  -H "Content-Type: application/json" \
  -d '{
    "task": "Find the pricing page on example.com and extract plan details",
    "max_steps": 10,
    "allowed_domains": ["example.com"]
  }'
```

```bash
# Single node
./deploy.sh local             # or ./deploy.ps1 -Target local

# 2-node mesh
./deploy.sh mesh              # or ./deploy.ps1 -Target mesh

# Cloud Run
./deploy.sh cloudrun v1.0.0   # or ./deploy.ps1 -Target cloudrun -Tag v1.0.0

# Cloud Run + mesh (connect to local node)
./deploy.sh cloudrun v1.0.0 --mesh-peer http://your-ip:6792 --mesh-secret mykey
```
```bash
# Add to .env
BROWSER_ENGINE=camoufox
STEALTH_ENABLED=true
BLOCK_TRACKING_DOMAINS=true

# Optional: proxy
PROXY_SERVER=http://proxy.example.com:10001
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```

```bash
# Add to .env
AGENT_GHOST_ENABLED=true
```

```bash
curl -X POST http://localhost:6792/api/agent/ghost \
  -H "Content-Type: application/json" \
  -d '{"url": "https://blocked-site.com"}'
```

```bash
# Add to .env
BROWSER_STREAM_ENABLED=true
BROWSER_POOL_SIZE=2

# MJPEG (open in browser)
open "http://localhost:6792/stream/demo/mjpeg?url=https://example.com"
```

- `HOST` (default: 0.0.0.0)
- `PORT` (default: 6792)
- `DEBUG` (default: false)
- `STORAGE_PATH` (default: ./storage)
- `RUNNING_IN_CLOUD` (default: false)
- `GCS_BUCKET_NAME`
- `GOOGLE_CLOUD_PROJECT`

- `DISABLE_AUTH` (default: false)
- `GNOSIS_AUTH_URL` (default: http://gnosis-auth:5000)

- `BROWSER_ENGINE` — chromium | camoufox (default: chromium)

- `MAX_CONCURRENT_CRAWLS` (default: 5)
- `CRAWL_TIMEOUT` (default: 30)
- `ENABLE_JAVASCRIPT` (default: true)
- `ENABLE_SCREENSHOTS` (default: false)

- `PROXY_SERVER` — proxy URL (e.g. http://proxy:10001)
- `PROXY_USERNAME`
- `PROXY_PASSWORD`
- `PROXY_BYPASS` — comma-separated bypass list

- `STEALTH_ENABLED` (default: false) — playwright-stealth patches
- `BLOCK_TRACKING_DOMAINS` (default: false) — block analytics/tracking requests

- `AGENT_ENABLED` (default: false)
- `AGENT_MAX_STEPS` (default: 12)
- `AGENT_MAX_WALL_TIME_MS` (default: 90000)
- `AGENT_MAX_FAILURES` (default: 3)
- `AGENT_ALLOWED_TOOLS` — comma-separated allowlist
- `AGENT_ALLOWED_DOMAINS` — comma-separated allowlist
- `AGENT_BLOCK_PRIVATE_RANGES` (default: true)
- `AGENT_REDACT_SECRETS` (default: true)

- `AGENT_PROVIDER` — openai | anthropic | ollama (default: openai)
- `OPENAI_API_KEY`
- `OPENAI_MODEL` (default: gpt-4.1-mini)
- `ANTHROPIC_API_KEY`
- `ANTHROPIC_MODEL` (default: claude-3-5-sonnet-latest)
- `OLLAMA_BASE_URL` (default: http://localhost:11434)
- `OLLAMA_MODEL` (default: llama3.1:8b-instruct)

- `AGENT_GHOST_ENABLED` (default: false)
- `AGENT_GHOST_AUTO_TRIGGER` (default: true)
- `AGENT_GHOST_VISION_PROVIDER` — inherits from AGENT_PROVIDER
- `AGENT_GHOST_MAX_IMAGE_WIDTH` (default: 1280)
- `MESH_ENABLED` (default: false) — master switch
- `MESH_PEERS` — comma-separated seed peer URLs
- `MESH_NODE_NAME` — human-readable name (default: hostname)
- `MESH_SECRET` — shared HMAC secret for inter-node auth
- `MESH_ADVERTISE_URL` — URL peers use to reach this node
- `MESH_PREFER_LOCAL` (default: true) — bias toward local execution
- `MESH_HEARTBEAT_INTERVAL_S` (default: 15)
- `MESH_PEER_TIMEOUT_S` (default: 45) — mark unhealthy after this
- `MESH_PEER_REMOVE_S` (default: 120) — remove from peer table after this
- `MESH_REMOTE_TIMEOUT_MS` (default: 35000) — timeout for remote tool calls
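The three timing knobs above imply a simple peer-health ladder, which can be sketched as a pure function. This is illustrative; the real peer table lives in `coordinator.py`, and the state names here are assumptions.

```python
# Sketch of the peer-health thresholds above: heartbeats arrive every 15s,
# a peer is unhealthy after MESH_PEER_TIMEOUT_S (45s) of silence and
# removed after MESH_PEER_REMOVE_S (120s). State names are illustrative.
PEER_TIMEOUT_S = 45
PEER_REMOVE_S = 120

def peer_state(last_heartbeat: float, now: float) -> str:
    """Classify a peer by the age of its last heartbeat."""
    age = now - last_heartbeat
    if age >= PEER_REMOVE_S:
        return "removed"
    if age >= PEER_TIMEOUT_S:
        return "unhealthy"
    return "healthy"
```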
- `BROWSER_POOL_SIZE` (default: 1)
- `BROWSER_STREAM_ENABLED` (default: false)
- `BROWSER_STREAM_QUALITY` (default: 25) — JPEG quality 1-100
- `BROWSER_STREAM_MAX_WIDTH` (default: 854)
- `BROWSER_STREAM_MAX_LEASE_SECONDS` (default: 300)
`POST /api/markdown` returns:

`success`, `url`, `final_url`, `status_code`, `markdown`, `markdown_plain`, `content`, `render_mode`, `wait_strategy`, `timings_ms`, `blocked`, `block_reason`, `captcha_detected`, `http_error_family`, `body_char_count`, `body_word_count`, `visible_char_count`, `visible_word_count`, `visible_similarity`, `quarantined`, `quarantine_reason`, `policy_flags`, `content_quality`, `extractor_version`, `normalized_url`, `content_hash`
- `blocked` — anti-bot/captcha/challenge
- `empty` — very low signal
- `minimal` — thin/error pages
- `sufficient` — usable for summarization

Do not summarize unless `content_quality == "sufficient"`.
- `quarantined=true` means the extractor detected instruction-like text in extracted content that was not present in the page's visible rendered text (common in `.sr-only`/visually-hidden abuse).
- When quarantined, `content_quality` is downgraded to `minimal`, `policy_flags` includes `hidden_text_suspected` and `quarantined`, and `content`/`markdown` outputs are blanked (fail-closed).
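A consumer can enforce the rules above with a small gate over the documented response fields. This is a client-side sketch, not part of the API itself:

```python
# Client-side gate for /api/markdown results: only summarize when the
# content is "sufficient" and neither quarantined nor blocked.
def can_summarize(result: dict) -> bool:
    """Apply the README's summarization rules to a response dict."""
    return (
        result.get("content_quality") == "sufficient"
        and not result.get("quarantined", False)
        and not result.get("blocked", False)
    )
```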
```json
{"error": "http_error|validation_error|internal_error", "status": 400, "details": {}}
```

- Agent core — state machine, types, errors (W1)
- Unified tool contract — dispatcher with timeout/retry (W2)
- Policy gates — domain allowlist, private-range deny, redaction (W3)
- Observability — EventBus, TraceCollector, RunSummary persistence (W4)
- API wiring — `/api/agent/run`, `/api/agent/status`, `JobType.AGENT_RUN` (W5)
- Provider adapters — OpenAI, Anthropic, Ollama with fallback (W6)
- Config flags — agent, provider, ghost, stream settings (W7)
- Cloak-mode trigger detection (W8)
- Screenshot capture pipeline (W8)
- Vision extraction via Claude/GPT-4o (W8)
- Fallback chain in engine (W8)
- Ghost tool for external callers (W8)
- Ghost MCP tool + REST endpoint (W8)
- Persistent browser pool with lease/return (W9)
- CDP screencast relay (W9)
- WebSocket endpoint with interactive commands (W9)
- MJPEG fallback stream (W9)
- Stream status + pool status endpoints (W9)
- Camoufox anti-detect browser engine (W10)
- Per-request proxy with env fallback (W10)
- Stealth patches for Chromium (W10)
- Tracker/analytics domain blocking (W10)
- Anthropic vision format detection fix (W10)
- Peer discovery with gossip (1-hop) (W11)
- HMAC-SHA256 inter-node auth (W11)
- Heartbeat loop with load metrics + seed retry (W11)
- MeshDispatcher — transparent cross-node tool routing (W12)
- Load-based scoring with locality/affinity bonus (W12)
- Deploy scripts — local, mesh, Cloud Run (W12)
- Docker Compose 2-node mesh topology (W12)
- Embedded landing page (grub-site) (W12)
- Unit test suite — 176 tests across all modules
- Error handling improvements
- Monitoring and alerting
- Performance optimization
See MASTER_PLAN.md for the full architecture plan.
Grub Crawler Project License