The world's only agentic web crawler with a peer-to-peer mesh.
Built using the brain of a human who knows about distributed crawling architectures.
Endpoints · Mesh · Anti-Detection · Ghost Protocol · Live Stream · MCP Tools · Quick Start · Architecture
Grub Crawler gets dirty so you don't have to. It works through every layer of protection — Cloudflare, CAPTCHAs, JavaScript walls — digging through the DOM until it finds what it came for. When the front door is locked, Ghost Protocol goes around the back: it screenshots the rendered page and lets a vision model read the pixels directly. Multi-provider? Oh yeah — it will rotate OpenAI, Anthropic, and Ollama in the same session, falling back automatically when one fails. Just raw, unfiltered content extraction that leaves every page fully exposed as clean markdown.
We integrated features from every major crawler — then added what none of them have.
| Feature | Crawl4AI | Firecrawl | Apify | Scrapy | Browserbase | Scrapfly | Grub |
|---|---|---|---|---|---|---|---|
| Self-hosted | ✅ | ✅ | ✅ Crawlee | ✅ | ❌ cloud | ❌ cloud | ✅ full |
| Anti-detect browser | stealth plugin | ❌ cloud only | Camoufox template | ❌ | custom Chromium | proprietary | ✅ Camoufox |
| Ghost Protocol | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ auto fallback |
| Per-request proxy | ✅ escalation | ✅ built-in | middleware | ✅ managed | ✅ 130M+ IPs | ✅ per-request | ✅ per-request + env |
| Stealth patches | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ opt-in |
| Agent loop | ✅ agentic | ✅ /agent | ✅ AI Agent | ❌ spiders | ✅ Stagehand | ❌ | ✅ bounded SM |
| Live browser stream | ✅ WebSocket | ✅ Live View | ❌ | ❌ | ✅ iFrame + CDP | ✅ CDP | ✅ WS + MJPEG |
| Markdown output | ✅ Fit Markdown | ✅ core | ✅ RAG Browser | ❌ | ✅ via MCP | ✅ built-in | ✅ core |
| MCP tools | ✅ community | ✅ official | ✅ official | ✅ official | ✅ official | ❌ | ✅ 15 tools |
| Multi-provider LLM | ✅ all LLMs | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ OpenAI/Anthropic/Ollama |
| Policy enforcement | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ domain gates + redaction |
| Replayable traces | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ full JSON trace |
| Prompt injection defense | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ quarantine + visible-text diff |
| License | Apache 2.0 | AGPL-3.0 | MIT (Crawlee) | BSD | MIT (Stagehand) | Proprietary | Proprietary |
| Pricing | Free | Free–$333/mo | Free–$999/mo | Free | Free–$99/mo | Usage-based | Self-hosted |
| Mesh P2P | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ agents talking to agents |
Only Grub Crawler has Ghost Protocol — automatic vision-based fallback that screenshots blocked pages and extracts content via LLM when every other tool just fails. Prevention (Camoufox + proxy + stealth) handles 95% of blocks. Ghost Protocol handles the rest.
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/crawl` | Single URL crawl (HTML + markdown) | Live |
| POST | `/api/markdown` | Single or multi-URL markdown extraction | Live |
| POST | `/api/batch` | Batch crawl with job tracking | Live |
| POST | `/api/raw` | Raw HTML extraction (no markdown) | Live |
| GET | `/view` | Browser-rendered HTML viewer | Live |
| GET | `/download` | File download (PDFs, etc.) through crawler | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/agent/run` | Submit task to autonomous agent loop | Live |
| GET | `/api/agent/status/{run_id}` | Check agent run status / load trace | Live |
| POST | `/api/agent/ghost` | Ghost Protocol: screenshot + vision extract | Live |
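A hedged sketch of the submit-then-poll pattern these two endpoints imply. The request fields come from the Quick Start example below; the `run_id` and `state` response fields are assumptions about the response shape, not documented contract.

```python
# Sketch: submit an agent task, then poll its status until it finishes.
# Assumes a node on localhost:6792; response field names are assumptions.
import json
import time
import urllib.request

BASE_URL = "http://localhost:6792"

def make_agent_task(task: str, allowed_domains: list[str], max_steps: int = 10) -> dict:
    """Request body for POST /api/agent/run (fields from the Quick Start example)."""
    return {"task": task, "max_steps": max_steps, "allowed_domains": allowed_domains}

def _post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_and_wait(task: str, domains: list[str], poll_s: float = 2.0) -> dict:
    run = _post("/api/agent/run", make_agent_task(task, domains))
    run_id = run["run_id"]  # assumed field name
    while True:
        with urllib.request.urlopen(f"{BASE_URL}/api/agent/status/{run_id}") as resp:
            status = json.load(resp)
        if status.get("state") in ("RESPOND", "STOP"):  # assumed terminal states
            return status
        time.sleep(poll_s)
```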
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/jobs/create` | Generic job submission | Live |
| POST | `/api/jobs/crawl` | Submit single URL crawl job | Live |
| POST | `/api/jobs/batch-crawl` | Submit batch crawl job | Live |
| POST | `/api/jobs/markdown` | Submit markdown-only job | Live |
| POST | `/api/jobs/process-job` | Cloud Tasks worker endpoint | Live |
| POST | `/api/wraith` | AI-driven crawl workflow | Placeholder |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/api/cache/search` | Fuzzy search cached content | Live |
| GET | `/api/cache/list` | List cached document metadata | Live |
| GET | `/api/cache/doc/{doc_id}` | Fetch one cached document | Live |
| POST | `/api/cache/upsert` | Upsert cache entries | Live |
| POST | `/api/cache/prune` | Prune cache entries by TTL/domain | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/api/sessions/{session_id}/files` | List session files | Live |
| GET | `/api/sessions/{session_id}/file` | Get specific file | Live |
| GET | `/api/sessions/{session_id}/status` | Session progress status | Live |
| GET | `/api/sessions/{session_id}/results` | All crawl results | Live |
| GET | `/api/sessions/{session_id}/screenshots` | List screenshots | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| WS | `/stream/{session_id}` | WebSocket viewport stream | Live |
| GET | `/stream/{session_id}/mjpeg` | MJPEG fallback stream | Live |
| GET | `/stream/{session_id}/status` | Stream session status | Live |
| GET | `/stream/pool/status` | Browser pool status | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| POST | `/mesh/join` | Peer join + gossip discovery | Live |
| POST | `/mesh/heartbeat` | Peer heartbeat with load metrics | Live |
| POST | `/mesh/execute` | Cross-node tool execution (1-hop max) | Live |
| POST | `/mesh/leave` | Peer departure notification | Live |
| GET | `/mesh/peers` | List known peers + health status | Live |
| GET | `/mesh/status` | This node's mesh status + load | Live |
| Method | Path | Description | Status |
|---|---|---|---|
| GET | `/health` | Health check + tool count + mesh info | Live |
| GET | `/tools` | List registered AHP tools | Live |
| GET | `/site` | Embedded landing page | Live |
| GET | `/{tool_name}` | Execute AHP tool (catch-all) | Live |
The MCP bridge exposes all capabilities to any MCP-compatible host:
| Tool | Description | Status |
|---|---|---|
| `crawl_url` | Single URL markdown extraction with JS injection | Live |
| `crawl_batch` | Batch processing up to 50 URLs with collation | Live |
| `raw_html` | Raw HTML fetch without conversion | Live |
| `download_file` | Download files (PDFs, etc.) through crawler | Live |
| `crawl_validate` | Content quality assessment | Live |
| `crawl_search` | Fuzzy search local crawl cache | Live |
| `crawl_cache_list` | List local cached files | Live |
| `crawl_remote_search` | Search remote crawler cache | Live |
| `crawl_remote_cache_list` | List remote cache entries | Live |
| `crawl_remote_cache_doc` | Fetch remote cached document | Live |
| `agent_run` | Submit task to autonomous agent (Mode B) | Live |
| `agent_status` | Check agent run status | Live |
| `ghost_extract` | Ghost Protocol: screenshot + vision AI extraction | Live |
| `mesh_peers` | List mesh peers and their health/load status | Live |
| `mesh_status` | Get this node's mesh status and load metrics | Live |
| `set_auth_token` | Save auth token to .wraithenv | Live |
| `crawl_status` | Report configuration and connection | Live |
| File | Purpose | Status |
|---|---|---|
| `types.py` | `RunState` enum, `StopReason`, `ToolCall`, `ToolResult`, `AssistantAction`, `RunConfig`, `RunContext`, `StepTrace`, `RunResult` | Done |
| `errors.py` | Typed errors: `validation_error`, `policy_denied`, `tool_timeout`, `tool_unavailable`, `execution_error`, `provider_error`, `stop_condition` | Done |
| `dispatcher.py` | Tool validation, timeout enforcement (30s), retry (1x), typed error normalization | Done |
| `engine.py` | Bounded loop: plan -> execute -> observe -> stop. EventBus integration. Returns `(RunResult, RunSummary)` | Done |
| `ghost.py` | Ghost Protocol: block detection, screenshot capture, vision extraction, auto-trigger | Done |
| File | Purpose | Status |
|---|---|---|
| `base.py` | `LLMAdapter` ABC, `FallbackAdapter` (rotate on failure), factory functions | Done |
| `openai_adapter.py` | OpenAI tool_calls mapping, GPT-4o vision | Done |
| `anthropic_adapter.py` | Anthropic tool_use/tool_result blocks, Claude Sonnet vision | Done |
| `ollama_adapter.py` | Ollama HTTP `/api/chat`, llava vision | Done |
| File | Purpose | Status |
|---|---|---|
| `domain.py` | Domain allowlist, RFC-1918/loopback/link-local deny | Done |
| `gate.py` | Pre-tool and pre-fetch policy checks with `PolicyVerdict` | Done |
| `redaction.py` | Secret pattern redaction (API keys, JWTs, private keys) | Done |
| File | Purpose | Status |
|---|---|---|
| `events.py` | EventBus + 7 typed events: run_start, step_start, tool_dispatch, tool_result, policy_denied, step_end, run_end | Done |
| `trace.py` | TraceCollector, RunSummary JSON serialization, `persist_trace()` / `load_trace()` via storage | Done |
| File | Purpose | Status |
|---|---|---|
| `agent_routes.py` | POST `/api/agent/run`, GET `/api/agent/status/{run_id}`. 503 when disabled | Done |
| `routes.py` | Core crawl/markdown/batch/cache REST endpoints | Done |
| `job_routes.py` | Job CRUD, session status, Cloud Tasks worker | Done |
| `jobs.py` | `JobType` enum (incl. AGENT_RUN), `JobManager`, `JobProcessor` | Done |
| `models.py` | All Pydantic models incl. `AgentRunRequest`/`Response` | Done |
| File | Purpose | Status |
|---|---|---|
| `stealth.py` | playwright-stealth patches, tracker domain blocking | Done |
| `proxy.py` | Per-request proxy resolution with env fallback | Done |
| File | Purpose | Status |
|---|---|---|
| `models.py` | Wire protocol models: NodeInfo, NodeLoad, MeshToolRequest/Response, PeerState | Done |
| `auth.py` | HMAC-SHA256 token signing/verification with 60s TTL | Done |
| `client.py` | httpx async client for join, heartbeat, leave, execute_tool | Done |
| `coordinator.py` | Lifecycle, peer table, heartbeat loop with seed retry | Done |
| `routes.py` | `/mesh/*` endpoints — join, heartbeat, execute, leave, peers, status | Done |
| `router.py` | Load scoring + target selection (pure logic, no I/O) | Done |
| `dispatcher.py` | MeshDispatcher wrapping local Dispatcher for transparent routing | Done |
| File | Purpose | Status |
|---|---|---|
| `config.py` | All env vars incl. agent + provider + ghost + proxy + stealth config | Done |
| `storage.py` | User-partitioned storage (local filesystem / GCS) | Done |
| `crawler.py` | Playwright crawling engine with proxy support | Done |
| `markdown.py` | HTML to markdown conversion | Done |
| `browser.py` | Browser automation — Chromium + Camoufox engines | Done |
| `browser_pool.py` | Persistent browser pool with lease/return pattern | Done |
| `stream.py` | CDP screencast → WebSocket/MJPEG relay + interactive commands | Done |
```
INIT -> PLAN -> EXECUTE_TOOL -> OBSERVE -> PLAN -> ... -> RESPOND -> STOP
          |
          +-- policy_denied ----------------------------> STOP
          +-- max_steps / max_wall_time / max_failures --> STOP
          +-- no_op_loop (3x empty) ---------------------> STOP
          +-- blocked (ghost trigger) -------------------> GHOST -> OBSERVE
```
Stop conditions enforced every iteration:

- `max_steps` (default: 12)
- `max_wall_time` (default: 90s)
- `max_failures` (default: 3)
- `no_op_loop` (3 consecutive empty responses)
- `policy_denied` (blocked tool/domain)
- `completed` (agent responds with text)
Three layers of anti-detection that stack together. Prevention stops blocks before they happen. Ghost Protocol handles them after.
Pluggable anti-detect browser with C++-level fingerprint spoofing. No manual user-agent tricks — Camoufox generates realistic fingerprints per context at the browser level, including canvas, WebGL, fonts, and navigator properties.
```bash
# Switch engine (default: chromium)
BROWSER_ENGINE=camoufox
```

Route crawl traffic through residential, datacenter, or custom proxy pools. Per-request override with env-based defaults. Full Playwright-compatible proxy config.
```bash
# Env-based default
PROXY_SERVER=http://proxy.example.com:10001
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```

```bash
# Or per-request
curl -X POST http://localhost:6792/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "proxy": {
        "server": "http://proxy.example.com:10001",
        "username": "your_username",
        "password": "your_password"
      }
    }
  }'
```

Opt-in playwright-stealth patches for Chromium (skipped for Camoufox, where stealth is built in). Blocks 20+ tracking/analytics domains (Google Analytics, DataDome, PerimeterX, etc.) to reduce fingerprint surface.

```bash
STEALTH_ENABLED=true
BLOCK_TRACKING_DOMAINS=true
```

When a crawl result signals an anti-bot block (Cloudflare challenge, CAPTCHA, empty SPA shell), the agent can switch to cloak mode:
- Take a full-page screenshot via Playwright
- Send the image to a vision-capable LLM (Claude Sonnet or GPT-4o)
- Extract content from the rendered pixels
- Return extracted text with `render_mode: "ghost"` in the trace
This bypasses DOM-based anti-bot detection entirely.
Requires `AGENT_GHOST_ENABLED=true`. Auto-triggers on detected blocks when `AGENT_GHOST_AUTO_TRIGGER=true`.
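As a rough illustration of what "detecting a block" can mean, here is the kind of heuristic such a detector might use. The real logic lives in `ghost.py`; the status codes, marker strings, and threshold below are assumptions, not the project's actual rule set.

```python
# Hedged sketch of a block-detection heuristic that could precede a Ghost
# Protocol trigger. Markers and thresholds are illustrative assumptions.
BLOCK_MARKERS = (
    "checking your browser",   # Cloudflare interstitial text
    "captcha",
    "access denied",
)

def looks_blocked(status_code: int, html: str, min_chars: int = 500) -> bool:
    """True if the response smells like an anti-bot wall or an empty SPA shell."""
    if status_code in (403, 429, 503):
        return True
    text = html.lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return True
    # "Empty SPA shell": a tiny document with almost no content
    return len(text.strip()) < min_chars
```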
Agents talking to agents. Every Grub instance is both a worker and a coordinator. Local node offloads to cloud, cloud delegates to local. Tool calls cross the wire transparently.
```
Node A (local)                       Node B (cloud)
┌────────────────┐                   ┌────────────────┐
│  AgentEngine   │                   │  AgentEngine   │
│       ↓        │                   │       ↓        │
│ MeshDispatcher │ ───── HTTP ─────→ │ MeshDispatcher │
│       ↓        │                   │       ↓        │
│   Dispatcher   │                   │   Dispatcher   │
│       ↓        │                   │       ↓        │
│  ToolRegistry  │                   │  ToolRegistry  │
└────────────────┘                   └────────────────┘
        ↕ ──────── heartbeat (15s) ────────── ↕
```
How it works:
- Discovery — nodes join via seed peer list, then gossip (1-hop) to learn about others
- Heartbeat — every 15s, nodes exchange load metrics. 3 missed = unhealthy. 2 min = removed
- Routing — MeshDispatcher scores all nodes by load, locality, and affinity, then routes tool calls to the best node
- 1-hop max — Node A → B only, never A → B → C. Prevents routing loops
- Local fallback — if remote execution fails, falls back to local Dispatcher
- HMAC auth — all mesh traffic is signed with a shared secret (SHA-256, 60s TTL)
```bash
# Docker Compose (recommended)
./deploy.sh mesh              # Linux/Mac
./deploy.ps1 -Target mesh     # Windows

# Verify
curl http://localhost:6792/mesh/peers   # Node A sees Node B
curl http://localhost:6793/mesh/peers   # Node B sees Node A
```

```bash
# Deploy to Cloud Run with mesh
./deploy.sh cloudrun latest --mesh-peer http://your-local-ip:6792 --mesh-secret mysecret

# Start local node
MESH_ENABLED=true MESH_SECRET=mysecret MESH_PEERS=https://your-cloud-run-url \
MESH_ADVERTISE_URL=http://your-local-ip:6792 \
uvicorn app.main:app --port 6792
```

```bash
# Node A
MESH_ENABLED=true MESH_NODE_NAME=local MESH_SECRET=test123 \
MESH_ADVERTISE_URL=http://localhost:6792 \
uvicorn app.main:app --port 6792

# Node B
MESH_ENABLED=true MESH_NODE_NAME=cloud MESH_SECRET=test123 \
MESH_PEERS=http://localhost:6792 \
MESH_ADVERTISE_URL=http://localhost:8081 \
uvicorn app.main:app --port 8081
```

When mesh is disabled (`MESH_ENABLED=false`, the default), Grub operates as a normal single-node crawler with zero mesh overhead.
Watch the crawler work in real-time. A persistent pool of warm Chromium instances streams viewport frames over WebSocket or MJPEG.
WebSocket — connect and send interactive commands:

```javascript
const ws = new WebSocket("ws://localhost:6792/stream/my-session?url=https://example.com");
ws.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "frame") document.getElementById("viewport").src = "data:image/jpeg;base64," + msg.data;
};

// Navigate, click, scroll, type — all over the same socket
ws.send(JSON.stringify({ action: "navigate", url: "https://example.com/pricing" }));
ws.send(JSON.stringify({ action: "click", selector: "#signup-btn" }));
ws.send(JSON.stringify({ action: "scroll", direction: "down" }));
```

MJPEG — drop it in an `<img>` tag for instant video:

```html
<img src="http://localhost:6792/stream/my-session/mjpeg?url=https://example.com" />
```

Requires `BROWSER_STREAM_ENABLED=true`. Each Chromium instance uses ~150-300MB RAM.
```bash
git clone <repo>
cd grub-crawl
cp .env.example .env
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 6792
```

```bash
# Add to .env
AGENT_ENABLED=true
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
AGENT_PROVIDER=anthropic
```

```bash
curl -X POST http://localhost:6792/api/agent/run \
  -H "Content-Type: application/json" \
  -d '{
    "task": "Find the pricing page on example.com and extract plan details",
    "max_steps": 10,
    "allowed_domains": ["example.com"]
  }'
```

```bash
# Single node
./deploy.sh local             # or ./deploy.ps1 -Target local

# 2-node mesh
./deploy.sh mesh              # or ./deploy.ps1 -Target mesh

# Cloud Run
./deploy.sh cloudrun v1.0.0   # or ./deploy.ps1 -Target cloudrun -Tag v1.0.0

# Cloud Run + mesh (connect to local node)
./deploy.sh cloudrun v1.0.0 --mesh-peer http://your-ip:6792 --mesh-secret mykey
```
```bash
# Add to .env
BROWSER_ENGINE=camoufox
STEALTH_ENABLED=true
BLOCK_TRACKING_DOMAINS=true

# Optional: proxy
PROXY_SERVER=http://proxy.example.com:10001
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
```

```bash
# Add to .env
AGENT_GHOST_ENABLED=true
```

```bash
curl -X POST http://localhost:6792/api/agent/ghost \
  -H "Content-Type: application/json" \
  -d '{"url": "https://blocked-site.com"}'
```

```bash
# Add to .env
BROWSER_STREAM_ENABLED=true
BROWSER_POOL_SIZE=2

# MJPEG (open in browser)
open "http://localhost:6792/stream/demo/mjpeg?url=https://example.com"
```

- `HOST` (default: 0.0.0.0)
- `PORT` (default: 6792)
- `DEBUG` (default: false)
- `STORAGE_PATH` (default: ./storage)
- `RUNNING_IN_CLOUD` (default: false)
- `GCS_BUCKET_NAME`
- `GOOGLE_CLOUD_PROJECT`

- `DISABLE_AUTH` (default: false)
- `GNOSIS_AUTH_URL` (default: http://gnosis-auth:5000)

- `BROWSER_ENGINE` — chromium | camoufox (default: chromium)

- `MAX_CONCURRENT_CRAWLS` (default: 5)
- `CRAWL_TIMEOUT` (default: 30)
- `ENABLE_JAVASCRIPT` (default: true)
- `ENABLE_SCREENSHOTS` (default: false)

- `PROXY_SERVER` — proxy URL (e.g. http://proxy:10001)
- `PROXY_USERNAME`
- `PROXY_PASSWORD`
- `PROXY_BYPASS` — comma-separated bypass list

- `STEALTH_ENABLED` (default: false) — playwright-stealth patches
- `BLOCK_TRACKING_DOMAINS` (default: false) — block analytics/tracking requests

- `AGENT_ENABLED` (default: false)
- `AGENT_MAX_STEPS` (default: 12)
- `AGENT_MAX_WALL_TIME_MS` (default: 90000)
- `AGENT_MAX_FAILURES` (default: 3)
- `AGENT_ALLOWED_TOOLS` — comma-separated allowlist
- `AGENT_ALLOWED_DOMAINS` — comma-separated allowlist
- `AGENT_BLOCK_PRIVATE_RANGES` (default: true)
- `AGENT_REDACT_SECRETS` (default: true)

- `AGENT_PROVIDER` — openai | anthropic | ollama (default: openai)
- `OPENAI_API_KEY`
- `OPENAI_MODEL` (default: gpt-4.1-mini)
- `ANTHROPIC_API_KEY`
- `ANTHROPIC_MODEL` (default: claude-3-5-sonnet-latest)
- `OLLAMA_BASE_URL` (default: http://localhost:11434)
- `OLLAMA_MODEL` (default: llama3.1:8b-instruct)

- `AGENT_GHOST_ENABLED` (default: false)
- `AGENT_GHOST_AUTO_TRIGGER` (default: true)
- `AGENT_GHOST_VISION_PROVIDER` — inherits from AGENT_PROVIDER
- `AGENT_GHOST_MAX_IMAGE_WIDTH` (default: 1280)
- `MESH_ENABLED` (default: false) — master switch
- `MESH_PEERS` — comma-separated seed peer URLs
- `MESH_NODE_NAME` — human-readable name (default: hostname)
- `MESH_SECRET` — shared HMAC secret for inter-node auth
- `MESH_ADVERTISE_URL` — URL peers use to reach this node
- `MESH_PREFER_LOCAL` (default: true) — bias toward local execution
- `MESH_HEARTBEAT_INTERVAL_S` (default: 15)
- `MESH_PEER_TIMEOUT_S` (default: 45) — mark unhealthy after this
- `MESH_PEER_REMOVE_S` (default: 120) — remove from peer table after this
- `MESH_REMOTE_TIMEOUT_MS` (default: 35000) — timeout for remote tool calls
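The three timing knobs above imply a simple peer-health ladder, which can be sketched as a pure function. This is illustrative; the real peer table lives in `coordinator.py`, and the state names here are assumptions.

```python
# Sketch of the peer-health thresholds above: heartbeats arrive every 15s,
# a peer is unhealthy after MESH_PEER_TIMEOUT_S (45s) of silence and
# removed after MESH_PEER_REMOVE_S (120s). State names are illustrative.
PEER_TIMEOUT_S = 45
PEER_REMOVE_S = 120

def peer_state(last_heartbeat: float, now: float) -> str:
    """Classify a peer by the age of its last heartbeat."""
    age = now - last_heartbeat
    if age >= PEER_REMOVE_S:
        return "removed"
    if age >= PEER_TIMEOUT_S:
        return "unhealthy"
    return "healthy"
```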
- `BROWSER_POOL_SIZE` (default: 1)
- `BROWSER_STREAM_ENABLED` (default: false)
- `BROWSER_STREAM_QUALITY` (default: 25) — JPEG quality 1-100
- `BROWSER_STREAM_MAX_WIDTH` (default: 854)
- `BROWSER_STREAM_MAX_LEASE_SECONDS` (default: 300)
`POST /api/markdown` returns:

`success`, `url`, `final_url`, `status_code`, `markdown`, `markdown_plain`, `content`, `render_mode`, `wait_strategy`, `timings_ms`, `blocked`, `block_reason`, `captcha_detected`, `http_error_family`, `body_char_count`, `body_word_count`, `visible_char_count`, `visible_word_count`, `visible_similarity`, `quarantined`, `quarantine_reason`, `policy_flags`, `content_quality`, `extractor_version`, `normalized_url`, `content_hash`
- `blocked` — anti-bot/captcha/challenge
- `empty` — very low signal
- `minimal` — thin/error pages
- `sufficient` — usable for summarization

Do not summarize unless `content_quality == "sufficient"`.
- `quarantined=true` means the extractor detected instruction-like text in extracted content that was not present in the page's visible rendered text (common in `.sr-only`/visually-hidden abuse).
- When quarantined, `content_quality` is downgraded to `minimal`, `policy_flags` includes `hidden_text_suspected` and `quarantined`, and `content`/`markdown` outputs are blanked (fail-closed).
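A consumer can enforce the rules above with a small gate over the documented response fields. This is a client-side sketch, not part of the API itself:

```python
# Client-side gate for /api/markdown results: only summarize when the
# content is "sufficient" and neither quarantined nor blocked.
def can_summarize(result: dict) -> bool:
    """Apply the README's summarization rules to a response dict."""
    return (
        result.get("content_quality") == "sufficient"
        and not result.get("quarantined", False)
        and not result.get("blocked", False)
    )
```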
```json
{"error": "http_error|validation_error|internal_error", "status": 400, "details": {}}
```

- Agent core — state machine, types, errors (W1)
- Unified tool contract — dispatcher with timeout/retry (W2)
- Policy gates — domain allowlist, private-range deny, redaction (W3)
- Observability — EventBus, TraceCollector, RunSummary persistence (W4)
- API wiring — `/api/agent/run`, `/api/agent/status`, `JobType.AGENT_RUN` (W5)
- Provider adapters — OpenAI, Anthropic, Ollama with fallback (W6)
- Config flags — agent, provider, ghost, stream settings (W7)
- Cloak-mode trigger detection (W8)
- Screenshot capture pipeline (W8)
- Vision extraction via Claude/GPT-4o (W8)
- Fallback chain in engine (W8)
- Ghost tool for external callers (W8)
- Ghost MCP tool + REST endpoint (W8)
- Persistent browser pool with lease/return (W9)
- CDP screencast relay (W9)
- WebSocket endpoint with interactive commands (W9)
- MJPEG fallback stream (W9)
- Stream status + pool status endpoints (W9)
- Camoufox anti-detect browser engine (W10)
- Per-request proxy with env fallback (W10)
- Stealth patches for Chromium (W10)
- Tracker/analytics domain blocking (W10)
- Anthropic vision format detection fix (W10)
- Peer discovery with gossip (1-hop) (W11)
- HMAC-SHA256 inter-node auth (W11)
- Heartbeat loop with load metrics + seed retry (W11)
- MeshDispatcher — transparent cross-node tool routing (W12)
- Load-based scoring with locality/affinity bonus (W12)
- Deploy scripts — local, mesh, Cloud Run (W12)
- Docker Compose 2-node mesh topology (W12)
- Embedded landing page (grub-site) (W12)
- Unit test suite — 176 tests across all modules
- Error handling improvements
- Monitoring and alerting
- Performance optimization
See MASTER_PLAN.md for the full architecture plan.
Grub Crawler Project License