EchoNotes is a Python-based application that monitors a folder for new files, extracts their content (text, audio, or video), summarizes it using a local LLM instance, and saves the summarized output back to disk. It supports offline operation and handles multiple file formats, including PDFs, Word documents, plain text files, and audio/video files.
- Monitors a directory for new files (PDF, DOCX, TXT, common audio formats, and video formats).
- Text Extraction:
- PDF files (via PyPDF2 and Tesseract for OCR)
- Word documents (via python-docx)
- Plain text files
- Audio files such as MP3, WAV, M4A, AAC, FLAC, OGG, OPUS, WMA, AIFF, MP2, AMR, and AC3
- Video files such as MP4, AVI, MOV, MKV, WEBM, and M4V
- Non-MP3 audio is normalized to MP3 via FFmpeg before transcription
- Summarization:
- Supports explicit LLM providers instead of assuming Ollama-style APIs.
- Supported providers: Open WebUI, Ollama, OpenAI, Claude/Anthropic, OpenRouter, and a legacy generic generate endpoint.
- Supports customizable markdown prompts.
- Offline Operation:
- All processing (text extraction, transcription, summarization) can be done offline.
- Pre-downloads WhisperX ASR models and handles everything locally.
- Background Worker Pool:
- Files are queued immediately by the watcher and processed by persistent workers.
- Each worker keeps its model loaded to avoid per-file startup costs.
- The container is intended to run continuously while monitoring the mounted `incoming/` folder.
- Automatic Transcript Formatting:
- Audio and video transcripts can be reformatted into readable Markdown before saving.
- Formatting falls back to the raw transcript if the formatter fails.
- Speaker Labels:
- WhisperX diarization can label timestamped transcript lines as `Speaker 1`, `Speaker 2`, and so on.
- Speaker diarization requires a Hugging Face token for the pyannote diarization model.
- Automatic Chunking:
- Large transcripts are chunked and reduced automatically so long meetings do not overflow model context windows.
- Obsidian Export:
- Audio/video jobs can copy the final MP3, transcript, summary, and an Obsidian note into `vault/`.
- The Obsidian note includes structured front matter, linked entities, and deterministic sections rendered by Python.
- The Obsidian note template and extraction prompt are loaded from the config area, with image-shipped fallbacks.
- Logging: Extensive logging to help track operations and errors.
- Ingestion Hardening:
- Partial-copy files are held until their size stabilizes.
- Temporary files and hidden dot-paths such as `.obsidian` and `.stfolder` are ignored.
- A periodic fallback rescan picks up files that arrive on mounted folders where filesystem events are unreliable, such as Syncthing or Windows-backed binds.
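The "hold until size stabilizes" behaviour described above can be sketched as a simple polling loop. This is an illustration, not EchoNotes' actual code; `poll_interval` and `stable_checks` are made-up parameter names:

```python
import os
import time

def wait_until_stable(path, poll_interval=1.0, stable_checks=2):
    """Return True once the file size stops changing for `stable_checks`
    consecutive polls; return False if the file disappears mid-copy."""
    last_size = -1
    stable = 0
    while stable < stable_checks:
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file was removed or is not readable yet
        if size == last_size:
            stable += 1
        else:
            stable = 0
            last_size = size
        time.sleep(poll_interval)
    return True
```

Polling file size is filesystem-agnostic, which matters here because the watcher also has to cope with mounts where inotify-style events never fire.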
Create a config directory and place your runtime files there:
```bash
mkdir -p config incoming vault model-cache
cp config.sample.yml config/config.yml
cp summarize-notes.md config/summarize-notes.md
```

Edit `config/config.yml` for your LLM endpoint, model, tokens, and any diarization settings.
Run the persistent worker container:
```bash
docker run -d --name echonotes \
  -v /path/to/incoming:/app/incoming \
  -v /path/to/vault:/app/vault \
  -v /path/to/config:/app/config \
  -v /path/to/model-cache:/app/model-cache \
  echonotes:latest
```

Example from the repo root:
```bash
docker run -d --name echonotes \
  -v "$(pwd)/incoming:/app/incoming" \
  -v "$(pwd)/vault:/app/vault" \
  -v "$(pwd)/config:/app/config" \
  -v "$(pwd)/model-cache:/app/model-cache" \
  echonotes:latest
```

The model cache mount is optional but recommended for local/dev use. If `/app/model-cache` is empty, EchoNotes downloads the configured WhisperX model at startup. If `whisper_model` is not configured, it defaults to `base` on CPU and `small` on GPU.
Published image tags follow this pattern:
- `echonotes:latest`: CPU image
- `echonotes:latest-cuda12.8`: default GPU image
- `echonotes:gpu`: alias for the default GPU image
Versioned releases follow the same split:
- `echonotes:1.4.1`
- `echonotes:1.4.1-cuda12.8`
- Build the Docker Images:
EchoNotes currently supports two local build variants:
- CPU: `echonotes:latest`
- GPU: `echonotes:latest-cuda12.8`
Build the CPU image:
```bash
docker build -t echonotes:latest .
```

The default Docker build is the CPU variant. It uses the official PyTorch CPU wheel path so the image does not pull the CUDA package set.
Build the GPU image:
```bash
docker build \
  --build-arg IMAGE_VARIANT=gpu \
  --build-arg GPU_BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128 \
  --build-arg TORCH_EXTRA_INDEX_URL= \
  --build-arg TORCH_PACKAGE_SPEC=torch==2.8.0 \
  --build-arg TORCHAUDIO_PACKAGE_SPEC=torchaudio==2.8.0 \
  --build-arg TORCH_INSTALL_NO_DEPS=1 \
  --build-arg TORCH_PYTHON_DEPS=filelock,fsspec,jinja2,markupsafe,mpmath,networkx,sympy,typing-extensions \
  -t echonotes:latest-cuda12.8 .
```

Optional local alias for the GPU image:
```bash
docker tag echonotes:latest-cuda12.8 echonotes:gpu
```
If you want both local variants available in one pass, run both commands above.
The GPU build uses the NVIDIA CUDA runtime image directly and keeps only the extra CUDA libraries that PyTorch still needs beyond that base, which avoids duplicating the full `nvidia-*` pip wheel bundle.
- Run the Docker Container:
Run the long-lived worker container with the runtime mounts:

```bash
docker run -d --name echonotes \
  -v /path/to/incoming:/app/incoming \
  -v /path/to/vault:/app/vault \
  -v /path/to/config:/app/config \
  -v /path/to/model-cache:/app/model-cache \
  echonotes:latest
```
- Pre-download WhisperX Models:
The recommended way to pre-download WhisperX models is to mount a host cache directory to `/app/model-cache` and warm it before your first long run.

Build the CPU image and warm an empty local cache:

```bash
./build.sh --model-cache-dir ./model-cache
```
Build the GPU image and warm an empty local cache through the NVIDIA runtime:
```bash
./build.sh --gpu --model-cache-dir ./model-cache
```
Those commands build the image first, then warm the cache. `build.sh` only performs the automatic warmup step when the target cache directory is empty, unless you explicitly pass `--model`.

To warm the cache without building:

```bash
./build.sh --warm-model-cache-only --model-cache-dir ./model-cache
```
If you omit `--model`, `build.sh` presents an interactive list of supported WhisperX models and lets you choose one. To skip the prompt:

```bash
./build.sh --warm-model-cache-only --model large-v3 --model-cache-dir ./model-cache
./build.sh --warm-model-cache-only --gpu --model turbo --model-cache-dir ./model-cache
```
To print the currently supported model names:
```bash
./build.sh --list-models
```
To pre-download without rebuilding, use the warm-only mode:

```bash
./build.sh --warm-model-cache-only --model-cache-dir ./model-cache
./build.sh --warm-model-cache-only --gpu --model-cache-dir ./model-cache
```
You can also warm the cache by starting the real worker once with the cache mounted and then stopping it after startup finishes:
```bash
docker run --rm \
  -v "$(pwd)/incoming:/app/incoming" \
  -v "$(pwd)/vault:/app/vault" \
  -v "$(pwd)/config:/app/config" \
  -v "$(pwd)/model-cache:/app/model-cache" \
  echonotes:latest
```
For GPU:
```bash
docker run --rm --gpus all \
  -v "$(pwd)/incoming:/app/incoming" \
  -v "$(pwd)/vault:/app/vault" \
  -v "$(pwd)/config:/app/config" \
  -v "$(pwd)/model-cache:/app/model-cache" \
  echonotes:latest-cuda12.8
```
To override the PyTorch wheel source during build:
```bash
./build.sh --torch-index-url https://download.pytorch.org/whl/cpu
```
You can use Docker Compose to manage the container:
```yaml
version: '3.8'
services:
  echonotes:
    image: echonotes:latest
    volumes:
      - ./incoming:/app/incoming
      - ./vault:/app/vault
      - ./config:/app/config
      - ./model-cache:/app/model-cache
    restart: unless-stopped
```

Run the service with:

```bash
docker-compose up -d
```

For a GPU host, switch the image tag to `echonotes:gpu` or `echonotes:latest-cuda12.8` and add the appropriate GPU runtime settings for your Docker installation.
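As one possible shape for those GPU runtime settings, newer Compose versions accept a device reservation like the sketch below; older installations may instead need `runtime: nvidia`, so treat this as a starting point rather than the canonical setup:

```yaml
services:
  echonotes:
    image: echonotes:latest-cuda12.8
    volumes:
      - ./incoming:/app/incoming
      - ./vault:/app/vault
      - ./config:/app/config
      - ./model-cache:/app/model-cache
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```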
EchoNotes monitors the /app/incoming directory continuously. When it detects a new file, it processes it according to the file type:
- PDF: Extracts text using PyPDF2 or OCR via Tesseract if needed.
- Word Documents (DOCX): Extracts text using `python-docx`.
- Text Files (TXT): Reads the plain text.
- Audio Files: Common audio inputs are converted to MP3 with FFmpeg when needed, then transcribed with WhisperX.
- Video Files: Extracts audio using FFmpeg, then transcribes it with WhisperX.
Once the text is extracted, it can be summarized by sending the text and a customizable markdown prompt to a configured LLM provider. If no LLM provider is configured, EchoNotes will still extract and transcribe files, but it will skip LLM-based formatting and summarization.
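As an illustration of how prompt and text combine (EchoNotes' internal client is not shown here), a request for an OpenAI-compatible endpoint such as Open WebUI could be assembled like this; the function name and payload structure are assumptions, not the project's actual API:

```python
def build_chat_request(prompt_md, extracted_text, model, max_tokens=2048):
    """Combine the markdown prompt with the extracted text into an
    OpenAI-style chat completion payload."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": prompt_md},   # instructions from summarize-notes.md
            {"role": "user", "content": extracted_text},  # text pulled from the file
        ],
    }

payload = build_chat_request("Summarize these notes.", "Meeting transcript...", "gpt-4o-mini")
```

Keeping the prompt in the system message and the document in the user message mirrors how most chat-completion providers expect instructions to be separated from content.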
Files placed in hidden folders or dot-paths such as `.obsidian`, `.stfolder`, or `.syncthing*` are ignored. Temporary partial-download files such as `.part`, `.tmp`, and `.crdownload` are also ignored.
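A minimal sketch of that ignore filter; the exact pattern list in EchoNotes may differ:

```python
from pathlib import Path

# Illustrative suffix list; the real application may cover more patterns.
IGNORED_SUFFIXES = {".part", ".tmp", ".crdownload"}

def should_ignore(path):
    """True if any path component is a dot-path or the file looks temporary."""
    p = Path(path)
    if any(part.startswith(".") for part in p.parts):
        return True  # covers .obsidian, .stfolder, .syncthing* and similar
    return p.suffix.lower() in IGNORED_SUFFIXES
```

Checking every path component (not just the filename) is what lets a file deep inside `.obsidian/` be skipped as well.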
Audio
- Detect supported audio file in incoming folder
- Wait until file size becomes stable
- Move file into working folder
- Convert to MP3 if needed
- Transcribe with WhisperX
- Align timestamps and diarize speakers
- Format transcript with LLM if configured
- Generate summary with LLM if configured
- Create Obsidian note and copy vault artifacts
- Move originals and outputs to completed
Video
- Detect supported video file in incoming folder
- Wait until file size becomes stable
- Move file into working folder
- Extract audio to MP3 with FFmpeg
- Transcribe with WhisperX
- Align timestamps and diarize speakers
- Format transcript with LLM if configured
- Generate summary with LLM if configured
- Create Obsidian note and copy vault artifacts
- Move originals and outputs to completed
Documents
- Detect supported document file in incoming folder
- Wait until file size becomes stable
- Move file into working folder
- Extract text from PDF, DOCX, or TXT
- Use OCR fallback for image PDFs
- Generate summary with LLM if configured
- Move originals and outputs to completed
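The three pipelines above share the same shape and differ mainly in the extraction step. A sketch of the dispatch by extension (the extension sets are abbreviated here; the full supported lists appear in the Features section):

```python
# Abbreviated extension sets for illustration only.
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
DOC_EXTS = {".pdf", ".docx", ".txt"}

def classify(filename):
    """Map a filename to its processing pipeline, or None if unsupported."""
    parts = filename.lower().rsplit(".", 1)
    ext = "." + parts[1] if len(parts) == 2 else ""
    if ext in AUDIO_EXTS:
        return "audio"      # convert to MP3 if needed, then transcribe
    if ext in VIDEO_EXTS:
        return "video"      # extract audio with FFmpeg, then transcribe
    if ext in DOC_EXTS:
        return "document"   # extract text, OCR fallback for image PDFs
    return None
```

Everything after classification (stability wait, working folder, LLM steps, move to completed) is common to all three types.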
The application is configured via /app/config/config.yml. The image also includes baked defaults in /app/config-defaults, so if a prompt file is missing from the mounted config directory EchoNotes falls back to the image default where available. An example configuration file is shown below:
```yaml
path_to_watch: "/app/incoming"
incoming_rescan_interval_seconds: 5  # Periodic fallback scan for missed filesystem events; set 0 to disable
llm:
  provider: "openwebui"
  model: "gpt-4o-mini"
  base_url: "http://openwebui:3000/api"
  api_key: "your_api_token_here"
  timeout_seconds: null
  max_tokens: 2048
whisper_model: "base"          # Optional; defaults to 'base' on CPU and 'small' on GPU when omitted
whisper_batch_size: null       # Optional; defaults to 16 on GPU and 4 on CPU
whisper_min_batch_size: 1      # On GPU, OOM retries back off down to this batch size
gpu_oom_fallback: "cpu"        # One of: cpu, fail
worker_count: 2                # Number of background workers to run concurrently; on GPU start with 1
diarization_enabled: true      # Enable WhisperX speaker diarization when configured
diarization_hf_token: ""       # Required for speaker labels via pyannote diarization
diarization_model_name: ""     # Optional; leave blank for WhisperX default
diarization_num_speakers: null # Optional exact speaker count
diarization_min_speakers: null # Optional lower bound
diarization_max_speakers: null # Optional upper bound
format_transcripts: true       # Format audio/video transcripts into readable Markdown before summarization
transcript_format_prompt_path: "/app/config/format-transcript.md"  # Optional; built-in prompt is used if missing
summary_prompt_path: "/app/config/summarize-notes.md"  # Optional; falls back to /app/config-defaults/summarize-notes.md
vault_path: "/app/vault"       # Folder where Obsidian-ready artifacts are copied
obsidian_extract_prompt_path: "/app/config/obsidian-extract.md"  # Optional; falls back to /app/config-defaults/obsidian-extract.md
obsidian_template_path: "/app/config/obsidian-template.md"  # Optional; defaults next to summarize-notes.md or a built-in template
chunking:
  enabled: true
  max_input_chars: 24000
  target_chunk_chars: 16000
  overlap_chars: 400
```

Put your custom runtime files in the mounted `/app/config` directory:
- `config.yml`
- `summarize-notes.md`
- `format-transcript.md`
- `obsidian-extract.md`
- `obsidian-template.md`
If you mount /app/model-cache, WhisperX downloads are reused across container rebuilds and restarts. This is especially useful for local/dev Docker workflows.
The summarization prompt (summarize-notes.md) is used to prepend instructions for summaries. If you want to customize transcript formatting, place format-transcript.md in the same config directory and point transcript_format_prompt_path at it. If no transcript-format prompt exists there, EchoNotes uses a built-in transcript-formatting prompt.
For Obsidian note enrichment, obsidian-extract.md instructs the LLM to return structured JSON for front matter and note sections. Python then renders the final note deterministically from that JSON.
If you mount an Obsidian vault folder at vault_path, EchoNotes also copies audio-ready artifacts there for audio and video jobs:
- The final MP3
- The full transcript markdown
- The summary markdown
- An Obsidian note markdown file
If obsidian_template_path is not provided, EchoNotes looks for obsidian-template.md next to the summarization prompt. The image now ships a default obsidian-template.md in /app/config-defaults, and if that file is unavailable EchoNotes falls back to the built-in plain template.
Structured Obsidian notes can include:
- front matter fields such as `detected_language`, `inferred_people`, `inferred_projects`, and `inferred_topics`
- deterministic tags such as `echonotes`, `audio` or `video`, `transcript`, `summary`, and `diarized`
- linked entity sections like `[[People/Rob]]`, `[[Projects/Vancity]]`, and `[[Topics/Financial Planning]]`
- deterministic sections for main ideas, decisions, action items, challenges and risks, and next steps
For speaker labels in audio/video transcripts, configure diarization_hf_token. When diarization is available, EchoNotes writes labels like Speaker 1 and Speaker 2 into the timestamped transcript lines.
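Diarization backends such as pyannote typically tag segments with raw ids like `SPEAKER_00`; the renaming into `Speaker 1`, `Speaker 2` in order of first appearance can be sketched like this (illustrative, not the exact EchoNotes code):

```python
def label_speakers(segments):
    """Replace raw diarization ids with 'Speaker N', numbered by first appearance."""
    labels = {}
    out = []
    for seg in segments:
        raw = seg.get("speaker", "SPEAKER_00")
        if raw not in labels:
            labels[raw] = f"Speaker {len(labels) + 1}"
        out.append({**seg, "speaker": labels[raw]})
    return out

lines = label_speakers([
    {"start": 0.0, "text": "Hello.", "speaker": "SPEAKER_01"},
    {"start": 2.5, "text": "Hi there.", "speaker": "SPEAKER_00"},
    {"start": 4.0, "text": "Shall we start?", "speaker": "SPEAKER_01"},
])
```

Numbering by first appearance keeps the labels stable across a transcript regardless of which raw id the backend assigned first.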
On GPU, WhisperX transcription automatically retries with smaller batch sizes if it runs out of memory. If GPU retries still fail and gpu_oom_fallback is set to cpu, EchoNotes retries that file on CPU instead of leaving the queue blocked.
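The retry behaviour can be sketched as a backoff loop; `transcribe` stands in for the WhisperX call, and the halving strategy (plus using `MemoryError` in place of a CUDA OOM exception) is an assumption for illustration:

```python
def transcribe_with_backoff(transcribe, batch_size, min_batch_size=1, cpu_fallback=True):
    """Retry transcribe(batch_size, device) with halved batch sizes on GPU
    OOM, then optionally fall back to CPU."""
    size = batch_size
    while size >= min_batch_size:
        try:
            return transcribe(size, "cuda")
        except MemoryError:  # stand-in for a CUDA out-of-memory error
            size //= 2       # back off and retry with a smaller batch
    if cpu_fallback:
        return transcribe(batch_size, "cpu")  # matches gpu_oom_fallback: cpu
    raise RuntimeError("GPU transcription failed and CPU fallback is disabled")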
llm.provider is optional. If it is empty, EchoNotes does not call any LLM provider.
openwebui:base_urlshould usually look likehttp://host:3000/apiollama:base_urlshould usually look likehttp://host:11434/apiopenai:base_urlshould usually look likehttps://api.openai.com/v1claudeoranthropic:base_urlshould usually look likehttps://api.anthropic.comopenrouter:base_urlshould usually look likehttps://openrouter.ai/api/v1legacy_generate: keeps compatibility with the older single-endpointapi_urlstyle config
Chunking settings under chunking: apply to LLM-based transcript formatting and summarization.
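A hedged sketch of character-based chunking with overlap, reusing the config key names from the example above (the actual splitter may break on sentence or paragraph boundaries instead of raw character offsets):

```python
def chunk_text(text, target_chunk_chars=16000, overlap_chars=400):
    """Split text into overlapping chunks so each LLM call stays in context."""
    if len(text) <= target_chunk_chars:
        return [text]  # small enough for a single call
    chunks = []
    step = target_chunk_chars - overlap_chars  # advance less than a full chunk
    start = 0
    while start < len(text):
        chunks.append(text[start:start + target_chunk_chars])
        start += step
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, which helps the model keep continuity when the per-chunk summaries are reduced into a final result.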
Set llm.timeout_seconds: null (or 0) to disable the HTTP timeout and let slow local models run until they finish.
The application logs all activities and errors to help with debugging and tracking its operations. The log includes details about:
- Files processed
- Errors encountered
- Summaries generated
- incoming: Monitored input mount where new files are placed for processing.
- working: Temporary folder where files are processed.
- completed: Once processed, files and generated artifacts are moved under `incoming/completed`.
- vault: Output mount where the final MP3, transcript, summary, and Obsidian note are copied.
- config: Mounted runtime config directory for `config.yml` and prompt/template overrides.
- model-cache: Optional mounted cache directory for WhisperX, Hugging Face, and related model downloads.
We welcome contributions to EchoNotes! Please fork the repository and submit a pull request with your changes.
EchoNotes is licensed under the MIT License.