Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK #125

KylinMountain wants to merge 5 commits into VectifyAI:main
Conversation
- Add PageIndexClient with index/retrieve/query workflow
- Add workspace parameter for automatic JSON-based persistence
- Add query_stream() for token-level streaming output
- Add ChatGPT_API_stream() generator in utils.py
- Add test_client.py demo using DeepSeek-R1 paper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add tests/pdfs/*.pdf to .gitignore (PDF is downloaded by test_client.py)
- Add verbose=True to query_agent(): streams tool calls with name/args/output
- Fix asyncio usage for run_streamed() (not an async context manager)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Test result:
Pull request overview
Adds a retrieval + QA layer on top of the existing PageIndex tree builders by introducing tool-style retrieval functions and a high-level PageIndexClient that uses the OpenAI Agents SDK to autonomously navigate document structure and fetch relevant page/line content for answering questions.
Changes:

- Added pageindex/retrieve.py with 3 JSON-returning retrieval tools: document metadata, token-efficient structure, and page/line content retrieval.
- Added pageindex/client.py implementing PageIndexClient (indexing, workspace persistence, and agent-driven querying).
- Added a runnable demo script (test_client.py) and updated exports/dependencies (pageindex/__init__.py, requirements.txt).
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test_client.py | Demo script that downloads a PDF, indexes it, and runs agent queries including workspace reload. |
| requirements.txt | Adds openai-agents dependency for the agent-based client. |
| pageindex/utils.py | Adds streaming helper and tree/node printing/mapping utilities. |
| pageindex/retrieve.py | Implements the 3 retrieval tool functions (metadata, structure, page/line content). |
| pageindex/client.py | Introduces PageIndexClient with indexing, persistence, and OpenAI Agents SDK integration. |
| pageindex/__init__.py | Exposes retrieval tools and PageIndexClient at package top-level. |
```python
        return asyncio.run(_run_verbose())
```
Verbose mode also calls asyncio.run(_run_verbose()), which has the same event-loop reentrancy problem and will break in Jupyter/async servers. Making query_agent async (or offering a separate async entrypoint) would avoid this runtime failure.
Suggested change:

```python
        # In synchronous contexts without a running event loop, it is safe to use asyncio.run.
        # If an event loop is already running (e.g., Jupyter or an async server), instruct
        # callers to use the async_query_agent coroutine instead to avoid event-loop reentrancy.
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(_run_verbose())
        else:
            raise RuntimeError(
                "query_agent(verbose=True) cannot be used from within an existing asyncio "
                "event loop. Use 'await async_query_agent(..., verbose=True)' instead."
            )

    async def async_query_agent(self, doc_id: str, prompt: str, verbose: bool = False) -> str:
        """
        Async variant of query_agent.

        This coroutine can be safely awaited from within an existing asyncio event loop,
        including Jupyter notebooks and async servers.
        """
        client_self = self

        @function_tool
        def get_document() -> str:
            """Get document metadata: status, page count, name, and description."""
            return client_self.get_document(doc_id)

        @function_tool
        def get_document_structure() -> str:
            """Get the document's full tree structure (without text) to find relevant sections."""
            return client_self.get_document_structure(doc_id)

        @function_tool
        def get_page_content(pages: str) -> str:
            """
            Get the text content of specific pages or line numbers.

            Use tight ranges: e.g. '5-7' for pages 5 to 7, '3,8' for pages 3 and 8, '12' for page 12.
            For Markdown documents, use line numbers from the structure's line_num field.
            """
            return client_self.get_page_content(doc_id, pages)

        agent = Agent(
            name="PageIndex",
            instructions=AGENT_SYSTEM_PROMPT,
            tools=[get_document, get_document_structure, get_page_content],
            model=self.model,
        )

        if not verbose:
            # Run the synchronous Runner.run_sync in a background thread to avoid
            # blocking the event loop.
            result = await asyncio.to_thread(Runner.run_sync, agent, prompt)
            return result.final_output

        # Verbose mode: stream events and print tool calls.
        async def _run_verbose():
            turn = 0
            stream = Runner.run_streamed(agent, prompt)
            async for event in stream.stream_events():
                if not isinstance(event, RunItemStreamEvent):
                    continue
                if event.name == "tool_called":
                    turn += 1
                    raw = event.item.raw_item
                    args = getattr(raw, "arguments", "{}")
                    print(f"\n[Turn {turn}] → {raw.name}({args})")
                elif event.name == "tool_output":
                    output = str(event.item.output)
                    preview = output[:200] + "..." if len(output) > 200 else output
                    print(f"  ← {preview}")
            return stream.final_output

        return await _run_verbose()
```
```python
import os
import uuid
import json
import asyncio
from pathlib import Path
from typing import List, Dict, Any, Optional

from agents import Agent, Runner, function_tool
from agents.stream_events import RunItemStreamEvent

from .page_index import page_index
from .page_index_md import md_to_tree
from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
from .utils import remove_fields, create_node_mapping
```
There are several unused imports here (List, Dict, Any, Optional, plus remove_fields / create_node_mapping). Please remove unused imports to keep the module tidy (and avoid failing lint/CI if enabled).
```python
from .page_index_md import md_to_tree
from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
from .client import PageIndexClient
```
Importing PageIndexClient at package import time makes openai-agents (and its agents module) a hard requirement even for users who only want the core indexing utilities. If you want Agents support to be optional, consider moving this import behind a try/except (and raising a helpful error only when PageIndexClient is used) or exposing it via a separate module/extra.
Suggested change:

```python
try:
    # Importing PageIndexClient may require optional dependencies (e.g., openai-agents).
    # Wrap this in a try/except so that importing the core package does not
    # force those optional dependencies to be installed.
    from .client import PageIndexClient
except ImportError as _pageindex_client_import_error:
    class _MissingPageIndexClient:
        """
        Placeholder for PageIndexClient when its optional dependencies are not installed.

        Any attempt to instantiate or otherwise use this object will raise a clear
        ImportError explaining how to enable this functionality.
        """
        def __init__(self, error: ImportError) -> None:
            self._error = error

        def __call__(self, *args, **kwargs):
            raise ImportError(
                "PageIndexClient is unavailable because its optional dependencies "
                "could not be imported. Install the 'openai-agents' extra (or any "
                "other required optional dependencies) to use PageIndexClient."
            ) from self._error

        def __getattr__(self, name):
            raise ImportError(
                "PageIndexClient is unavailable because its optional dependencies "
                "could not be imported. Install the 'openai-agents' extra (or any "
                "other required optional dependencies) to use PageIndexClient."
            ) from self._error

    PageIndexClient = _MissingPageIndexClient(_pageindex_client_import_error)  # type: ignore[assignment]
```
```python
def _parse_pages(pages: str) -> list[int]:
    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints."""
    result = []
    for part in pages.split(','):
        part = part.strip()
        if '-' in part:
            start, end = part.split('-', 1)
            result.extend(range(int(start.strip()), int(end.strip()) + 1))
        else:
            result.append(int(part))
```
_parse_pages() will happily accept reversed ranges like "7-5" (yielding an empty range), non-positive page numbers, and very large ranges that can allocate huge lists. Consider validating start <= end, rejecting values < 1, and enforcing a max range size (returning a clear error JSON) so tool calls fail loudly instead of silently returning empty content or causing memory spikes.
Suggested change:

```python
# Hard cap on how many pages can be requested/expanded in a single call.
# This prevents accidental or malicious requests like "1-1000000000" from
# causing huge in-memory lists to be allocated.
MAX_PAGE_RANGE = 10000

def _parse_pages(pages: str) -> list[int]:
    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints.

    Validation rules:
    - Page numbers must be positive integers (>= 1).
    - Ranges must have start <= end.
    - The total number of pages produced may not exceed MAX_PAGE_RANGE.

    Raises:
        ValueError: If the input is malformed or violates any of the rules above.
    """
    if pages is None:
        raise ValueError("pages specification must be a non-empty string")
    pages_str = str(pages).strip()
    if not pages_str:
        # Empty pages string: treat as no pages requested.
        return []
    result: list[int] = []
    total_count = 0
    for part in pages_str.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start_str, end_str = part.split('-', 1)
            try:
                start = int(start_str.strip())
                end = int(end_str.strip())
            except ValueError as exc:
                raise ValueError(f"Invalid page range '{part}': not an integer") from exc
            if start < 1 or end < 1:
                raise ValueError(f"Invalid page range '{part}': pages must be >= 1")
            if start > end:
                raise ValueError(f"Invalid page range '{part}': start must be <= end")
            range_len = end - start + 1
            if total_count + range_len > MAX_PAGE_RANGE:
                raise ValueError(
                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
                )
            result.extend(range(start, end + 1))
            total_count += range_len
        else:
            try:
                page_num = int(part)
            except ValueError as exc:
                raise ValueError(f"Invalid page number '{part}': not an integer") from exc
            if page_num < 1:
                raise ValueError(f"Invalid page number '{part}': must be >= 1")
            if total_count + 1 > MAX_PAGE_RANGE:
                raise ValueError(
                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
                )
            result.append(page_num)
            total_count += 1
```
| """Extract text for specific PDF pages (1-indexed).""" | ||
| pdf_pages = get_page_tokens(doc_info['path']) | ||
| total = len(pdf_pages) | ||
| results = [] | ||
| for p in page_nums: | ||
| if 1 <= p <= total: | ||
| results.append({'page': p, 'content': pdf_pages[p - 1][0]}) |
For PDFs, _get_pdf_page_content() calls get_page_tokens() which extracts and tokenizes all pages on every tool call, even when only a few pages are requested. This is likely to be a major slowdown when the agent iterates. Consider caching the page list per doc_id during indexing, or extracting only the requested pages (without tokenization) for this tool.
| """Extract text for specific PDF pages (1-indexed).""" | |
| pdf_pages = get_page_tokens(doc_info['path']) | |
| total = len(pdf_pages) | |
| results = [] | |
| for p in page_nums: | |
| if 1 <= p <= total: | |
| results.append({'page': p, 'content': pdf_pages[p - 1][0]}) | |
| """Extract text for specific PDF pages (1-indexed) without tokenizing all pages.""" | |
| path = doc_info['path'] | |
| total_pages = get_number_of_pages(path) | |
| # Keep only valid page numbers within the document range. | |
| valid_pages = [p for p in page_nums if 1 <= p <= total_pages] | |
| if not valid_pages: | |
| return [] | |
| # Fetch text only for the requested pages to avoid processing the entire document. | |
| page_texts = get_text_of_pdf_pages(path, valid_pages) | |
| results = [] | |
| for p, text in zip(valid_pages, page_texts): | |
| results.append({'page': p, 'content': text}) |
```python
r = requests.get(PDF_URL)
with open(PDF_PATH, "wb") as f:
    f.write(r.content)
```
The PDF download does not set a timeout or validate the HTTP response. Add a reasonable timeout and call raise_for_status() (and ideally download with stream=True) to avoid hangs and silently writing error pages to disk.
Suggested change:

```python
with requests.get(PDF_URL, stream=True, timeout=30) as r:
    r.raise_for_status()
    with open(PDF_PATH, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
```
```python
    from .utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
except ImportError:
    from utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
```
These imports include get_text_of_pdf_pages and structure_to_list, but they are not used in this module. Removing unused imports will avoid confusion and keeps the tool module minimal.
Suggested change:

```python
    from .utils import get_page_tokens, get_number_of_pages, remove_fields
except ImportError:
    from utils import get_page_tokens, get_number_of_pages, remove_fields
```
| elif mode == "md" or (mode == "auto" and is_md): | ||
| print(f"Indexing Markdown: {file_path}") | ||
| result = asyncio.run(md_to_tree( | ||
| md_path=file_path, | ||
| if_thinning=False, | ||
| if_add_node_summary='yes', | ||
| summary_token_threshold=200, | ||
| model=self.model, | ||
| if_add_doc_description='yes', | ||
| if_add_node_text='yes', | ||
| if_add_node_id='yes' | ||
| )) | ||
| self.documents[doc_id] = { |
asyncio.run(md_to_tree(...)) will raise RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks / async apps. Consider providing an async index_async() / query_agent_async() API and using await when already inside an event loop (or using a loop-aware helper) so the client works in common interactive environments.
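One loop-aware pattern for the fix suggested here is to fall back to a worker thread when a loop is already running. The sketch below is illustrative only; `run_coro_blocking` and `build_tree` are hypothetical names, not part of this PR:

```python
import asyncio
import concurrent.futures

def run_coro_blocking(coro):
    """Run a coroutine from synchronous code, whether or not a loop is running.

    Outside an event loop this is plain asyncio.run(); inside one (Jupyter,
    async servers) it runs the coroutine on a fresh loop in a worker thread
    instead of raising "asyncio.run() cannot be called from a running event loop".
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def build_tree():
    # Stand-in for md_to_tree(...): returns a dummy result.
    await asyncio.sleep(0)
    return {"title": "root"}

# Works from plain synchronous code...
print(run_coro_blocking(build_tree()))

# ...and from inside an already-running event loop.
async def main():
    return run_coro_blocking(build_tree())

print(asyncio.run(main()))
```

Note that the worker-thread branch still blocks the caller's event loop while waiting for `.result()`, so a dedicated async API (as the comment suggests) remains the cleaner fix.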
What this PR adds
The upstream library provides page_index() and md_to_tree() for building document tree structures, but has no retrieval or QA layer. This PR adds that layer.

New: pageindex/retrieve.py — 3 retrieval tool functions

Three functions that expose structured document access:

- tool_get_document(documents, doc_id) — metadata (name, description, type, page count)
- tool_get_document_structure(documents, doc_id) — full tree JSON without text (token-efficient)
- tool_get_page_content(documents, doc_id, pages) — page text by range ("5-7", "3,8", "12")

Works with both PDF (page numbers) and Markdown (line numbers).
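The JSON-in/JSON-out contract of these tools can be pictured with a toy sketch. The registry layout and field names below are illustrative assumptions, not the PR's actual implementation:

```python
import json

# Toy in-memory registry; the real tools read the client's indexed documents.
documents = {
    "doc-1": {
        "name": "DeepSeek-R1",
        "type": "pdf",
        "page_count": 3,
        "pages": ["page one text", "page two text", "page three text"],
    }
}

def tool_get_document(documents: dict, doc_id: str) -> str:
    """Return document metadata as a JSON string."""
    doc = documents[doc_id]
    return json.dumps({k: doc[k] for k in ("name", "type", "page_count")})

def tool_get_page_content(documents: dict, doc_id: str, pages: str) -> str:
    """Return text for a page spec like '5-7', '3,8', or '12' (1-indexed)."""
    nums = []
    for part in pages.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            nums.extend(range(int(start), int(end) + 1))
        else:
            nums.append(int(part))
    doc = documents[doc_id]
    return json.dumps([{"page": n, "content": doc["pages"][n - 1]} for n in nums])

print(tool_get_document(documents, "doc-1"))
print(tool_get_page_content(documents, "doc-1", "1-2"))
```

Returning JSON strings (rather than Python objects) keeps the tool outputs directly consumable by an LLM agent runtime.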
New: pageindex/client.py — PageIndexClient

High-level SDK client:

- index(file_path) — index a PDF or Markdown file, returns doc_id
- query_agent(doc_id, prompt, verbose=False) — runs an OpenAI Agents SDK agent that calls the 3 tools autonomously to answer the question
- query(doc_id, prompt) / query_stream(doc_id, prompt) — convenience wrappers
- workspace parameter for JSON-based persistence across sessions

Demo: OpenAI Agents SDK
The agent navigates the document structure itself — no manual retrieval logic needed:
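A dependency-free sketch of that idea: closures bind doc_id into per-document tools, and an agent loop (simulated here with a fixed call sequence) picks which tool to call next. The helper names are illustrative, not the PR's API:

```python
import json

def make_tools(documents: dict, doc_id: str) -> dict:
    """Bind doc_id into zero-argument-friendly tools, mirroring how the
    client's @function_tool closures capture the document (names illustrative)."""
    def get_document() -> str:
        meta = {k: v for k, v in documents[doc_id].items() if k != "pages"}
        return json.dumps(meta)

    def get_page_content(pages: str) -> str:
        nums = [int(p) for p in pages.split(",")]
        return json.dumps([documents[doc_id]["pages"][n - 1] for n in nums])

    return {"get_document": get_document, "get_page_content": get_page_content}

documents = {"doc-1": {"name": "DeepSeek-R1", "pages": ["intro", "method", "results"]}}
tools = make_tools(documents, "doc-1")

# Simulated agent trajectory: inspect metadata, then fetch a page.
for name, args in [("get_document", ()), ("get_page_content", ("2",))]:
    print(f"→ {name}{args}: {tools[name](*args)}")
```

In the real client, the OpenAI Agents SDK decides this trajectory itself based on the question and the tool docstrings.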
With verbose=True, each tool call (name, args, result preview) is printed in real time.

Test plan
- pip install openai-agents
- python test_client.py — downloads DeepSeek-R1 PDF, indexes it, runs agent query
- client.query_agent(doc_id, "...", verbose=True) — observe tool call sequence
- PageIndexClient(workspace=...) — query works without re-indexing
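The workspace behavior in the last step amounts to JSON-based persistence of indexed documents. A minimal sketch of the idea, with a one-file-per-doc_id layout that is an assumption rather than the PR's actual format:

```python
import json
import tempfile
from pathlib import Path

class Workspace:
    """Persist indexed documents as one JSON file per doc_id so a later
    session can query without re-indexing (layout illustrative)."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, doc_id: str, record: dict) -> None:
        (self.root / f"{doc_id}.json").write_text(json.dumps(record))

    def load(self, doc_id: str) -> dict:
        return json.loads((self.root / f"{doc_id}.json").read_text())

root = tempfile.mkdtemp()
Workspace(root).save("doc-1", {"name": "DeepSeek-R1", "type": "pdf"})

# A "second session" reloads the record from disk.
print(Workspace(root).load("doc-1"))
```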