Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK #125

KylinMountain wants to merge 5 commits into VectifyAI:main
Conversation
- Add PageIndexClient with index/retrieve/query workflow
- Add workspace parameter for automatic JSON-based persistence
- Add query_stream() for token-level streaming output
- Add ChatGPT_API_stream() generator in utils.py
- Add test_client.py demo using DeepSeek-R1 paper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add tests/pdfs/*.pdf to .gitignore (PDF is downloaded by test_client.py)
- Add verbose=True to query_agent(): streams tool calls with name/args/output
- Fix asyncio usage for run_streamed() (not an async context manager)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Test result:
Pull request overview
Adds a retrieval + QA layer on top of the existing PageIndex tree builders by introducing tool-style retrieval functions and a high-level PageIndexClient that uses the OpenAI Agents SDK to autonomously navigate document structure and fetch relevant page/line content for answering questions.
Changes:

- Added pageindex/retrieve.py with 3 JSON-returning retrieval tools: document metadata, token-efficient structure, and page/line content retrieval.
- Added pageindex/client.py implementing PageIndexClient (indexing, workspace persistence, and agent-driven querying).
- Added a runnable demo script (test_client.py) and updated exports/dependencies (pageindex/__init__.py, requirements.txt).
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test_client.py | Demo script that downloads a PDF, indexes it, and runs agent queries including workspace reload. |
| requirements.txt | Adds openai-agents dependency for the agent-based client. |
| pageindex/utils.py | Adds streaming helper and tree/node printing/mapping utilities. |
| pageindex/retrieve.py | Implements the 3 retrieval tool functions (metadata, structure, page/line content). |
| pageindex/client.py | Introduces PageIndexClient with indexing, persistence, and OpenAI Agents SDK integration. |
| pageindex/__init__.py | Exposes retrieval tools and PageIndexClient at package top-level. |
```python
        return asyncio.run(_run_verbose())
```
Verbose mode also calls asyncio.run(_run_verbose()), which has the same event-loop reentrancy problem and will break in Jupyter/async servers. Making query_agent async (or offering a separate async entrypoint) would avoid this runtime failure.
Suggested change:

```python
        # In synchronous contexts without a running event loop, it is safe to use asyncio.run.
        # If an event loop is already running (e.g., Jupyter or an async server), instruct
        # callers to use the async_query_agent coroutine instead to avoid event-loop reentrancy.
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(_run_verbose())
        else:
            raise RuntimeError(
                "query_agent(verbose=True) cannot be used from within an existing asyncio "
                "event loop. Use 'await async_query_agent(..., verbose=True)' instead."
            )

    async def async_query_agent(self, doc_id: str, prompt: str, verbose: bool = False) -> str:
        """
        Async variant of query_agent.

        This coroutine can be safely awaited from within an existing asyncio event loop,
        including Jupyter notebooks and async servers.
        """
        client_self = self

        @function_tool
        def get_document() -> str:
            """Get document metadata: status, page count, name, and description."""
            return client_self.get_document(doc_id)

        @function_tool
        def get_document_structure() -> str:
            """Get the document's full tree structure (without text) to find relevant sections."""
            return client_self.get_document_structure(doc_id)

        @function_tool
        def get_page_content(pages: str) -> str:
            """
            Get the text content of specific pages or line numbers.

            Use tight ranges: e.g. '5-7' for pages 5 to 7, '3,8' for pages 3 and 8, '12' for page 12.
            For Markdown documents, use line numbers from the structure's line_num field.
            """
            return client_self.get_page_content(doc_id, pages)

        agent = Agent(
            name="PageIndex",
            instructions=AGENT_SYSTEM_PROMPT,
            tools=[get_document, get_document_structure, get_page_content],
            model=self.model,
        )

        if not verbose:
            # Run the synchronous Runner.run_sync in a background thread to avoid
            # blocking the event loop.
            result = await asyncio.to_thread(Runner.run_sync, agent, prompt)
            return result.final_output

        # Verbose mode: stream events and print tool calls.
        async def _run_verbose():
            turn = 0
            stream = Runner.run_streamed(agent, prompt)
            async for event in stream.stream_events():
                if not isinstance(event, RunItemStreamEvent):
                    continue
                if event.name == "tool_called":
                    turn += 1
                    raw = event.item.raw_item
                    args = getattr(raw, "arguments", "{}")
                    print(f"\n[Turn {turn}] → {raw.name}({args})")
                elif event.name == "tool_output":
                    output = str(event.item.output)
                    preview = output[:200] + "..." if len(output) > 200 else output
                    print(f"  ← {preview}")
            return stream.final_output

        return await _run_verbose()
```
```python
import os
import uuid
import json
import asyncio
from pathlib import Path
from typing import List, Dict, Any, Optional

from agents import Agent, Runner, function_tool
from agents.stream_events import RunItemStreamEvent

from .page_index import page_index
from .page_index_md import md_to_tree
from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
from .utils import remove_fields, create_node_mapping
```
There are several unused imports here (List, Dict, Any, Optional, plus remove_fields / create_node_mapping). Please remove unused imports to keep the module tidy (and avoid failing lint/CI if enabled).
```python
from .page_index_md import md_to_tree
from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
from .client import PageIndexClient
```
Importing PageIndexClient at package import time makes openai-agents (and its agents module) a hard requirement even for users who only want the core indexing utilities. If you want Agents support to be optional, consider moving this import behind a try/except (and raising a helpful error only when PageIndexClient is used) or exposing it via a separate module/extra.
Suggested change:

```python
try:
    # Importing PageIndexClient may require optional dependencies (e.g., openai-agents).
    # Wrap this in a try/except so that importing the core package does not
    # force those optional dependencies to be installed.
    from .client import PageIndexClient
except ImportError as _pageindex_client_import_error:
    class _MissingPageIndexClient:
        """
        Placeholder for PageIndexClient when its optional dependencies are not installed.

        Any attempt to instantiate or otherwise use this object will raise a clear
        ImportError explaining how to enable this functionality.
        """
        def __init__(self, error: ImportError) -> None:
            self._error = error

        def __call__(self, *args, **kwargs):
            raise ImportError(
                "PageIndexClient is unavailable because its optional dependencies "
                "could not be imported. Install the 'openai-agents' extra (or any "
                "other required optional dependencies) to use PageIndexClient."
            ) from self._error

        def __getattr__(self, name):
            raise ImportError(
                "PageIndexClient is unavailable because its optional dependencies "
                "could not be imported. Install the 'openai-agents' extra (or any "
                "other required optional dependencies) to use PageIndexClient."
            ) from self._error

    PageIndexClient = _MissingPageIndexClient(_pageindex_client_import_error)  # type: ignore[assignment]
```
```python
def _parse_pages(pages: str) -> list[int]:
    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints."""
    result = []
    for part in pages.split(','):
        part = part.strip()
        if '-' in part:
            start, end = part.split('-', 1)
            result.extend(range(int(start.strip()), int(end.strip()) + 1))
        else:
            result.append(int(part))
```
_parse_pages() will happily accept reversed ranges like "7-5" (yielding an empty range), non-positive page numbers, and very large ranges that can allocate huge lists. Consider validating start <= end, rejecting values < 1, and enforcing a max range size (returning a clear error JSON) so tool calls fail loudly instead of silently returning empty content or causing memory spikes.
Suggested change:

```python
# Hard cap on how many pages can be requested/expanded in a single call.
# This prevents accidental or malicious requests like "1-1000000000" from
# causing huge in-memory lists to be allocated.
MAX_PAGE_RANGE = 10000

def _parse_pages(pages: str) -> list[int]:
    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints.

    Validation rules:
    - Page numbers must be positive integers (>= 1).
    - Ranges must have start <= end.
    - The total number of pages produced may not exceed MAX_PAGE_RANGE.

    Raises:
        ValueError: If the input is malformed or violates any of the rules above.
    """
    if pages is None:
        raise ValueError("pages specification must be a non-empty string")
    pages_str = str(pages).strip()
    if not pages_str:
        # Empty pages string: treat as no pages requested.
        return []
    result: list[int] = []
    total_count = 0
    for part in pages_str.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start_str, end_str = part.split('-', 1)
            try:
                start = int(start_str.strip())
                end = int(end_str.strip())
            except ValueError as exc:
                raise ValueError(f"Invalid page range '{part}': not an integer") from exc
            if start < 1 or end < 1:
                raise ValueError(f"Invalid page range '{part}': pages must be >= 1")
            if start > end:
                raise ValueError(f"Invalid page range '{part}': start must be <= end")
            range_len = end - start + 1
            if total_count + range_len > MAX_PAGE_RANGE:
                raise ValueError(
                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
                )
            result.extend(range(start, end + 1))
            total_count += range_len
        else:
            try:
                page_num = int(part)
            except ValueError as exc:
                raise ValueError(f"Invalid page number '{part}': not an integer") from exc
            if page_num < 1:
                raise ValueError(f"Invalid page number '{part}': must be >= 1")
            if total_count + 1 > MAX_PAGE_RANGE:
                raise ValueError(
                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
                )
            result.append(page_num)
            total_count += 1
```
| """Extract text for specific PDF pages (1-indexed).""" | ||
| pdf_pages = get_page_tokens(doc_info['path']) | ||
| total = len(pdf_pages) | ||
| results = [] | ||
| for p in page_nums: | ||
| if 1 <= p <= total: | ||
| results.append({'page': p, 'content': pdf_pages[p - 1][0]}) |
For PDFs, _get_pdf_page_content() calls get_page_tokens() which extracts and tokenizes all pages on every tool call, even when only a few pages are requested. This is likely to be a major slowdown when the agent iterates. Consider caching the page list per doc_id during indexing, or extracting only the requested pages (without tokenization) for this tool.
| """Extract text for specific PDF pages (1-indexed).""" | |
| pdf_pages = get_page_tokens(doc_info['path']) | |
| total = len(pdf_pages) | |
| results = [] | |
| for p in page_nums: | |
| if 1 <= p <= total: | |
| results.append({'page': p, 'content': pdf_pages[p - 1][0]}) | |
| """Extract text for specific PDF pages (1-indexed) without tokenizing all pages.""" | |
| path = doc_info['path'] | |
| total_pages = get_number_of_pages(path) | |
| # Keep only valid page numbers within the document range. | |
| valid_pages = [p for p in page_nums if 1 <= p <= total_pages] | |
| if not valid_pages: | |
| return [] | |
| # Fetch text only for the requested pages to avoid processing the entire document. | |
| page_texts = get_text_of_pdf_pages(path, valid_pages) | |
| results = [] | |
| for p, text in zip(valid_pages, page_texts): | |
| results.append({'page': p, 'content': text}) |
```python
r = requests.get(PDF_URL)
with open(PDF_PATH, "wb") as f:
    f.write(r.content)
```
The PDF download does not set a timeout or validate the HTTP response. Add a reasonable timeout and call raise_for_status() (and ideally download with stream=True) to avoid hangs and silently writing error pages to disk.
Suggested change:

```python
with requests.get(PDF_URL, stream=True, timeout=30) as r:
    r.raise_for_status()
    with open(PDF_PATH, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
```
```python
    from .utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
except ImportError:
    from utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
```
These imports include get_text_of_pdf_pages and structure_to_list, but they are not used in this module. Removing unused imports will avoid confusion and keeps the tool module minimal.
Suggested change:

```python
    from .utils import get_page_tokens, get_number_of_pages, remove_fields
except ImportError:
    from utils import get_page_tokens, get_number_of_pages, remove_fields
```
| elif mode == "md" or (mode == "auto" and is_md): | ||
| print(f"Indexing Markdown: {file_path}") | ||
| result = asyncio.run(md_to_tree( | ||
| md_path=file_path, | ||
| if_thinning=False, | ||
| if_add_node_summary='yes', | ||
| summary_token_threshold=200, | ||
| model=self.model, | ||
| if_add_doc_description='yes', | ||
| if_add_node_text='yes', | ||
| if_add_node_id='yes' | ||
| )) | ||
| self.documents[doc_id] = { |
asyncio.run(md_to_tree(...)) will raise RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks / async apps. Consider providing an async index_async() / query_agent_async() API and using await when already inside an event loop (or using a loop-aware helper) so the client works in common interactive environments.
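One loop-aware pattern for the fix suggested here is to fall back to a worker thread when a loop is already running. The sketch below is illustrative only; `run_coro_blocking` and `build_tree` are hypothetical names, not part of this PR:

```python
import asyncio
import concurrent.futures

def run_coro_blocking(coro):
    """Run a coroutine from synchronous code, whether or not a loop is running.

    Outside an event loop this is plain asyncio.run(); inside one (Jupyter,
    async servers) it runs the coroutine on a fresh loop in a worker thread
    instead of raising "asyncio.run() cannot be called from a running event loop".
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def build_tree():
    # Stand-in for md_to_tree(...): returns a dummy result.
    await asyncio.sleep(0)
    return {"title": "root"}

# Works from plain synchronous code...
print(run_coro_blocking(build_tree()))

# ...and from inside an already-running event loop.
async def main():
    return run_coro_blocking(build_tree())

print(asyncio.run(main()))
```

Note that the worker-thread branch still blocks the caller's event loop while waiting for `.result()`, so a dedicated async API (as the comment suggests) remains the cleaner fix.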
What this PR adds
The upstream library provides page_index() and md_to_tree() for building document tree structures, but has no retrieval or QA layer. This PR adds that layer.

New: pageindex/retrieve.py — 3 retrieval tool functions

Three functions that expose structured document access:

- tool_get_document(documents, doc_id) — metadata (name, description, type, page count)
- tool_get_document_structure(documents, doc_id) — full tree JSON without text (token-efficient)
- tool_get_page_content(documents, doc_id, pages) — page text by range ("5-7", "3,8", "12")

Works with both PDF (page numbers) and Markdown (line numbers).
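The JSON-in/JSON-out contract of these tools can be pictured with a toy sketch. The registry layout and field names below are illustrative assumptions, not the PR's actual implementation:

```python
import json

# Toy in-memory registry; the real tools read the client's indexed documents.
documents = {
    "doc-1": {
        "name": "DeepSeek-R1",
        "type": "pdf",
        "page_count": 3,
        "pages": ["page one text", "page two text", "page three text"],
    }
}

def tool_get_document(documents: dict, doc_id: str) -> str:
    """Return document metadata as a JSON string."""
    doc = documents[doc_id]
    return json.dumps({k: doc[k] for k in ("name", "type", "page_count")})

def tool_get_page_content(documents: dict, doc_id: str, pages: str) -> str:
    """Return text for a page spec like '5-7', '3,8', or '12' (1-indexed)."""
    nums = []
    for part in pages.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            nums.extend(range(int(start), int(end) + 1))
        else:
            nums.append(int(part))
    doc = documents[doc_id]
    return json.dumps([{"page": n, "content": doc["pages"][n - 1]} for n in nums])

print(tool_get_document(documents, "doc-1"))
print(tool_get_page_content(documents, "doc-1", "1-2"))
```

Returning JSON strings (rather than Python objects) keeps the tool outputs directly consumable by an LLM agent runtime.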
New: pageindex/client.py — PageIndexClient

High-level SDK client:

- index(file_path) — index a PDF or Markdown file, returns doc_id
- query_agent(doc_id, prompt, verbose=False) — runs an OpenAI Agents SDK agent that calls the 3 tools autonomously to answer the question
- query(doc_id, prompt) / query_stream(doc_id, prompt) — convenience wrappers
- workspace parameter for JSON-based persistence across sessions

Demo: OpenAI Agents SDK
The agent navigates the document structure itself — no manual retrieval logic needed:
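A dependency-free sketch of that idea: closures bind doc_id into per-document tools, and an agent loop (simulated here with a fixed call sequence) picks which tool to call next. The helper names are illustrative, not the PR's API:

```python
import json

def make_tools(documents: dict, doc_id: str) -> dict:
    """Bind doc_id into zero-argument-friendly tools, mirroring how the
    client's @function_tool closures capture the document (names illustrative)."""
    def get_document() -> str:
        meta = {k: v for k, v in documents[doc_id].items() if k != "pages"}
        return json.dumps(meta)

    def get_page_content(pages: str) -> str:
        nums = [int(p) for p in pages.split(",")]
        return json.dumps([documents[doc_id]["pages"][n - 1] for n in nums])

    return {"get_document": get_document, "get_page_content": get_page_content}

documents = {"doc-1": {"name": "DeepSeek-R1", "pages": ["intro", "method", "results"]}}
tools = make_tools(documents, "doc-1")

# Simulated agent trajectory: inspect metadata, then fetch a page.
for name, args in [("get_document", ()), ("get_page_content", ("2",))]:
    print(f"→ {name}{args}: {tools[name](*args)}")
```

In the real client, the OpenAI Agents SDK decides this trajectory itself based on the question and the tool docstrings.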
With verbose=True, each tool call (name, args, result preview) is printed in real time.

Test plan
- pip install openai-agents
- python test_client.py — downloads DeepSeek-R1 PDF, indexes it, runs agent query
- client.query_agent(doc_id, "...", verbose=True) — observe tool call sequence
- PageIndexClient(workspace=...) — query works without re-indexing
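The workspace behavior in the last step amounts to JSON-based persistence of indexed documents. A minimal sketch of the idea, with a one-file-per-doc_id layout that is an assumption rather than the PR's actual format:

```python
import json
import tempfile
from pathlib import Path

class Workspace:
    """Persist indexed documents as one JSON file per doc_id so a later
    session can query without re-indexing (layout illustrative)."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, doc_id: str, record: dict) -> None:
        (self.root / f"{doc_id}.json").write_text(json.dumps(record))

    def load(self, doc_id: str) -> dict:
        return json.loads((self.root / f"{doc_id}.json").read_text())

root = tempfile.mkdtemp()
Workspace(root).save("doc-1", {"name": "DeepSeek-R1", "type": "pdf"})

# A "second session" reloads the record from disk.
print(Workspace(root).load("doc-1"))
```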