
Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK#125

Open
KylinMountain wants to merge 5 commits into VectifyAI:main from KylinMountain:feat/retrieve

Conversation

@KylinMountain
Collaborator

@KylinMountain KylinMountain commented Feb 28, 2026

What this PR adds

The upstream library provides page_index() and md_to_tree() for building document tree structures, but has no retrieval or QA layer. This PR adds that layer.

New: pageindex/retrieve.py — 3 retrieval tool functions

Three functions that expose structured document access:

  • tool_get_document(documents, doc_id) — metadata (name, description, type, page count)
  • tool_get_document_structure(documents, doc_id) — full tree JSON without text (token-efficient)
  • tool_get_page_content(documents, doc_id, pages) — page text by range ("5-7", "3,8", "12")

Works with both PDF (page numbers) and Markdown (line numbers).
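The page-string format accepted by tool_get_page_content can be sketched as a small standalone parser (a hypothetical illustration; the PR's actual implementation lives in pageindex/retrieve.py and may differ in details such as validation):

```python
def parse_pages(pages: str) -> list[int]:
    """Expand a spec like '5-7', '3,8', or '12' into a sorted list of page numbers."""
    result: set[int] = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            # A range like '5-7' expands to every page from start to end, inclusive.
            start, end = (int(x) for x in part.split("-", 1))
            result.update(range(start, end + 1))
        else:
            result.add(int(part))
    return sorted(result)

print(parse_pages("5-7"))   # [5, 6, 7]
print(parse_pages("3,8"))   # [3, 8]
```

For Markdown documents the same format would address line numbers instead of page numbers.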

New: pageindex/client.py — PageIndexClient

High-level SDK client:

  • index(file_path) — index a PDF or Markdown file, returns doc_id
  • query_agent(doc_id, prompt, verbose=False) — runs an OpenAI Agents SDK agent that calls the 3 tools autonomously to answer the question
  • query(doc_id, prompt) / query_stream(doc_id, prompt) — convenience wrappers
  • workspace parameter for JSON-based persistence across sessions
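The workspace persistence described above could work along these lines (a minimal sketch under an assumed file layout; the PR's actual on-disk format is not shown in this description):

```python
import json
from pathlib import Path


def save_documents(workspace: str, documents: dict) -> None:
    """Persist the doc_id -> metadata registry so a later session can reload it."""
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    (ws / "documents.json").write_text(json.dumps(documents, indent=2))


def load_documents(workspace: str) -> dict:
    """Reload the registry; an empty or missing workspace yields an empty registry."""
    path = Path(workspace) / "documents.json"
    return json.loads(path.read_text()) if path.exists() else {}
```

A reloaded client would then answer queries against previously indexed documents without re-running page_index().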

Demo: OpenAI Agents SDK

The agent navigates the document structure itself — no manual retrieval logic needed:

query_agent(doc_id, "What are the conclusions?")
  Turn 1: get_document()            → confirms status and page count
  Turn 2: get_document_structure()  → reads tree to find relevant sections
  Turn 3: get_page_content("10-13") → fetches targeted page content
  Turn 4: synthesizes answer

With verbose=True, each tool call (name, args, result preview) is printed in real time.

Test plan

  • pip install openai-agents
  • python test_client.py — downloads DeepSeek-R1 PDF, indexes it, runs agent query
  • client.query_agent(doc_id, "...", verbose=True) — observe tool call sequence
  • Restart Python, reload PageIndexClient(workspace=...) — query works without re-indexing

KylinMountain and others added 5 commits February 28, 2026 18:19
- Add PageIndexClient with index/retrieve/query workflow
- Add workspace parameter for automatic JSON-based persistence
- Add query_stream() for token-level streaming output
- Add ChatGPT_API_stream() generator in utils.py
- Add test_client.py demo using DeepSeek-R1 paper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add tests/pdfs/*.pdf to .gitignore (PDF is downloaded by test_client.py)
- Add verbose=True to query_agent(): streams tool calls with name/args/output
- Fix asyncio usage for run_streamed() (not an async context manager)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@KylinMountain KylinMountain changed the title "Replace retrieve() with 3-tool OpenAI Agents SDK agent" to "Add retrieve function" Feb 28, 2026
@KylinMountain
Collaborator Author

Test result:

python test_client.py                     
Loaded 2 document(s) from workspace.
============================================================
Step 1: Indexing PDF and inspecting tree structure
============================================================
Indexing PDF: tests/pdfs/deepseek-r1.pdf
Parsing PDF...
start find_toc_pages



no toc found
process_no_toc
start_index: 1
divide page_list to groups 5
start generate_toc_init
start generate_toc_continue
start generate_toc_continue
start generate_toc_continue
start generate_toc_continue
Document validation: 86 pages, max allowed index: 86
start verify_toc
check all items
accuracy: 70.77%
start fix_incorrect_toc
Fixing 19 incorrect results
start fix_incorrect_toc with 19 incorrect results
Fixing 13 incorrect results
start fix_incorrect_toc with 13 incorrect results
Fixing 12 incorrect results
start fix_incorrect_toc with 12 incorrect results


Indexing complete. Document ID: d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6

Document ID: d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6

Tree Structure:
[0000] Introduction  —  The partial document discusses the development of reasoning ...
[0001] DeepSeek-R1-Zero  —  The partial document discusses the development of reasoning ...
  [0002] Group Relative Policy Optimization  —  The partial document discusses the development of reasoning ...
  [0003] Reward Design  —  The partial document describes the GRPO optimization method ...
  [0004] Incentivize Reasoning Capability in LLMs  —  The partial document describes the training and development ...
[0005] DeepSeek-R1  —  The partial document describes the development and multi-sta...
  [0006] Model-based Rewards  —  The partial document describes the development and training ...
  [0007] Training Details  —  The partial document discusses the training and methodology ...
    [0008] Training Details of the First RL Stage  —  The partial document describes the training and evaluation p...
    [0009] Training Details of the Second RL Stage  —  The partial document discusses the training and evaluation o...
[0010] Experiment  —  The partial document discusses the training and evaluation o...
[0011] Ethics and Safety Statement  —  The partial document discusses the ethical and safety consid...
[0012] Conclusion, Limitation, and Future Work  —  The partial document discusses the ethical considerations, s...
[0013] Author List  —  The partial document discusses the challenges and advancemen...
[0014] Appendix  —  The partial document provides an overview of DeepSeek V3, an...
[0015] Background  —  The partial document provides an overview of DeepSeek V3, an...
  [0016] DeepSeek-V3  —  The partial document provides an overview of DeepSeek V3, an...
  [0017] Conventional Post-Training Paradigm  —  The partial document provides an overview of DeepSeek-V3, an...
  [0018] A Comparison of GRPO and PPO  —  The partial document discusses the strengths and limitations...
[0019] Training Details  —  The partial document describes a reinforcement learning (RL)...
  [0020] RL Infrastructure  —  The partial document describes a reinforcement learning (RL)...
  [0021] Reward Model Prompt  —  The partial document covers the following main points: 1. *...
  [0022] Data Recipe  —  The partial document provides an overview of Reinforcement L...
    [0023] RL Data  —  The partial document provides a detailed description of Rein...
    [0024] DeepSeek-R1 Cold Start  —  The partial document describes the development and evaluatio...
    [0025] 800K Supervised Data  —  The partial document covers examples of solving basic arithm...
    [0026] SFT Data Statistics  —  The partial document primarily discusses the use of DeepSeek...
    [0027] Examples of SFT Trajectories  —  The partial document discusses the design principles and met...
  [0028] Hyper-Parameters  —  The partial document covers two main sections. The first sec...
    [0029] Hyper-Parameters of DeepSeek-R1-Zero-Qwen-32B  —  The partial document covers two main sections. The first sec...
    [0030] Hyper-Parameters of SFT  —  The partial document covers the following main points: 1. *...
    [0031] Hyper-Parameters of Distillation  —  The partial document covers the following main points: 1. *...
    [0032] Training Cost  —  The partial document discusses the phenomenon of reward hack...
  [0033] Reward Hacking  —  The partial document discusses the main points related to it...
  [0034] Ablation Study of Language Consistency Reward  —  The partial document covers the following main points: 1. *...
[0035] Self-Evolution of DeepSeek-R1-Zero  —  The partial document covers the following main points: 1. *...
  [0036] Evolution of Reasoning Capability in DeepSeek-R1-Zero during Training  —  The partial document discusses the main points related to it...
  [0037] Evolution of Advanced Reasoning Behaviors in DeepSeek-R1-Zero during Training  —  The partial document provides detailed insights into the tra...
[0038] Evaluation of DeepSeek-R1  —  The partial document discusses the evaluation of reasoning b...
  [0039] Experiment Setup  —  The partial document covers the evaluation of the DeepSeek-R...
  [0040] Main Results  —  The partial document provides an evaluation of the performan...
  [0041] DeepSeek-R1 Safety Report  —  The partial document discusses the performance and safety ev...
    [0042] Risk Control System for DeepSeek-R1  —  The partial document focuses on the safety assessment of the...
    [0043] R1 Safety Evaluation on Standard Benchmarks  —  The partial document focuses on the safety assessment and ri...
    [0044] Safety Taxonomic Study of R1 on In-House Benchmark  —  The partial document provides a detailed comparison of the D...
    [0045] Multilingual Safety Performance  —  The partial document focuses on evaluating the multilingual ...
    [0046] Robustness against Jailbreaking  —  The partial document provides a comparative analysis of the ...
[0047] More Analysis  —  The partial document provides a comparative analysis of two ...
  [0048] Performance Comparison with DeepSeek-V3  —  The partial document provides a comparative analysis of two ...
  [0049] Generalization to Real-World Competitions  —  The partial document provides an analysis of the performance...
  [0050] Mathematical Capabilities Breakdown by Categories  —  The partial document discusses the performance and computati...
  [0051] An Analysis on CoT Length  —  The partial document discusses the performance and computati...
  [0052] Performance of Each Stage on Problems of Varying Difficulty  —  The partial document discusses the limitations of majority v...
[0053] DeepSeek-R1 Distillation  —  The partial document discusses the limitations of majority v...
  [0054] Distillation v.s. Reinforcement Learning  —  The partial document discusses the effectiveness of distilla...
[0055] Discussion  —  The partial document discusses the main points related to it...
  [0056] Key Findings  —  The partial document discusses the evaluation and advancemen...
  [0057] Unsuccessful Attempts  —  The partial document discusses advanced methods for improvin...
[0058] Related Work  —  The partial document discusses the main points related to it...
  [0059] Chain-of-thought Reasoning  —  The partial document provides an in-depth analysis of the pe...
  [0060] Scaling Inference-time Compute  —  The partial document discusses various evaluation benchmarks...
  [0061] Reinforcement Learning for Reasoning Enhancement  —  The partial document discusses two evaluation benchmarks for...
[0062] Open Weights, Code, and Data  —  The partial document discusses various evaluation benchmarks...
[0063] Evaluation Prompts and Settings  —  The partial document provides an overview of various evaluat...
[0064] References  —  The partial document primarily covers advancements and resea...

============================================================
Step 2: Document Metadata (get_document)
============================================================
{"doc_id": "d1bdc439-abc5-48c1-bbaa-ffa78b3d5aa6", "doc_name": "deepseek-r1.pdf", "doc_description": "A comprehensive document detailing the development, training, evaluation, and safety considerations of the DeepSeek-R1 model and its variants, focusing on enhancing reasoning capabilities in large language models through reinforcement learning, supervised fine-tuning, and distillation, while addressing challenges like reward hacking, language consistency, and ethical risks.", "type": "pdf", "status": "completed", "page_count": 86}

============================================================
Step 3: Agent Query (auto tool-use)
============================================================

Question: 'What are the main conclusions of this paper?'

Answer:
The main conclusions of the paper are:

1. **Reasoning Potential and Training Strategy**:
   - DeepSeek-R1 models demonstrate strong reasoning abilities, emerging organically during reinforcement learning (RL) phases without extensive reliance on human annotation.
   - The key to unlocking this reasoning potential lies in providing hard problems, a reliable verifier, and ample computational resources for RL, rather than large-scale human annotation.

2. **Future Potential**:
   - With advanced RL techniques, AI systems like DeepSeek-R1 show potential to surpass human capabilities in tasks effectively evaluated by verifiers.
   - Integrating tools such as search engines, calculators, or real-world validation tools could greatly enhance reasoning capabilities and solution accuracy.

3. **Limitations**:
   - Structural Outputs: Current suboptimal outputs and lack of tool usage leave room for improvement.
   - Token Efficiency: Issues like overthinking persist, requiring optimization.
   - Language Mixing: The model struggles with languages outside English and Chinese.
   - Prompt Sensitivity: Few-shot prompting degrades performance, thus zero-shot prompting is recommended.
   - Software Engineering: Limited RL application in software engineering tasks leads to poor benchmarking in this domain.
   - RL Challenges: Reward hacking and difficulties in defining reliable reward structures remain significant challenges.

4. **Research Opportunities**:
   - Addressing the limitations in structural outputs, token usage, language proficiency, and reward modeling will be key areas of focus in future iterations of the model.

============================================================
Step 4: Persistence — reload without re-indexing
============================================================
Loaded 3 document(s) from workspace.
Answer from reloaded client:
The main conclusions of the paper are:

1. **DeepSeek-R1 Improvement**: DeepSeek-R1 achieves significant advancements in reasoning through large-scale reinforcement learning (RL), unlocking sophisticated reasoning behaviors like self-verification and reflection.

2. **Role of RL**: The key to high-level reasoning lies in hard reasoning questions, reliable verifiers, and sufficient computational resources, rather than extensive human annotations.

3. **Limitations**:
   - **Structural Output and Tool Use**: Inferior structural output and lack of tool integration (e.g., search engines, calculators).
   - **Token Efficiency**: Occasional inefficiencies, with instances of overthinking on simpler tasks.
   - **Language Mixing**: Optimized for Chinese and English, causing language inconsistencies for other languages.
   - **Prompt Sensitivity**: Struggles with few-shot prompting; zero-shot is recommended.
   - **Software Engineering**: Limited RL application to software engineering tasks due to inefficiency.

4. **Challenges in Pure RL**:
   - Reliance on robust reward models, which are difficult to construct for tasks like writing.
   - Reward hacking risks, where models exploit reward functions rather than solving tasks effectively.

5. **Future Directions**:
   - Tools integration (e.g., compilers, search engines) to enhance reasoning and solutions.
   - Development of reliable reward structures for tasks with less objective verifiability.

DeepSeek-R1 demonstrates the potential of RL to surpass human capabilities in certain domains if supplied with effective verifiers and reliable feedback mechanisms.

Persistence verified. ✓

@KylinMountain KylinMountain changed the title "Add retrieve function" to "Add PageIndexClient with agent-based retrieval via OpenAI Agents SDK" Feb 28, 2026
@BukeLy BukeLy requested a review from Copilot February 28, 2026 17:04

Copilot AI left a comment


Pull request overview

Adds a retrieval + QA layer on top of the existing PageIndex tree builders by introducing tool-style retrieval functions and a high-level PageIndexClient that uses the OpenAI Agents SDK to autonomously navigate document structure and fetch relevant page/line content for answering questions.

Changes:

  • Added pageindex/retrieve.py with 3 JSON-returning retrieval tools: document metadata, token-efficient structure, and page/line content retrieval.
  • Added pageindex/client.py implementing PageIndexClient (indexing, workspace persistence, and agent-driven querying).
  • Added a runnable demo script (test_client.py) and updated exports/dependencies (pageindex/__init__.py, requirements.txt).

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.

File Description
test_client.py Demo script that downloads a PDF, indexes it, and runs agent queries including workspace reload.
requirements.txt Adds openai-agents dependency for the agent-based client.
pageindex/utils.py Adds streaming helper and tree/node printing/mapping utilities.
pageindex/retrieve.py Implements the 3 retrieval tool functions (metadata, structure, page/line content).
pageindex/client.py Introduces PageIndexClient with indexing, persistence, and OpenAI Agents SDK integration.
pageindex/__init__.py Exposes retrieval tools and PageIndexClient at package top-level.


Comment on lines +193 to +194
return asyncio.run(_run_verbose())


Copilot AI Feb 28, 2026


Verbose mode also calls asyncio.run(_run_verbose()), which has the same event-loop reentrancy problem and will break in Jupyter/async servers. Making query_agent async (or offering a separate async entrypoint) would avoid this runtime failure.

Suggested change
-        return asyncio.run(_run_verbose())
+        # In synchronous contexts without a running event loop, it is safe to use asyncio.run.
+        # If an event loop is already running (e.g., Jupyter or an async server), instruct
+        # callers to use the async_query_agent coroutine instead to avoid event-loop reentrancy.
+        try:
+            asyncio.get_running_loop()
+        except RuntimeError:
+            return asyncio.run(_run_verbose())
+        else:
+            raise RuntimeError(
+                "query_agent(verbose=True) cannot be used from within an existing asyncio "
+                "event loop. Use 'await async_query_agent(..., verbose=True)' instead."
+            )
+
+    async def async_query_agent(self, doc_id: str, prompt: str, verbose: bool = False) -> str:
+        """
+        Async variant of query_agent.
+
+        This coroutine can be safely awaited from within an existing asyncio event loop,
+        including Jupyter notebooks and async servers.
+        """
+        client_self = self
+
+        @function_tool
+        def get_document() -> str:
+            """Get document metadata: status, page count, name, and description."""
+            return client_self.get_document(doc_id)
+
+        @function_tool
+        def get_document_structure() -> str:
+            """Get the document's full tree structure (without text) to find relevant sections."""
+            return client_self.get_document_structure(doc_id)
+
+        @function_tool
+        def get_page_content(pages: str) -> str:
+            """
+            Get the text content of specific pages or line numbers.
+            Use tight ranges: e.g. '5-7' for pages 5 to 7, '3,8' for pages 3 and 8, '12' for page 12.
+            For Markdown documents, use line numbers from the structure's line_num field.
+            """
+            return client_self.get_page_content(doc_id, pages)
+
+        agent = Agent(
+            name="PageIndex",
+            instructions=AGENT_SYSTEM_PROMPT,
+            tools=[get_document, get_document_structure, get_page_content],
+            model=self.model,
+        )
+
+        if not verbose:
+            # Run the synchronous Runner.run_sync in a background thread to avoid
+            # blocking the event loop.
+            result = await asyncio.to_thread(Runner.run_sync, agent, prompt)
+            return result.final_output
+
+        # verbose mode: stream events and print tool calls
+        async def _run_verbose():
+            turn = 0
+            stream = Runner.run_streamed(agent, prompt)
+            async for event in stream.stream_events():
+                if not isinstance(event, RunItemStreamEvent):
+                    continue
+                if event.name == "tool_called":
+                    turn += 1
+                    raw = event.item.raw_item
+                    args = getattr(raw, "arguments", "{}")
+                    print(f"\n[Turn {turn}] → {raw.name}({args})")
+                elif event.name == "tool_output":
+                    output = str(event.item.output)
+                    preview = output[:200] + "..." if len(output) > 200 else output
+                    print(f"    ← {preview}")
+            return stream.final_output
+
+        return await _run_verbose()

Comment on lines +1 to +15
import os
import uuid
import json
import asyncio
from pathlib import Path
from typing import List, Dict, Any, Optional

from agents import Agent, Runner, function_tool
from agents.stream_events import RunItemStreamEvent

from .page_index import page_index
from .page_index_md import md_to_tree
from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
from .utils import remove_fields, create_node_mapping


Copilot AI Feb 28, 2026


There are several unused imports here (List, Dict, Any, Optional, plus remove_fields / create_node_mapping). Please remove unused imports to keep the module tidy (and avoid failing lint/CI if enabled).

-from .page_index_md import md_to_tree (no newline at end of file)
+from .page_index_md import md_to_tree
+from .retrieve import tool_get_document, tool_get_document_structure, tool_get_page_content
+from .client import PageIndexClient

Copilot AI Feb 28, 2026


Importing PageIndexClient at package import time makes openai-agents (and its agents module) a hard requirement even for users who only want the core indexing utilities. If you want Agents support to be optional, consider moving this import behind a try/except (and raising a helpful error only when PageIndexClient is used) or exposing it via a separate module/extra.

Suggested change
-from .client import PageIndexClient
+try:
+    # Importing PageIndexClient may require optional dependencies (e.g., openai-agents).
+    # Wrap this in a try/except so that importing the core package does not
+    # force those optional dependencies to be installed.
+    from .client import PageIndexClient
+except ImportError as _pageindex_client_import_error:
+    class _MissingPageIndexClient:
+        """
+        Placeholder for PageIndexClient when its optional dependencies are not installed.
+
+        Any attempt to instantiate or otherwise use this object will raise a clear
+        ImportError explaining how to enable this functionality.
+        """
+
+        def __init__(self, error: ImportError) -> None:
+            self._error = error
+
+        def __call__(self, *args, **kwargs):
+            raise ImportError(
+                "PageIndexClient is unavailable because its optional dependencies "
+                "could not be imported. Install the 'openai-agents' extra (or any "
+                "other required optional dependencies) to use PageIndexClient."
+            ) from self._error
+
+        def __getattr__(self, name):
+            raise ImportError(
+                "PageIndexClient is unavailable because its optional dependencies "
+                "could not be imported. Install the 'openai-agents' extra (or any "
+                "other required optional dependencies) to use PageIndexClient."
+            ) from self._error
+
+    PageIndexClient = _MissingPageIndexClient(_pageindex_client_import_error)  # type: ignore[assignment]

Comment on lines +11 to +20
def _parse_pages(pages: str) -> list[int]:
    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints."""
    result = []
    for part in pages.split(','):
        part = part.strip()
        if '-' in part:
            start, end = part.split('-', 1)
            result.extend(range(int(start.strip()), int(end.strip()) + 1))
        else:
            result.append(int(part))

Copilot AI Feb 28, 2026


_parse_pages() will happily accept reversed ranges like "7-5" (yielding an empty range), non-positive page numbers, and very large ranges that can allocate huge lists. Consider validating start <= end, rejecting values < 1, and enforcing a max range size (returning a clear error JSON) so tool calls fail loudly instead of silently returning empty content or causing memory spikes.

Suggested change
-def _parse_pages(pages: str) -> list[int]:
-    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints."""
-    result = []
-    for part in pages.split(','):
-        part = part.strip()
-        if '-' in part:
-            start, end = part.split('-', 1)
-            result.extend(range(int(start.strip()), int(end.strip()) + 1))
-        else:
-            result.append(int(part))
+# Hard cap on how many pages can be requested/expanded in a single call.
+# This prevents accidental or malicious requests like "1-1000000000" from
+# causing huge in-memory lists to be allocated.
+MAX_PAGE_RANGE = 10000
+
+def _parse_pages(pages: str) -> list[int]:
+    """Parse a pages string like '5-7', '3,8', or '12' into a sorted list of ints.
+
+    Validation rules:
+      - Page numbers must be positive integers (>= 1).
+      - Ranges must have start <= end.
+      - The total number of pages produced may not exceed MAX_PAGE_RANGE.
+
+    Raises:
+        ValueError: If the input is malformed or violates any of the rules above.
+    """
+    if pages is None:
+        raise ValueError("pages specification must be a non-empty string")
+    pages_str = str(pages).strip()
+    if not pages_str:
+        # Empty pages string: treat as no pages requested.
+        return []
+    result: list[int] = []
+    total_count = 0
+    for part in pages_str.split(','):
+        part = part.strip()
+        if not part:
+            continue
+        if '-' in part:
+            start_str, end_str = part.split('-', 1)
+            try:
+                start = int(start_str.strip())
+                end = int(end_str.strip())
+            except ValueError as exc:
+                raise ValueError(f"Invalid page range '{part}': not an integer") from exc
+            if start < 1 or end < 1:
+                raise ValueError(f"Invalid page range '{part}': pages must be >= 1")
+            if start > end:
+                raise ValueError(f"Invalid page range '{part}': start must be <= end")
+            range_len = end - start + 1
+            if total_count + range_len > MAX_PAGE_RANGE:
+                raise ValueError(
+                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
+                )
+            result.extend(range(start, end + 1))
+            total_count += range_len
+        else:
+            try:
+                page_num = int(part)
+            except ValueError as exc:
+                raise ValueError(f"Invalid page number '{part}': not an integer") from exc
+            if page_num < 1:
+                raise ValueError(f"Invalid page number '{part}': must be >= 1")
+            if total_count + 1 > MAX_PAGE_RANGE:
+                raise ValueError(
+                    f"Requested pages exceed maximum allowed ({MAX_PAGE_RANGE})"
+                )
+            result.append(page_num)
+            total_count += 1

Comment on lines +43 to +49
    """Extract text for specific PDF pages (1-indexed)."""
    pdf_pages = get_page_tokens(doc_info['path'])
    total = len(pdf_pages)
    results = []
    for p in page_nums:
        if 1 <= p <= total:
            results.append({'page': p, 'content': pdf_pages[p - 1][0]})

Copilot AI Feb 28, 2026


For PDFs, _get_pdf_page_content() calls get_page_tokens() which extracts and tokenizes all pages on every tool call, even when only a few pages are requested. This is likely to be a major slowdown when the agent iterates. Consider caching the page list per doc_id during indexing, or extracting only the requested pages (without tokenization) for this tool.

Suggested change
-    """Extract text for specific PDF pages (1-indexed)."""
-    pdf_pages = get_page_tokens(doc_info['path'])
-    total = len(pdf_pages)
-    results = []
-    for p in page_nums:
-        if 1 <= p <= total:
-            results.append({'page': p, 'content': pdf_pages[p - 1][0]})
+    """Extract text for specific PDF pages (1-indexed) without tokenizing all pages."""
+    path = doc_info['path']
+    total_pages = get_number_of_pages(path)
+    # Keep only valid page numbers within the document range.
+    valid_pages = [p for p in page_nums if 1 <= p <= total_pages]
+    if not valid_pages:
+        return []
+    # Fetch text only for the requested pages to avoid processing the entire document.
+    page_texts = get_text_of_pdf_pages(path, valid_pages)
+    results = []
+    for p, text in zip(valid_pages, page_texts):
+        results.append({'page': p, 'content': text})

Comment on lines +22 to +24
r = requests.get(PDF_URL)
with open(PDF_PATH, "wb") as f:
    f.write(r.content)

Copilot AI Feb 28, 2026


The PDF download does not set a timeout or validate the HTTP response. Add a reasonable timeout and call raise_for_status() (and ideally download with stream=True) to avoid hangs and silently writing error pages to disk.

Suggested change
-r = requests.get(PDF_URL)
-with open(PDF_PATH, "wb") as f:
-    f.write(r.content)
+with requests.get(PDF_URL, stream=True, timeout=30) as r:
+    r.raise_for_status()
+    with open(PDF_PATH, "wb") as f:
+        for chunk in r.iter_content(chunk_size=8192):
+            if chunk:  # filter out keep-alive chunks
+                f.write(chunk)

Comment on lines +4 to +6
    from .utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
except ImportError:
    from utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list

Copilot AI Feb 28, 2026


These imports include get_text_of_pdf_pages and structure_to_list, but they are not used in this module. Removing unused imports will avoid confusion and keeps the tool module minimal.

Suggested change
-    from .utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
-except ImportError:
-    from utils import get_page_tokens, get_text_of_pdf_pages, get_number_of_pages, remove_fields, structure_to_list
+    from .utils import get_page_tokens, get_number_of_pages, remove_fields
+except ImportError:
+    from utils import get_page_tokens, get_number_of_pages, remove_fields

Comment on lines +76 to +88
elif mode == "md" or (mode == "auto" and is_md):
    print(f"Indexing Markdown: {file_path}")
    result = asyncio.run(md_to_tree(
        md_path=file_path,
        if_thinning=False,
        if_add_node_summary='yes',
        summary_token_threshold=200,
        model=self.model,
        if_add_doc_description='yes',
        if_add_node_text='yes',
        if_add_node_id='yes'
    ))
    self.documents[doc_id] = {
self.documents[doc_id] = {

Copilot AI Feb 28, 2026


asyncio.run(md_to_tree(...)) will raise RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks / async apps. Consider providing an async index_async() / query_agent_async() API and using await when already inside an event loop (or using a loop-aware helper) so the client works in common interactive environments.
