fix: scan_full_page captures all content on virtual-scroll pages (#731) #1853

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-scan-full-page-virtual-scroll-731
Conversation

@hafezparast
Contributor

Summary

  • _handle_full_page_scan: installs a MutationObserver before scrolling to capture DOM elements that are removed during scroll (virtual scroll recycling). After scrolling, deduplicates against still-visible elements and re-injects accumulated content into a hidden <div> so page.content() includes everything.
  • _handle_virtual_scroll: falls back to window.scrollBy() when container.scrollTop has no effect (window-level scrolling, e.g. Twitter/X). Updates end-of-scroll detection accordingly.
  • Non-virtual-scroll pages are unaffected: if no elements are removed during scrolling, no injection occurs.
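
The deduplicate-then-re-inject step can be sketched in plain Python. This is a minimal illustration, not the PR's code: the real logic runs in the page as JavaScript, and the function names, the container id, and the choice to key elements on their serialized HTML are all assumptions made for this example.

```python
def merge_removed_elements(removed, visible):
    """Return removed elements that are not still visible.

    Both arguments are lists of serialized element HTML (an assumption:
    the PR may identify elements differently). First-seen order is kept
    and repeats are dropped.
    """
    seen = set(visible)
    merged = []
    for html in removed:
        if html not in seen:
            seen.add(html)
            merged.append(html)
    return merged


def build_hidden_container(elements, container_id="captured-virtual-scroll"):
    """Wrap the merged elements in a hidden <div>.

    Returns an empty string when nothing was removed, mirroring the
    'no injection occurs' behaviour described for static pages.
    The container id is hypothetical.
    """
    if not elements:
        return ""
    return f'<div id="{container_id}" style="display:none">{"".join(elements)}</div>'


# Items 1 and 2 were recycled (item 1 twice); item 5 was removed and re-added.
removed = ["<li>1</li>", "<li>2</li>", "<li>1</li>", "<li>5</li>"]
visible = ["<li>5</li>", "<li>6</li>"]
merged = merge_removed_elements(removed, visible)
print(build_hidden_container(merged))
```

Keying on serialized HTML is the simplest choice for a sketch; it would treat two genuinely distinct elements with identical markup as duplicates, so a real implementation may need a stronger identity.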

Before fix: scan_full_page=True on a 50-item virtual-scroll page → only 10 items (last batch)
After fix: all 50 items captured
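
The before/after counts can be reproduced with a toy model of a recycling list. The sketch below is a simulation under stated assumptions (items are integers, the page shows a fixed window of 10 and swaps the whole window on each scroll step); it does not drive a browser and is not the PR's test.

```python
def simulate_scroll(total_items=50, window=10):
    """Simulate scrolling a list that recycles its DOM nodes.

    Returns (snapshot_only, with_observer):
    - snapshot_only: what a single capture after scrolling would see
    - with_observer: recorded removals plus whatever is still visible
    """
    visible = list(range(window))
    removed = []
    for start in range(window, total_items, window):
        # The current window is removed from the DOM; an observer records it.
        removed.extend(visible)
        visible = list(range(start, min(start + window, total_items)))

    snapshot_only = visible
    already_recorded = set(removed)
    with_observer = removed + [i for i in visible if i not in already_recorded]
    return snapshot_only, with_observer


before, after = simulate_scroll()
print(len(before), len(after))  # 10 50
```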

Fixes #731, related to #1087

Test plan

  • Reproduction test (tests/test_repro_731.py) — local virtual-scroll page with 50 items, verifies all 50 captured
  • 13 adversarial tests covering: virtual scroll (50/100 items), static pages (no false positives), empty pages, append-based infinite scroll, nested virtual scroll with header/footer preservation, low max_steps, observer cleanup, element ordering, hidden container verification, deduplication
  • Full regression suite: 303 passed, 0 regressions (1 pre-existing HuggingFace failure unrelated)

🤖 Generated with Claude Code

…n virtual-scroll pages (unclecode#731)

scan_full_page previously captured HTML only once after scrolling completed,
so pages that recycle DOM elements (Twitter/X, Xiaohongshu, etc.) only
returned the last visible batch. VirtualScrollConfig had a similar gap:
it used container.scrollTop which has no effect when the window is the
scrollable element.

_handle_full_page_scan:
- Install a MutationObserver before scrolling that captures every element
  removed from the DOM (i.e. recycled by virtual scroll).
- After scrolling, deduplicate the accumulated elements against what is
  still visible, then re-inject them into a hidden container so that the
  subsequent page.content() call includes everything.
- Clean up the observer on both success and error paths.

_handle_virtual_scroll:
- Fall back to window.scrollBy() when container.scrollTop has no effect
  (window-level scrolling, e.g. Twitter).
- Update end-of-scroll detection to check window position when using the
  window fallback.
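
A rough model of the fallback decision, written in Python for illustration only. ScrollTarget and scroll_step are hypothetical names; the actual change operates on container.scrollTop and window.scrollBy() inside the page, and its exact end-of-scroll condition may differ from this sketch.

```python
from dataclasses import dataclass


@dataclass
class ScrollTarget:
    """Stand-in for a scrollable thing (a container element or the window)."""
    position: int = 0
    max_position: int = 0  # 0 means the target cannot scroll

    def scroll_by(self, delta):
        """Scroll by delta, clamped to the valid range; return True if it moved."""
        before = self.position
        self.position = max(0, min(self.position + delta, self.max_position))
        return self.position != before


def scroll_step(container, window, delta):
    """Try the container first; fall back to the window if it did not move."""
    if container.scroll_by(delta):
        return "container"
    if window.scroll_by(delta):
        return "window"
    # Neither target moved, so treat this as the end of the scroll.
    return "end"


# Window-level scrolling: the container never moves.
container = ScrollTarget(max_position=0)
window = ScrollTarget(max_position=250)
print([scroll_step(container, window, 100) for _ in range(4)])
# ['window', 'window', 'window', 'end']
```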

Tested with a local virtual-scroll page (50 items, 10 visible at a time):
- Before: 10/50 captured (only last batch)
- After:  50/50 captured

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>