fix: scan_full_page captures all content on virtual-scroll pages (#731) by hafezparast · Pull Request #1853 · unclecode/crawl4ai

hafezparast · 2026-03-23T06:12:52Z

Summary

_handle_full_page_scan: installs a MutationObserver before scrolling to capture DOM elements that are removed during scroll (virtual scroll recycling). After scrolling, deduplicates against still-visible elements and re-injects accumulated content into a hidden <div> so page.content() includes everything.
_handle_virtual_scroll: falls back to window.scrollBy() when container.scrollTop has no effect (window-level scrolling, e.g. Twitter/X). Updates end-of-scroll detection accordingly.
Non-virtual-scroll pages are unaffected: if no elements are removed during scrolling, no injection occurs.

Before fix: scan_full_page=True on a 50-item virtual-scroll page → only 10 items (last batch)
After fix: all 50 items captured
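
The capture, deduplicate, and re-inject flow described above can be modelled in plain Python. This is only a toy sketch, not the PR's code: the real change runs a MutationObserver in the browser, whereas here `VirtualList`, `removed_log`, `scan_without_capture`, and `scan_with_capture` are hypothetical names invented for illustration, and list items stand in for DOM elements.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualList:
    """Toy model of a virtual-scroll list: only `window` items exist in the 'DOM'."""
    total: int = 50
    window: int = 10
    start: int = 0
    # Stands in for the MutationObserver: records every item that gets removed.
    removed_log: list = field(default_factory=list)

    def visible(self):
        end = min(self.start + self.window, self.total)
        return [f"item-{i}" for i in range(self.start, end)]

    def scroll(self, step):
        """Advance the window; items that leave it are 'removed from the DOM'."""
        before = self.visible()
        self.start = min(self.start + step, self.total - self.window)
        after = set(self.visible())
        self.removed_log.extend(item for item in before if item not in after)


def scan_without_capture(page, step=10):
    """Old behaviour: scroll to the end, then read what is still visible."""
    while page.start < page.total - page.window:
        page.scroll(step)
    return page.visible()


def scan_with_capture(page, step=10):
    """New behaviour: also recover removed items, skipping any still visible."""
    while page.start < page.total - page.window:
        page.scroll(step)
    visible = page.visible()
    seen = set(visible)
    recovered = []
    for item in page.removed_log:
        if item not in seen:  # deduplicate against visible and earlier items
            seen.add(item)
            recovered.append(item)
    # Recovered items were removed earlier, so they precede the visible batch.
    return recovered + visible


print(len(scan_without_capture(VirtualList())))  # 10
print(len(scan_with_capture(VirtualList())))     # 50
```

On a page where nothing is removed while scrolling, `removed_log` stays empty and the result is just the visible content, which mirrors the claim that non-virtual-scroll pages are unaffected.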

Fixes #731, related to #1087

Test plan

Reproduction test (tests/test_repro_731.py) — local virtual-scroll page with 50 items, verifies all 50 captured
13 adversarial tests covering: virtual scroll (50/100 items), static pages (no false positives), empty pages, append-based infinite scroll, nested virtual scroll with header/footer preservation, low max_steps, observer cleanup, element ordering, hidden container verification, deduplication
Full regression suite: 303 passed, 0 regressions (1 pre-existing HuggingFace failure unrelated)

🤖 Generated with Claude Code

…n virtual-scroll pages (unclecode#731)

scan_full_page previously captured HTML only once after scrolling completed, so pages that recycle DOM elements (Twitter/X, Xiaohongshu, etc.) only returned the last visible batch. VirtualScrollConfig had a similar gap: it used container.scrollTop which has no effect when the window is the scrollable element.

_handle_full_page_scan:
- Install a MutationObserver before scrolling that captures every element removed from the DOM (i.e. recycled by virtual scroll).
- After scrolling, deduplicate the accumulated elements against what is still visible, then re-inject them into a hidden container so that the subsequent page.content() call includes everything.
- Clean up the observer on both success and error paths.

_handle_virtual_scroll:
- Fall back to window.scrollBy() when container.scrollTop has no effect (window-level scrolling, e.g. Twitter).
- Update end-of-scroll detection to check window position when using the window fallback.

Tested with a local virtual-scroll page (50 items, 10 visible at a time):
- Before: 10/50 captured (only last batch)
- After: 50/50 captured

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
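
The window fallback in _handle_virtual_scroll can be illustrated with a small Python model. This is a sketch under assumptions, not the PR's implementation: `Scrollable` and `scroll_step` are hypothetical names, integer positions stand in for `scrollTop` and the window scroll offset, and the rule for a scrollable container that is already at its end is a guess at reasonable behaviour.

```python
from dataclasses import dataclass


@dataclass
class Scrollable:
    """Toy scroll target; max_scroll == 0 models an element that cannot scroll."""
    max_scroll: int
    position: int = 0

    def scroll_by(self, delta):
        # Clamp to the valid range, as a browser would.
        self.position = max(0, min(self.position + delta, self.max_scroll))


def scroll_step(container, window, delta):
    """Scroll the container; if that has no effect, scroll the window instead.

    Returns (target_used, at_end) so the caller can stop at the end of scroll.
    """
    before = container.position
    container.scroll_by(delta)
    if container.position != before:
        return "container", container.position >= container.max_scroll
    if container.max_scroll > 0:
        # Assumption: a scrollable container that did not move is at its end.
        return "container", True

    # Fallback: the window is the real scrollable element.
    before = window.position
    window.scroll_by(delta)
    moved = window.position != before
    # End-of-scroll is judged on the window position when the fallback is used.
    return "window", (not moved) or window.position >= window.max_scroll


container = Scrollable(max_scroll=0)   # container.scrollTop has no effect
window = Scrollable(max_scroll=250)
print(scroll_step(container, window, 100))  # ('window', False)
print(scroll_step(container, window, 100))  # ('window', False)
print(scroll_step(container, window, 100))  # ('window', True)
```

The key design point is that the end-of-scroll check follows whichever target actually moved, so a window-level page is not declared finished just because the container never scrolled.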

hafezparast mentioned this pull request Mar 23, 2026

[Bug]: When extracting data with scroll_full_page, only the final elements get parsed #731

Open

hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-scan-full-page-virtual-scroll-731
