fix: scan_full_page handles DOM-recycling virtual scroll sites (#731) #1868

Open
umerkhan95 wants to merge 1 commit into unclecode:develop from umerkhan95:fix/scan-full-page-virtual-scroll-731

@umerkhan95

Summary

Makes scan_full_page=True automatically detect and handle DOM-recycling virtual scroll pages without requiring VirtualScrollConfig or a container selector.

Before: scan_full_page=True on a virtual-scroll page captures only the last visible batch of recycled DOM elements.
After: All items captured — tested on skills.sh (9,707 elements) and 13 local patterns.

Fixes #731

List of files changed and why

  • crawl4ai/async_crawler_strategy.py — Rewrote _handle_full_page_scan with a 5-phase pipeline: (1) setup, (2) detect recycling via fingerprint comparison + MutationObserver, (3) scroll + capture with dedup for each recycling container, (4) container-scroll pass for overflow-y/x containers, (5) fallback scroll-to-bottom for append-based pages. Supports horizontal, 2D grid, multiple containers, and nested scroll.
  • crawl4ai/async_configs.py — Added max_no_change and max_captured_elements params to VirtualScrollConfig, plus from_dict forward-compat for unknown keys.
  • crawl4ai/content_scraping_strategy.py — Preserve tail text when removing elements (prevents silent text loss in lxml).
  • tests/test_virtual_scroll.py — 12 pytest tests for VirtualScrollConfig and scan_full_page paths.
  • test_virtual_scroll_compat.py — 13 standalone tests covering all major virtualisation patterns (transform, innerHTML-wipe, append, container-scroll, variable heights, horizontal, 2D grid, multiple containers, nested, async-loaded, embedded widget, 1000-item stress, real site).
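
The tail-text fix in content_scraping_strategy.py addresses a general property of lxml-style trees: text that follows an element is stored on that element as its tail, so removing the element silently drops the text too. The sketch below illustrates the idea using the standard library's xml.etree.ElementTree, which has the same tail semantics. It is not the code from this PR, and the helper name remove_preserving_tail is hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent: ET.Element, child: ET.Element) -> None:
    """Remove `child` from `parent` but keep the text that follows it.

    Hypothetical helper: a stand-in for the idea, not the PR's implementation.
    """
    tail = child.tail or ""
    if tail:
        children = list(parent)
        index = children.index(child)
        if index == 0:
            # No previous sibling: the tail joins the parent's leading text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise the tail joins the previous sibling's tail.
            previous = children[index - 1]
            previous.tail = (previous.tail or "") + tail
    parent.remove(child)


SNIPPET = "<p>Price: <script>x()</script>42 USD</p>"

# Naive removal: the "42 USD" tail disappears together with <script>.
naive = ET.fromstring(SNIPPET)
naive.remove(naive.find("script"))
naive_text = "".join(naive.itertext())

# Tail-preserving removal keeps the trailing text in place.
fixed = ET.fromstring(SNIPPET)
remove_preserving_tail(fixed, fixed.find("script"))
fixed_text = "".join(fixed.itertext())
```

With the naive removal the paragraph text ends after "Price: ", while the tail-preserving version keeps "42 USD".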

How Has This Been Tested?

  • 13/13 local test patterns pass (each served from a self-contained HTML fixture via HTTPServer)
  • Real-world validation on skills.sh: 9,707 elements recovered, 9,727 unique hrefs, 981 scroll steps
  • Real-world validation on quotes.toscrape.com/scroll: 100/100 quotes captured
  • Append-based infinite scroll regression test passes (no false positives on non-recycling pages)
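
For readers who want to reproduce the local setup, a self-contained fixture can be served from an in-process HTTPServer along these lines. This is a minimal sketch under assumptions: the fixture markup, the FixtureHandler class, and the serve_fixture helper are hypothetical and not taken from the PR's test files.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture: a real one would embed the virtual-scroll script.
FIXTURE_HTML = b"<html><body><div id='list'></div></body></html>"


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same in-memory page for every path.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(FIXTURE_HTML)))
        self.end_headers()
        self.wfile.write(FIXTURE_HTML)

    def log_message(self, format, *args):
        # Keep test output quiet.
        pass


def serve_fixture():
    """Start the server on a free port and return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/"


server, url = serve_fixture()
try:
    # Bypass any configured proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        body = response.read()
finally:
    server.shutdown()
    server.server_close()
```

Binding to port 0 lets the operating system pick a free port, so several fixtures can run side by side without clashing.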

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…code#731)

Makes scan_full_page=True automatically detect and capture content from
virtual scroll pages that recycle DOM nodes. No VirtualScrollConfig or
container selector required.

Detects three recycling patterns (transform-based, innerHTML-wipe, node
swap) via fingerprint comparison and MutationObserver. Scrolls and
captures with fingerprint-based dedup, then injects merged HTML so
page.content() returns everything.

Also handles horizontal scroll, 2D grids, multiple containers, nested
scroll, and container-level overflow-y/x — all with automatic detection.

Tested on skills.sh (9,707 elements recovered) and 13 local patterns
covering all major virtualisation strategies.
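
The detection and dedup steps described in the commit message can be illustrated with a small pure-Python simulation. It is a sketch under stated assumptions, not the PR's code: the actual logic runs inside the browser page, and the names fingerprint, is_recycling, visible_window, and scan_with_dedup, as well as the SHA-1 fingerprint, are hypothetical choices made for illustration.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Whitespace-insensitive content hash for one rendered item."""
    normalized = " ".join(html.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_recycling(before: list[str], after: list[str]) -> bool:
    """True when items seen before the scroll are gone afterwards.

    Append-based pages keep old items, so nothing goes missing there.
    """
    before_fp = {fingerprint(h) for h in before}
    after_fp = {fingerprint(h) for h in after}
    return bool(before_fp - after_fp)


def visible_window(items: list[str], offset: int, size: int) -> list[str]:
    """Simulate a virtual scroller that renders only `size` items."""
    return items[offset:offset + size]


def scan_with_dedup(items: list[str], window: int = 5, step: int = 3) -> list[str]:
    """Scroll in overlapping steps, keeping each item the first time it appears."""
    seen: set[str] = set()
    merged: list[str] = []
    offset = 0
    while offset < len(items):
        for html in visible_window(items, offset, window):
            fp = fingerprint(html)
            if fp not in seen:
                seen.add(fp)
                merged.append(html)
        offset += step
    return merged


all_items = [f"<li><a href='/item/{i}'>Item {i}</a></li>" for i in range(20)]

first = visible_window(all_items, 0, 5)
second = visible_window(all_items, 3, 5)
recycling_detected = is_recycling(first, second)      # earlier items vanished
append_detected = is_recycling(first, all_items[:8])  # earlier items still present

last_batch_only = visible_window(all_items, 15, 5)    # what a naive scan would keep
merged = scan_with_dedup(all_items)
```

In this simulation a naive scan ends with only the final five rendered items, while the merged list holds all twenty in their original order. Overlapping scroll steps make duplicates likely, which is why each item is fingerprinted before it is added.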