fix: scan_full_page handles DOM-recycling virtual scroll sites (#731) by umerkhan95 · Pull Request #1868 · unclecode/crawl4ai

umerkhan95 · 2026-03-26T06:33:38Z

Summary

Makes scan_full_page=True automatically detect and handle DOM-recycling virtual scroll pages without requiring VirtualScrollConfig or a container selector.

Before: scan_full_page=True on a virtual-scroll page captures only the last visible batch of recycled DOM elements.
After: All items captured — tested on skills.sh (9,707 elements) and 13 local patterns.

Fixes #731

List of files changed and why

crawl4ai/async_crawler_strategy.py — Rewrote _handle_full_page_scan with a 5-phase pipeline: (1) setup, (2) detect recycling via fingerprint comparison + MutationObserver, (3) scroll + capture with dedup for each recycling container, (4) container-scroll pass for overflow-y/x containers, (5) fallback scroll-to-bottom for append-based pages. Supports horizontal, 2D grid, multiple containers, and nested scroll.
crawl4ai/async_configs.py — Added max_no_change and max_captured_elements params to VirtualScrollConfig, plus from_dict forward-compat for unknown keys.
crawl4ai/content_scraping_strategy.py — Preserve tail text when removing elements (prevents silent text loss in lxml).
tests/test_virtual_scroll.py — 12 pytest tests for VirtualScrollConfig and scan_full_page paths.
test_virtual_scroll_compat.py — 13 standalone tests covering all major virtualisation patterns (transform, innerHTML-wipe, append, container-scroll, variable heights, horizontal, 2D grid, multiple containers, nested, async-loaded, embedded widget, 1000-item stress, real site).
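
The tail-text issue mentioned for content_scraping_strategy.py can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree, which shares lxml's text/tail model, and is not the PR's actual code. The helper name remove_preserving_tail and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent: ET.Element, child: ET.Element) -> None:
    """Remove `child` from `parent` but keep the text that follows it.

    Hypothetical helper for illustration only; not taken from crawl4ai.
    """
    tail = child.tail or ""
    if tail:
        siblings = list(parent)
        index = siblings.index(child)
        if index > 0:
            # Re-attach the tail to the previous sibling's tail.
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + tail
    parent.remove(child)


SAMPLE = "<p>Price: <script>track()</script>42 USD <b>now</b><i>x</i> only</p>"

# Naive removal: the text after <script> ("42 USD ") disappears with it.
naive = ET.fromstring(SAMPLE)
naive.remove(naive.find("script"))
naive_text = "".join(naive.itertext())

# Tail-preserving removal keeps the text that followed each removed element.
root = ET.fromstring(SAMPLE)
remove_preserving_tail(root, root.find("script"))
remove_preserving_tail(root, root.find("i"))
kept_text = "".join(root.itertext())
```

In this model the text following an element is stored on the element itself, so removing the element without first moving its tail drops that text without any error.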

How Has This Been Tested?

13/13 local test patterns pass (each served from a self-contained HTML fixture via HTTPServer)
Real-world validation on skills.sh: 9,707 elements recovered, 9,727 unique hrefs, 981 scroll steps
Real-world validation on quotes.toscrape.com/scroll: 100/100 quotes captured
Append-based infinite scroll regression test passes (no false positives on non-recycling pages)
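
Serving a self-contained HTML fixture from a local HTTPServer, as the local patterns above are described, might look roughly like the following. This is a sketch using only the standard library. The fixture markup, FixtureHandler, and serve_fixture are assumptions for illustration, not the contents of test_virtual_scroll_compat.py.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture: a tiny page with a scrollable viewport element.
FIXTURE_HTML = (
    b"<!doctype html><html><body>"
    b"<div id='viewport' style='height:200px;overflow-y:auto'></div>"
    b"</body></html>"
)


class FixtureHandler(BaseHTTPRequestHandler):
    """Serve the same in-memory HTML document for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(FIXTURE_HTML)))
        self.end_headers()
        self.wfile.write(FIXTURE_HTML)

    def log_message(self, format, *args):
        # Keep test output quiet.
        pass


def serve_fixture():
    """Start the server on an OS-assigned free port in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, f"http://127.0.0.1:{server.server_address[1]}/"


server, url = serve_fixture()
try:
    # Bypass any configured proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        body = response.read()
finally:
    server.shutdown()
    server.server_close()
```

In a real test the URL would be handed to the crawler instead of urllib. Binding to port 0 avoids port clashes when several fixtures run in the same session.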

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…code#731) Makes scan_full_page=True automatically detect and capture content from virtual scroll pages that recycle DOM nodes. No VirtualScrollConfig or container selector required. Detects three recycling patterns (transform-based, innerHTML-wipe, node swap) via fingerprint comparison and MutationObserver. Scrolls and captures with fingerprint-based dedup, then injects merged HTML so page.content() returns everything. Also handles horizontal scroll, 2D grids, multiple containers, nested scroll, and container-level overflow-y/x — all with automatic detection. Tested on skills.sh (9,707 elements recovered) and 13 local patterns covering all major virtualisation strategies.
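
The fingerprint-based dedup described above can be sketched in plain Python against a simulated recycling list. This is an illustration under stated assumptions, not the PR's implementation. The SHA-1 fingerprint, the simulated_virtual_list generator, and the meaning given to max_no_change (stop after that many consecutive scroll steps with no new items) are guesses made for the example.

```python
import hashlib
from typing import Iterable, Iterator, List


def fingerprint(html: str) -> str:
    """Identity for a captured element: hash of its whitespace-normalized HTML."""
    normalized = " ".join(html.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def capture_with_dedup(windows: Iterable[List[str]], max_no_change: int = 3) -> List[str]:
    """Merge the elements seen at each scroll step, keeping first-seen order.

    Assumption: max_no_change means "stop after this many consecutive
    steps that produced no new fingerprints".
    """
    seen = set()
    merged: List[str] = []
    idle_steps = 0
    for window in windows:
        new_items = 0
        for html in window:
            key = fingerprint(html)
            if key not in seen:
                seen.add(key)
                merged.append(html)
                new_items += 1
        idle_steps = 0 if new_items else idle_steps + 1
        if idle_steps >= max_no_change:
            break
    return merged


def simulated_virtual_list(total: int, visible: int, step: int) -> Iterator[List[str]]:
    """Yield the rows a recycling list would hold in the DOM at each scroll step."""
    rows = [f'<li data-id="{i}">Item {i}</li>' for i in range(total)]
    start = 0
    while True:
        yield rows[start:start + visible]
        # Scrolling past the end keeps showing the final window.
        start = min(start + step, total - visible)


merged = capture_with_dedup(simulated_virtual_list(total=50, visible=10, step=6))

# What a single end-of-scroll snapshot would contain: only the last window.
last_batch_only = [f'<li data-id="{i}">Item {i}</li>' for i in range(40, 50)]
```

Because consecutive windows overlap, deduplicating by fingerprint keeps each row once, and the merged list covers every row instead of only the final window.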

umerkhan95 mentioned this pull request Mar 26, 2026

[Bug]: When extracting data with scroll_full_page, only the final elements get parsed #731

Open

umerkhan95 wants to merge 1 commit into unclecode:develop from umerkhan95:fix/scan-full-page-virtual-scroll-731
