fix: scan_full_page handles DOM-recycling virtual scroll sites (#731) #1868
Open
umerkhan95 wants to merge 1 commit into unclecode:develop
…code#731)

Makes scan_full_page=True automatically detect and capture content from virtual scroll pages that recycle DOM nodes. No VirtualScrollConfig or container selector required.

Detects three recycling patterns (transform-based, innerHTML-wipe, node swap) via fingerprint comparison and MutationObserver. Scrolls and captures with fingerprint-based dedup, then injects merged HTML so page.content() returns everything.

Also handles horizontal scroll, 2D grids, multiple containers, nested scroll, and container-level overflow-y/x, all with automatic detection.

Tested on skills.sh (9,707 elements recovered) and 13 local patterns covering all major virtualisation strategies.
Summary
Makes scan_full_page=True automatically detect and handle DOM-recycling virtual scroll pages without requiring VirtualScrollConfig or a container selector.

Before: scan_full_page=True on a virtual-scroll page captures only the last visible batch of recycled DOM elements.

After: All items are captured. Tested on skills.sh (9,707 elements) and 13 local patterns.
Fixes #731
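
The before/after difference can be reproduced in a toy model. The sketch below is not the PR's implementation: it simulates a recycling list in plain Python, and the helper names (fingerprint, visible_batch, scan_with_dedup), the SHA-1 fingerprint scheme, and the way max_no_change is used as a stop condition are all assumptions made for illustration.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Short hash of whitespace-normalised HTML (hypothetical fingerprint scheme)."""
    normalised = " ".join(html.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()[:16]


def visible_batch(items, scroll_index, window=5):
    """Simulate DOM recycling: only `window` rows exist in the DOM at once."""
    return items[scroll_index:scroll_index + window]


def scan_with_dedup(items, window=5, step=3, max_no_change=3):
    """Scroll the simulated list, keeping each distinct row once in first-seen order.

    Stops after `max_no_change` consecutive scroll steps that yield no new rows
    (an assumed reading of the parameter name, not confirmed by the PR).
    """
    seen = set()
    merged = []
    no_change = 0
    position = 0
    while no_change < max_no_change:
        new_rows = 0
        for html in visible_batch(items, position, window):
            fp = fingerprint(html)
            if fp not in seen:
                seen.add(fp)
                merged.append(html)
                new_rows += 1
        no_change = 0 if new_rows else no_change + 1
        position += step  # step < window, so consecutive batches overlap
    return merged


items = [f"<li>item {i}</li>" for i in range(20)]

# "Before": reading the DOM once at the end sees only the last recycled batch.
last_batch_only = visible_batch(items, len(items) - 5)

# "After": capturing at every scroll step and merging by fingerprint keeps all rows.
merged = scan_with_dedup(items)
```

Because the scroll step is smaller than the window, consecutive batches overlap, so the fingerprint set is what keeps overlapping rows from being stored twice.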
List of files changed and why
- crawl4ai/async_crawler_strategy.py: Rewrote _handle_full_page_scan with a 5-phase pipeline: (1) setup, (2) detect recycling via fingerprint comparison + MutationObserver, (3) scroll + capture with dedup for each recycling container, (4) container-scroll pass for overflow-y/x containers, (5) fallback scroll-to-bottom for append-based pages. Supports horizontal, 2D grid, multiple containers, and nested scroll.
- crawl4ai/async_configs.py: Added max_no_change and max_captured_elements params to VirtualScrollConfig, plus from_dict forward-compat for unknown keys.
- crawl4ai/content_scraping_strategy.py: Preserve tail text when removing elements (prevents silent text loss in lxml).
- tests/test_virtual_scroll.py: 12 pytest tests for VirtualScrollConfig and scan_full_page paths.
- test_virtual_scroll_compat.py: 13 standalone tests covering all major virtualisation patterns (transform, innerHTML-wipe, append, container-scroll, variable heights, horizontal, 2D grid, multiple containers, nested, async-loaded, embedded widget, 1000-item stress, real site).

How Has This Been Tested?
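
The tail-text change listed for content_scraping_strategy.py can be checked in isolation. The sketch below is not taken from the PR's test files: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (both attach trailing text to the removed element as its tail), and remove_preserving_tail is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent, child):
    """Remove `child` from `parent` without losing the text that follows it."""
    tail = child.tail or ""
    children = list(parent)
    index = children.index(child)
    if index == 0:
        # No previous sibling: the tail becomes part of the parent's leading text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise append it to the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


SOURCE = "<p>Hello <script>x()</script>world <b>!</b></p>"

# Naive removal: "world " is the tail of <script>, so it disappears with it.
naive_root = ET.fromstring(SOURCE)
naive_root.remove(naive_root.find("script"))
naive_text = "".join(naive_root.itertext())

# Tail-preserving removal keeps the trailing text in place.
safe_root = ET.fromstring(SOURCE)
remove_preserving_tail(safe_root, safe_root.find("script"))
safe_text = "".join(safe_root.itertext())
```

Comparing naive_text with safe_text shows the silent loss: the naive path drops "world " even though that text was never inside the removed element.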
Checklist