feat: Migrate to Scrapy's native AsyncCrawlerRunner#793

Open
vdusek wants to merge 3 commits into master from fix/scrapy-async-crawler-runner

@vdusek vdusek commented Feb 16, 2026

Description

  • Adopt Scrapy 2.14's AsyncCrawlerRunner to eliminate the deferred_to_future conversion layer.
  • Function run_scrapy_actor now handles install_reactor internally, removing boilerplate from user code.
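
As a rough illustration of the first bullet, the two helpers below contrast the conversion-layer style with the native style. This is only a sketch: the function names `crawl_with_conversion_layer` and `crawl_natively` are hypothetical and not part of this PR, and the Scrapy imports are placed inside the functions so that defining them does not load Scrapy.

```python
import inspect
from typing import Any


async def crawl_with_conversion_layer(spider_cls: type, settings: Any) -> None:
    """Style being replaced: CrawlerRunner.crawl() returns a Twisted Deferred."""
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.defer import deferred_to_future

    runner = CrawlerRunner(settings)
    # The Deferred has to be converted before asyncio code can await it.
    await deferred_to_future(runner.crawl(spider_cls))


async def crawl_natively(spider_cls: type, settings: Any) -> None:
    """Style adopted here: AsyncCrawlerRunner.crawl() can be awaited directly."""
    from scrapy.crawler import AsyncCrawlerRunner

    runner = AsyncCrawlerRunner(settings)
    await runner.crawl(spider_cls)


# Both helpers are ordinary coroutine functions; nothing runs until awaited.
print(inspect.iscoroutinefunction(crawl_with_conversion_layer))
print(inspect.iscoroutinefunction(crawl_natively))
```

Either helper still needs the asyncio-compatible Twisted reactor to be installed before it is awaited; per the second bullet, run_scrapy_actor is expected to take care of that.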

Issue

Test plan

  • CI passes

@vdusek vdusek self-assigned this Feb 16, 2026
@github-actions github-actions bot added this to the 134th sprint - Tooling team milestone Feb 16, 2026
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Feb 16, 2026
@vdusek vdusek added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 16, 2026

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 7.69231% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.24%. Comparing base (e1bdbc9) to head (8bfc41c).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/apify/scrapy/_actor_runner.py | 0.00% | 12 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #793      +/-   ##
==========================================
- Coverage   85.47%   85.24%   -0.23%     
==========================================
  Files          46       46              
  Lines        2691     2697       +6     
==========================================
- Hits         2300     2299       -1     
- Misses        391      398       +7     
| Flag | Coverage Δ |
| --- | --- |
| e2e | 35.40% <0.00%> (?) |
| integration | 57.50% <0.00%> (-0.13%) ⬇️ |
| unit | 72.19% <7.69%> (-0.20%) ⬇️ |

@vdusek vdusek changed the title from "fix: migrate to Scrapy's native AsyncCrawlerRunner" to "fix: Migrate to Scrapy's native AsyncCrawlerRunner" Feb 16, 2026
@vdusek vdusek changed the title from "fix: Migrate to Scrapy's native AsyncCrawlerRunner" to "feat: Migrate to Scrapy's native AsyncCrawlerRunner" Feb 16, 2026
@vdusek vdusek force-pushed the fix/scrapy-async-crawler-runner branch from dd6317e to f831b18 Compare February 18, 2026 08:03
vdusek and others added 2 commits February 18, 2026 18:28
Adopt Scrapy 2.14's AsyncCrawlerRunner to eliminate the Deferred conversion
layer (deferred_to_future). The run_scrapy_actor function now handles
asyncio reactor installation internally, removing boilerplate from user code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scrapy 2.14+ deprecated the spider argument in process_item() and newer
versions no longer pass it, causing TypeError in PriceCleanerPipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
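
The failure described in the second commit message can be reproduced without Scrapy. The sketch below is an assumption-laden stand-in: `LegacyPriceCleaner`, `UpdatedPriceCleaner`, and `call_without_spider` are hypothetical names, and the price-cleaning logic is invented for illustration; only the signature difference reflects the commit message.

```python
from typing import Any


class LegacyPriceCleaner:
    """Hypothetical pipeline that still requires the spider argument."""

    def process_item(self, item: dict[str, Any], spider: Any) -> dict[str, Any]:
        item["price"] = float(str(item["price"]).lstrip("$").replace(",", ""))
        return item


class UpdatedPriceCleaner:
    """Same logic, with the spider argument dropped from the signature."""

    def process_item(self, item: dict[str, Any]) -> dict[str, Any]:
        item["price"] = float(str(item["price"]).lstrip("$").replace(",", ""))
        return item


def call_without_spider(pipeline: Any, item: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for a caller that passes only the item (an assumption)."""
    return pipeline.process_item(item)


try:
    call_without_spider(LegacyPriceCleaner(), {"price": "$1,234.50"})
except TypeError as exc:
    # The required positional parameter `spider` was not supplied.
    print(f"legacy signature fails: {exc}")

print(call_without_spider(UpdatedPriceCleaner(), {"price": "$1,234.50"}))
```

A pipeline that still needs access to the spider would have to obtain it some other way; that is outside what this commit message covers.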
@vdusek vdusek force-pushed the fix/scrapy-async-crawler-runner branch from f831b18 to c013e33 Compare February 18, 2026 17:29
@vdusek vdusek marked this pull request as ready for review February 18, 2026 17:29
@vdusek vdusek requested review from Pijukatel and Copilot February 18, 2026 17:29
Copilot AI left a comment

Pull request overview

This PR migrates the Apify-Scrapy integration to Scrapy 2.14’s native async APIs (AsyncCrawlerRunner) and moves Twisted reactor installation into run_scrapy_actor to reduce user boilerplate when running Scrapy inside an Actor.

Changes:

  • Bump Scrapy minimum version to >=2.14.0 and update e2e fixtures/tests accordingly.
  • Switch sample actors/docs to AsyncCrawlerRunner and remove deferred_to_future usage.
  • Refactor run_scrapy_actor to install the asyncio-compatible Twisted reactor internally.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 10 comments.

Show a summary per file
| File | Description |
| --- | --- |
| uv.lock | Updates locked Scrapy constraint to >=2.14.0. |
| pyproject.toml | Raises scrapy optional extra minimum version and adjusts Ruff per-file ignore list. |
| src/apify/scrapy/_actor_runner.py | Moves reactor installation into run_scrapy_actor and simplifies coroutine bridging. |
| src/apify/scrapy/pipelines/actor_dataset_push.py | Adjusts pipeline signature/logging for dataset pushes. |
| tests/unit/scrapy/pipelines/test_actor_dataset_push.py | Updates unit test expectations/calls for pipeline behavior. |
| tests/e2e/test_actor_scrapy.py | Updates actor e2e to require Scrapy >=2.14.0. |
| tests/e2e/test_scrapy/test_basic_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_cb_kwargs_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_crawl_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_custom_pipeline_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_itemloader_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/actor_source/__main__.py | Updates actor entrypoint to rely on run_scrapy_actor for reactor setup. |
| tests/e2e/test_scrapy/actor_source/main.py | Switches to AsyncCrawlerRunner in e2e actor fixture code. |
| tests/e2e/test_scrapy/actor_source/main_custom_pipeline.py | Switches to AsyncCrawlerRunner in custom-pipeline e2e actor fixture code. |
| tests/e2e/test_scrapy/actor_source/pipelines.py | Updates e2e actor pipeline fixture signature. |
| docs/03_guides/code/scrapy_project/src/__main__.py | Removes manual install_reactor from docs example entrypoint. |
| docs/03_guides/code/scrapy_project/src/main.py | Switches docs example to AsyncCrawlerRunner and awaits crawl() directly. |
| docs/03_guides/06_scrapy.mdx | Updates guide text to reflect reactor installation handled by run_scrapy_actor. |
Comments suppressed due to low confidence (2)

docs/03_guides/code/scrapy_project/src/main.py:12

  • This example still imports Scrapy (AsyncCrawlerRunner) and the spider module at import time. Since run_scrapy_actor() installs the asyncio reactor only when called from __main__.py, these module-level Scrapy imports can happen before reactor installation and can prevent switching to AsyncioSelectorReactor. Consider moving Scrapy/spider imports inside main() (or otherwise ensuring no Scrapy/Twisted reactor import occurs before run_scrapy_actor runs).
from scrapy.crawler import AsyncCrawlerRunner

from apify import Actor
from apify.scrapy import apply_apify_settings

# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider

docs/03_guides/code/scrapy_project/src/__main__.py:9

  • run_scrapy_actor() installs the reactor when it is called, but this module imports .main (which imports Scrapy) before that happens. If importing .main triggers Twisted reactor initialization, reactor installation inside run_scrapy_actor can fail. One way to avoid this is to keep Scrapy imports out of module top-level in main.py (import them inside main()), so importing .main doesn’t touch Twisted/Scrapy before run_scrapy_actor runs.
from apify.scrapy import initialize_logging, run_scrapy_actor

# Import your main Actor coroutine here.
from .main import main
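
The suggestion in the two suppressed comments above, keeping Scrapy imports out of module top level, might look roughly like the following version of main.py. This is a sketch under assumptions rather than the PR's actual code: it assumes apply_apify_settings() can be called with its defaults and that the spider needs no extra arguments.

```python
import inspect


async def main() -> None:
    """Actor entry coroutine with third-party imports deferred to run time."""
    # Deferred imports: importing this module does not load Scrapy or Twisted.
    # They are loaded only when the coroutine body actually starts running.
    from scrapy.crawler import AsyncCrawlerRunner

    from apify import Actor
    from apify.scrapy import apply_apify_settings

    # Import your Scrapy spider here.
    from .spiders import TitleSpider as Spider

    async with Actor:
        settings = apply_apify_settings()
        runner = AsyncCrawlerRunner(settings)
        await runner.crawl(Spider)


# Calling main() only creates a coroutine object; the body, including the
# imports above, has not executed yet at this point.
coro = main()
print(inspect.iscoroutine(coro))
coro.close()
```

Because calling main() only builds the coroutine object, an entrypoint that passes main() to run_scrapy_actor would not trigger the Scrapy imports until the coroutine is driven, which is after run_scrapy_actor has had the chance to install the reactor. The trade-off is that import errors surface later, and some linters flag function-level imports.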


Development

Successfully merging this pull request may close these issues.

Utilize Scrapy's native async runners - AsyncCrawlerRunner and/or AsyncCrawlerProcess

2 participants
