feat: Migrate to Scrapy's native AsyncCrawlerRunner #793
Codecov Report❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| | master | #793 | +/- |
|---|---|---|---|
| Coverage | 85.47% | 85.24% | -0.23% |
| Files | 46 | 46 | |
| Lines | 2691 | 2697 | +6 |
| Hits | 2300 | 2299 | -1 |
| Misses | 391 | 398 | +7 |
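The percentages in the coverage diff follow from the hit and line counts. A small sketch of that arithmetic, assuming coverage is simply hits divided by lines and rounded to two decimals (the helper name `coverage_pct` is made up for illustration and is not part of Codecov):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return line coverage as a percentage rounded to two decimals."""
    return round(100 * hits / lines, 2)


base = coverage_pct(2300, 2691)  # master: hits / lines
head = coverage_pct(2299, 2697)  # this PR: hits / lines
delta = round(head - base, 2)    # change reported in the "+/-" column
print(base, head, delta)
```

The drop comes from six added lines combined with one fewer hit, which matches the seven extra misses in the report.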
dd6317e to f831b18
Adopt Scrapy 2.14's AsyncCrawlerRunner to eliminate the Deferred conversion layer (deferred_to_future). The run_scrapy_actor function now handles asyncio reactor installation internally, removing boilerplate from user code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Scrapy 2.14+ deprecated the spider argument in process_item(), and newer versions no longer pass it, causing a TypeError in PriceCleanerPipeline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
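The `TypeError` mentioned in the second commit arises when a pipeline still declares `spider` as a required positional parameter. Below is a minimal sketch of one way to accept both calling conventions. The body of `PriceCleanerPipeline` is not shown on this page, so the cleaning rule and the dict-shaped item here are assumptions made purely for illustration:

```python
from __future__ import annotations

from typing import Any


class PriceCleanerPipeline:
    """Hypothetical stand-in; the real pipeline body is not shown in this PR."""

    def process_item(self, item: dict[str, Any], spider: Any = None) -> dict[str, Any]:
        # `spider` defaults to None so the method accepts both call styles:
        # process_item(item, spider) and process_item(item).
        price = item.get('price')
        if isinstance(price, str):
            # Assumed cleaning rule: strip a leading currency sign and parse.
            item['price'] = float(price.lstrip('$').strip())
        return item
```

Dropping the parameter entirely is the simpler option once only Scrapy >=2.14 is supported; the optional parameter is mainly useful when a pipeline has to run under both older and newer versions.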
f831b18 to c013e33
Pull request overview
This PR migrates the Apify-Scrapy integration to Scrapy 2.14’s native async APIs (AsyncCrawlerRunner) and moves Twisted reactor installation into run_scrapy_actor to reduce user boilerplate when running Scrapy inside an Actor.
Changes:
- Bump Scrapy minimum version to `>=2.14.0` and update e2e fixtures/tests accordingly.
- Switch sample actors/docs to `AsyncCrawlerRunner` and remove `deferred_to_future` usage.
- Refactor `run_scrapy_actor` to install the asyncio-compatible Twisted reactor internally.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Updates locked Scrapy constraint to >=2.14.0. |
| pyproject.toml | Raises scrapy optional extra minimum version and adjusts Ruff per-file ignore list. |
| src/apify/scrapy/_actor_runner.py | Moves reactor installation into run_scrapy_actor and simplifies coroutine bridging. |
| src/apify/scrapy/pipelines/actor_dataset_push.py | Adjusts pipeline signature/logging for dataset pushes. |
| tests/unit/scrapy/pipelines/test_actor_dataset_push.py | Updates unit test expectations/calls for pipeline behavior. |
| tests/e2e/test_actor_scrapy.py | Updates actor e2e to require Scrapy >=2.14.0. |
| tests/e2e/test_scrapy/test_basic_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_cb_kwargs_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_crawl_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_custom_pipeline_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/test_itemloader_spider.py | Updates Scrapy requirement for e2e spider fixture. |
| tests/e2e/test_scrapy/actor_source/__main__.py | Updates actor entrypoint to rely on run_scrapy_actor for reactor setup. |
| tests/e2e/test_scrapy/actor_source/main.py | Switches to AsyncCrawlerRunner in e2e actor fixture code. |
| tests/e2e/test_scrapy/actor_source/main_custom_pipeline.py | Switches to AsyncCrawlerRunner in custom-pipeline e2e actor fixture code. |
| tests/e2e/test_scrapy/actor_source/pipelines.py | Updates e2e actor pipeline fixture signature. |
| docs/03_guides/code/scrapy_project/src/__main__.py | Removes manual install_reactor from docs example entrypoint. |
| docs/03_guides/code/scrapy_project/src/main.py | Switches docs example to AsyncCrawlerRunner and awaits crawl() directly. |
| docs/03_guides/06_scrapy.mdx | Updates guide text to reflect reactor installation handled by run_scrapy_actor. |
Comments suppressed due to low confidence (2)
docs/03_guides/code/scrapy_project/src/main.py:12
- This example still imports Scrapy (`AsyncCrawlerRunner`) and the spider module at import time. Since `run_scrapy_actor()` installs the asyncio reactor only when called from `__main__.py`, these module-level Scrapy imports can happen before reactor installation and can prevent switching to `AsyncioSelectorReactor`. Consider moving Scrapy/spider imports inside `main()` (or otherwise ensuring no Scrapy/Twisted reactor import occurs before `run_scrapy_actor` runs).
from scrapy.crawler import AsyncCrawlerRunner
from apify import Actor
from apify.scrapy import apply_apify_settings
# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider
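The suggestion in this comment can be sketched as a `main.py` whose Scrapy-related imports are deferred into the coroutine body, so that importing the module does not load Scrapy or Twisted. This is only an illustration under stated assumptions: the `proxyConfiguration` input key is assumed, no spider arguments are passed, and the PR's actual docs example may differ.

```python
from __future__ import annotations


async def main() -> None:
    # Deferred imports: nothing from Scrapy or Twisted is loaded when this
    # module is imported, only when the coroutine actually runs.
    from scrapy.crawler import AsyncCrawlerRunner

    from apify import Actor
    from apify.scrapy import apply_apify_settings

    # Import your Scrapy spider here (same relative path as the original example).
    from .spiders import TitleSpider as Spider

    async with Actor:
        # Assumed input shape: an optional 'proxyConfiguration' key.
        actor_input = await Actor.get_input() or {}
        settings = apply_apify_settings(proxy_config=actor_input.get('proxyConfiguration'))

        runner = AsyncCrawlerRunner(settings)
        # crawl() is awaited directly, with no deferred_to_future wrapper.
        await runner.crawl(Spider)
```

With this layout, `from .main import main` in the entrypoint only defines a coroutine function, and the imports run after `run_scrapy_actor` has had the chance to install the reactor. The trade-off is that linters which forbid function-level imports may need a per-file exception.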
docs/03_guides/code/scrapy_project/src/__main__.py:9
- `run_scrapy_actor()` installs the reactor when it is called, but this module imports `.main` (which imports Scrapy) before that happens. If importing `.main` triggers Twisted reactor initialization, reactor installation inside `run_scrapy_actor` can fail. One way to avoid this is to keep Scrapy imports out of module top-level in `main.py` (import them inside `main()`), so importing `.main` doesn't touch Twisted/Scrapy before `run_scrapy_actor` runs.
from apify.scrapy import initialize_logging, run_scrapy_actor
# Import your main Actor coroutine here.
from .main import main
Description
- Adopt Scrapy 2.14's `AsyncCrawlerRunner` to eliminate the `deferred_to_future` conversion layer.
- `run_scrapy_actor` now handles `install_reactor` internally, removing boilerplate from user code.

Issue
- `AsyncCrawlerRunner` and/or `AsyncCrawlerProcess` #638

Test plan