[FEAT] Expose total_pages_processed in execution API response #1801

Open

pk-zipstack wants to merge 3 commits into main from
feat/expose-total-pages-in-extraction-llm-metadata

Conversation

@pk-zipstack
Contributor

@pk-zipstack pk-zipstack commented Feb 23, 2026

What

  • Add UsageHelper.get_aggregated_pages_processed() to aggregate page count from PageUsage model per run_id
  • Add aggregated_total_pages_processed property on WorkflowExecution model to aggregate pages across all file executions
  • Enrich per-file metadata with total_pages_processed for API destinations in _process_final_output()
  • Enrich combined metadata with total_pages_processed for DB destinations in get_combined_metadata()
  • Expose aggregated_total_pages_processed in ExecutionSerializer for the execution list API

Why

  • The PageUsage model already tracks pages_processed per run_id (file execution ID), but this data was not surfaced in API responses
  • Exposing page counts alongside token usage and cost data enables clients to track document processing volume per execution

How

  • usage_v2/helper.py: Added get_aggregated_pages_processed(run_id) static method that queries PageUsage.objects.filter(run_id=run_id) and aggregates Sum('pages_processed'), returning int | None
  • workflow_v2/models/execution.py: Added aggregated_total_pages_processed property that collects file execution IDs via self.file_executions, converts to strings, and queries PageUsage with run_id__in
  • workflow_v2/file_execution_tasks.py: In _process_final_output(), after destination.get_metadata() for API destinations, injects total_pages_processed into execution_metadata
  • endpoint_v2/destination.py: In get_combined_metadata(), adds total_pages_processed alongside existing usage token data
  • execution/serializer/execution.py: Added aggregated_total_pages_processed as a SerializerMethodField
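The helper's aggregation semantics (sum per run_id, None when nothing matches) can be sketched in plain Python. This is a hypothetical stand-in, not the actual Django ORM code from the PR; rows are modeled as dicts:

```python
# Plain-Python sketch of the aggregation semantics described above. This is a
# hypothetical stand-in, not the PR's Django ORM code: rows are dicts with
# "run_id" and "pages_processed" keys.
def get_aggregated_pages_processed(page_usage_rows, run_id):
    """Sum pages_processed for rows matching run_id; None when nothing matches.

    Returning None for the empty case mirrors Django's Sum() over an empty
    queryset, which yields None rather than 0.
    """
    matched = [
        row["pages_processed"] for row in page_usage_rows
        if row["run_id"] == run_id
    ]
    return sum(matched) if matched else None
```

The None-instead-of-0 distinction lets API clients tell "no PageUsage records exist" apart from "zero pages were processed".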

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled.)

  • No. All changes are additive — new helper method, new model property, new serializer field, and new keys in metadata dicts. No existing fields or behavior are modified. The total_pages_processed field gracefully returns None when no PageUsage data exists.

Database Migrations

  • None

Env Config

  • None

Relevant Docs

  • N/A

Related Issues or PRs

  • N/A

Dependencies Versions

  • No changes

Notes on Testing

  1. Trigger a workflow execution via API deployment and verify total_pages_processed appears in per-file response metadata
  2. Trigger a workflow execution via DB destination and verify total_pages_processed appears in combined metadata
  3. Call the execution list API and verify aggregated_total_pages_processed appears in the response
  4. Verify None is returned gracefully when no PageUsage records exist for an execution

Screenshots

(screenshot attached)

Checklist

I have read and understood the Contribution Guidelines.

pk-zipstack and others added 2 commits February 23, 2026 14:21
…adata

Surface page usage data from PageUsage model in API responses to support
tracking total pages processed per file execution and per workflow execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

Summary by CodeRabbit

  • New Features
    • Aggregated total pages processed is now tracked for workflow executions.
    • The aggregated value is exposed in execution details and serialized output, and added to API destination metadata for runs.
    • If no pages are found for a run, the field will be empty/blank.

Walkthrough

Adds aggregated pages-processed tracking: a helper that sums PageUsage by run_id(s), a WorkflowExecution property exposing the aggregate, a serializer field, and enrichment of endpoint/task metadata with the aggregated total_pages_processed. Returns None when no input or no matching records.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Usage Aggregation Helper<br>`backend/usage_v2/helper.py` | Added `UsageHelper.get_aggregated_pages_processed` static method (taking a run_id or a list of run_ids) that sums `PageUsage.pages_processed` and returns `int \| None`. |
| Model Layer<br>`backend/workflow_manager/workflow_v2/models/execution.py` | Added `aggregated_total_pages_processed` property on WorkflowExecution that collects related file_execution IDs and delegates aggregation to `UsageHelper.get_aggregated_pages_processed(run_ids=...)`. |
| Serializer<br>`backend/workflow_manager/execution/serializer/execution.py` | Added `aggregated_total_pages_processed` SerializerMethodField and `get_aggregated_total_pages_processed` to expose the model property in serialized output. |
| Endpoint & Task Enrichment<br>`backend/workflow_manager/endpoint_v2/destination.py`, `backend/workflow_manager/workflow_v2/file_execution_tasks.py` | Attach `total_pages_processed` to combined metadata / execution metadata for API destinations by calling `UsageHelper.get_aggregated_pages_processed(run_id=...)`. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as Endpoint/Destination
    participant Task as FileExecutionTasks
    participant Serializer
    participant Model as WorkflowExecution
    participant Helper as UsageHelper
    participant DB as PageUsage

    Client->>Endpoint: request combined metadata
    Endpoint->>Helper: get_aggregated_pages_processed(run_id)
    Helper->>DB: query PageUsage by run_id/run_ids
    DB-->>Helper: return records
    Helper-->>Endpoint: return summed pages_processed
    Endpoint-->>Client: return metadata (includes total_pages_processed)

    Task->>Helper: get_aggregated_pages_processed(run_id)
    Helper->>DB: query PageUsage
    DB-->>Helper: return records
    Helper-->>Task: return sum
    Task->>Serializer: include total_pages_processed in execution metadata

    Serializer->>Model: access aggregated_total_pages_processed
    Model->>Helper: get_aggregated_pages_processed(run_ids)
    Helper-->>Model: return sum
    Serializer-->>Client: serialized execution with aggregated_total_pages_processed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: exposing total_pages_processed in the execution API response, which aligns with the core objective of the PR. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, covering all required template sections including What, Why, How, breaking changes assessment, and testing notes, and includes visual evidence via a screenshot. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (4)
backend/workflow_manager/workflow_v2/models/execution.py (2)

262-280: PageUsage.run_id has no database index, but is now on multiple hot query paths.

The PageUsage model only indexes organization_id. Both filter(run_id=run_id) (in UsageHelper.get_aggregated_pages_processed) and filter(run_id__in=str_ids) (here) will full-scan the page_usage table as it grows. These queries are now triggered per-file in API deployments and per-row in execution list views.

A migration adding a db_index=True on run_id (or a Meta.indexes entry) is recommended:

# In account_usage/models.py – PageUsage.Meta
indexes = [
    models.Index(fields=["organization_id"]),
    models.Index(fields=["run_id"]),   # +
]

274-280: Same redundant .exists() + .aggregate() double-query as in UsageHelper.

Sum("pages_processed") on an empty queryset returns None for the key, so the .exists() check buys nothing. Collapsing to a single .aggregate() call saves one round-trip per property access (and this property is called per row in list-view serialization).

♻️ Proposed fix
-        str_ids = [str(fid) for fid in file_execution_ids]
-        queryset = PageUsage.objects.filter(run_id__in=str_ids)
-        if not queryset.exists():
-            return None
-
-        result = queryset.aggregate(total_pages=Sum("pages_processed"))
-        return result.get("total_pages")
+        str_ids = [str(fid) for fid in file_execution_ids]
+        result = PageUsage.objects.filter(run_id__in=str_ids).aggregate(
+            total_pages=Sum("pages_processed")
+        )
+        return result.get("total_pages")
backend/workflow_manager/execution/serializer/execution.py (1)

36-38: aggregated_total_pages_processed adds up to 3 extra DB queries per execution in list views.

The model property fires: (1) a values_list on file_executions, (2) a PageUsage.exists(), (3) a PageUsage.aggregate(). Combined with the existing per-item queries for get_successful_files and get_failed_files, execution list endpoints are now executing ~7 queries per row. The class-level TODO already calls this out; this field makes addressing it more urgent.

Consider annotating the aggregate in the queryset that feeds the list view (e.g., via a subquery annotation or a bulk prefetch_related approach) rather than resolving it lazily per object.
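The bulk approach the comment suggests (aggregate once for the whole page, then do a cheap per-row lookup) can be illustrated in plain Python; the actual Django Subquery/annotation wiring is omitted and rows are hypothetical dicts:

```python
from collections import defaultdict

# Hypothetical sketch of the "aggregate once, look up per row" idea suggested
# above. One pass over PageUsage-like rows replaces N per-object queries; the
# Django Subquery/OuterRef or Prefetch wiring is omitted.
def bulk_pages_by_run(page_usage_rows, run_ids):
    wanted = set(run_ids)
    totals = defaultdict(int)
    for row in page_usage_rows:
        if row["run_id"] in wanted:
            totals[row["run_id"]] += row["pages_processed"]
    # run_ids with no matching rows are simply absent, so callers can map a
    # missing key to None, matching the per-object property's contract
    return dict(totals)
```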

backend/usage_v2/helper.py (1)

37-45: Redundant .exists() check creates an unnecessary extra DB query.

Sum on an empty queryset already returns None for the aggregated key, so result.get("total_pages") will return None when there are no records — the .exists() guard is superfluous. The sibling method get_aggregated_token_count uses a single .aggregate() call for this reason.

Additionally, two static-analysis hints are valid here:

  • BLE001: Replace bare except Exception with a narrower exception type, or at minimum annotate the intent.
  • TRY400: logger.error suppresses the traceback; logger.exception (or logger.error(..., exc_info=True)) is preferred.
♻️ Proposed fix
-        try:
-            queryset = PageUsage.objects.filter(run_id=run_id)
-            if not queryset.exists():
-                return None
-            result = queryset.aggregate(total_pages=Sum("pages_processed"))
-            return result.get("total_pages")
-        except Exception as e:
-            logger.error(f"Error aggregating pages processed for run_id {run_id}: {e}")
-            return None
+        try:
+            result = PageUsage.objects.filter(run_id=run_id).aggregate(
+                total_pages=Sum("pages_processed")
+            )
+            return result.get("total_pages")
+        except Exception as e:  # noqa: BLE001
+            logger.exception(f"Error aggregating pages processed for run_id {run_id}: {e}")
+            return None

📥 Commits

Reviewing files that changed from the base of the PR and between d7babec and 25de508.

📒 Files selected for processing (5)
  • backend/usage_v2/helper.py
  • backend/workflow_manager/endpoint_v2/destination.py
  • backend/workflow_manager/execution/serializer/execution.py
  • backend/workflow_manager/workflow_v2/file_execution_tasks.py
  • backend/workflow_manager/workflow_v2/models/execution.py

Comment on lines 43 to 45
except Exception as e:
    logger.error(f"Error aggregating pages processed for run_id {run_id}: {e}")
    return None
Contributor

NIT: Check this exception handling behaviour. In almost all cases it's better to let the exception bubble up and have it handled by the middleware. Test it after removing this: we need to ensure that the user sees an appropriate error message (it need not be too specific), but at the same time we should log it with a traceback if it's a 5xx error.
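The pattern this comment suggests can be sketched in plain Python; the wrapper below is a hypothetical stand-in for real middleware, showing the split between the generic user-facing message and the traceback in the logs:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch of the suggested pattern: view code lets exceptions
# bubble up, and a middleware-like wrapper logs the full traceback while
# returning only a generic message to the user.
def handle_request(view_fn, *args):
    try:
        return {"status": 200, "body": view_fn(*args)}
    except Exception:
        # logger.exception records the stack trace for the 5xx
        logger.exception("Unhandled error while processing request")
        return {"status": 500, "body": "An internal error occurred."}
```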


class UsageHelper:
    @staticmethod
    def get_aggregated_pages_processed(run_id: str) -> int | None:
Contributor

There's some code duplication between this method and aggregated_total_pages_processed() of the WorkflowExecution model. Unify these implementations.

Comment on lines 275 to 279
queryset = PageUsage.objects.filter(run_id__in=str_ids)
if not queryset.exists():
    return None

result = queryset.aggregate(total_pages=Sum("pages_processed"))
Contributor

NIT: Code duplication

…ption handling

- Unify UsageHelper.get_aggregated_pages_processed() to accept either
  run_id or run_ids, eliminating duplicate PageUsage query logic
- WorkflowExecution.aggregated_total_pages_processed now delegates to
  UsageHelper instead of duplicating the aggregation
- Remove broad try/except so exceptions bubble up to middleware

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonarqubecloud

@github-actions
Contributor

Test Results

Summary
  • Runner Tests: 11 passed, 0 failed (11 total)
  • SDK1 Tests: 66 passed, 0 failed (66 total)

Runner Tests - Full Report

| filepath | function | passed | SUBTOTAL |
|---|---|---|---|
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |

SDK1 Tests - Full Report

| filepath | function | passed | SUBTOTAL |
|---|---|---|---|
| tests/test_platform.py | TestPlatformHelperRetry.test_success_on_first_attempt | 2 | 2 |
| tests/test_platform.py | TestPlatformHelperRetry.test_retry_on_connection_error | 2 | 2 |
| tests/test_platform.py | TestPlatformHelperRetry.test_non_retryable_http_error | 1 | 1 |
| tests/test_platform.py | TestPlatformHelperRetry.test_retryable_http_errors | 3 | 3 |
| tests/test_platform.py | TestPlatformHelperRetry.test_post_method_retry | 1 | 1 |
| tests/test_platform.py | TestPlatformHelperRetry.test_retry_logging | 1 | 1 |
| tests/test_prompt.py | TestPromptToolRetry.test_success_on_first_attempt | 1 | 1 |
| tests/test_prompt.py | TestPromptToolRetry.test_retry_on_errors | 2 | 2 |
| tests/test_prompt.py | TestPromptToolRetry.test_wrapper_methods_retry | 4 | 4 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_connection_error_is_retryable | 1 | 1 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_timeout_is_retryable | 1 | 1 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_http_error_retryable_status_codes | 3 | 3 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_http_error_non_retryable_status_codes | 5 | 5 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_http_error_without_response | 1 | 1 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_os_error_retryable_errno | 5 | 5 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_os_error_non_retryable_errno | 1 | 1 |
| tests/utils/test_retry_utils.py | TestIsRetryableError.test_other_exception_not_retryable | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCalculateDelay.test_exponential_backoff_without_jitter | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCalculateDelay.test_exponential_backoff_with_jitter | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCalculateDelay.test_max_delay_cap | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCalculateDelay.test_max_delay_cap_with_jitter | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_successful_call_first_attempt | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_retry_after_transient_failure | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_max_retries_exceeded | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_max_time_exceeded | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_retry_with_custom_predicate | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_no_retry_with_predicate_false | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_exception_not_in_tuple_not_retried | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryWithExponentialBackoff.test_delay_would_exceed_max_time | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_default_configuration | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_environment_variable_configuration | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_invalid_max_retries | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_invalid_max_time | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_invalid_base_delay | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_invalid_multiplier | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_jitter_values | 2 | 2 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_custom_exceptions_only | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_custom_predicate_only | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_both_exceptions_and_predicate | 1 | 1 |
| tests/utils/test_retry_utils.py | TestCreateRetryDecorator.test_exceptions_match_but_predicate_false | 1 | 1 |
| tests/utils/test_retry_utils.py | TestPreconfiguredDecorators.test_retry_platform_service_call_exists | 1 | 1 |
| tests/utils/test_retry_utils.py | TestPreconfiguredDecorators.test_retry_prompt_service_call_exists | 1 | 1 |
| tests/utils/test_retry_utils.py | TestPreconfiguredDecorators.test_platform_service_decorator_retries_on_connection_error | 1 | 1 |
| tests/utils/test_retry_utils.py | TestPreconfiguredDecorators.test_prompt_service_decorator_retries_on_timeout | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryLogging.test_warning_logged_on_retry | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryLogging.test_info_logged_on_success_after_retry | 1 | 1 |
| tests/utils/test_retry_utils.py | TestRetryLogging.test_exception_logged_on_giving_up | 1 | 1 |
| TOTAL | | 66 | 66 |

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/usage_v2/helper.py (1)

49-52: Remove the redundant exists() check — it causes an extra DB round-trip.

Django's Sum on an empty queryset returns {"total_pages": None}, so result.get("total_pages") already returns None with no records present. The exists() guard adds a second database hit for every call without changing the outcome.

♻️ Proposed simplification
-        if not queryset.exists():
-            return None
-        result = queryset.aggregate(total_pages=Sum("pages_processed"))
-        return result.get("total_pages")
+        result = queryset.aggregate(total_pages=Sum("pages_processed"))
+        return result.get("total_pages")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/usage_v2/helper.py` around lines 49 - 52, The current code performs
an extra DB round-trip by calling queryset.exists() before aggregating; remove
the exists() check and directly call
queryset.aggregate(total_pages=Sum("pages_processed")) and return
result.get("total_pages") — keep the aggregate on the same queryset (using Sum
and "pages_processed") so empty querysets yield None without the extra exists()
query.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews > Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 25de508 and adf2dc2.

📒 Files selected for processing (2)
  • backend/usage_v2/helper.py
  • backend/workflow_manager/workflow_v2/models/execution.py

Comment on lines +261 to +277
@property
def aggregated_total_pages_processed(self) -> int | None:
    """Retrieve aggregated total pages processed for this execution.

    Returns:
        int | None: Total pages processed across all file executions,
            or None if no page usage data exists.
    """
    from usage_v2.helper import UsageHelper

    file_execution_ids = list(self.file_executions.values_list("id", flat=True))
    if not file_execution_ids:
        return None

    return UsageHelper.get_aggregated_pages_processed(
        run_ids=[str(fid) for fid in file_execution_ids]
    )
⚠️ Potential issue | 🟠 Major

N+1 query risk when aggregated_total_pages_processed is serialized in a list endpoint.

Each call to this property issues at least 2 queries (1 for file_executions.values_list, 1 for the PageUsage aggregate). The AI summary confirms this property is exposed in ExecutionSerializer for the execution list API, so a list of N executions pays 2N extra queries — on top of any existing per-object properties already doing the same.

Unlike aggregated_usage_cost (which filters Usage directly by execution_id), this property must first resolve file_execution_ids and then fan out to PageUsage, because there is no direct execution_id column on PageUsage (only run_id, which maps to a file execution ID). The fan-out is structurally inherent to the data model.

Mitigation options to consider:

  1. Add a DB index on PageUsage.run_id to at least make each aggregate query fast.
  2. Batch-load in the serializer: override the list view's queryset to prefetch/annotate page totals per execution, bypassing the per-object property for list responses.
  3. Accept the cost if the list endpoint is paginated tightly (e.g., page size ≤ 10) and usage is low, but make this explicit.

Run the following to confirm there's no existing index on PageUsage.run_id:

#!/bin/bash
# Confirm PageUsage model's Meta.indexes and any migrations that add an index on run_id
rg -n "run_id" --type py -C3 backend/account_usage/models.py

# Also check migrations for any index on page_usage.run_id
rg -n "page_usage" --type py -g "**/migrations/**" -C2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/workflow_manager/workflow_v2/models/execution.py` around lines 261 -
277, The aggregated_total_pages_processed property issues per-object DB queries
(file_executions.values_list(...) and
UsageHelper.get_aggregated_pages_processed(...)) causing an N+1 when used in
ExecutionSerializer list endpoints; fix by either adding a DB index on
PageUsage.run_id to speed each aggregate query and adding a migration for that
index, or (preferred) batch the totals in the list view/serializer by computing
aggregated page totals for all execution IDs in one query and attaching them to
the queryset (override the list view or ExecutionSerializer to accept a
precomputed map keyed by execution id and avoid calling
aggregated_total_pages_processed per instance); reference the property
aggregated_total_pages_processed, the call to file_executions.values_list, and
UsageHelper.get_aggregated_pages_processed when implementing the change.
