Feature/snowflake s3 stage operations #17
Conversation
- Add `query_pandas_from_snowflake_via_s3_stage()` for efficient large query results (>10M rows)
- Add `publish_pandas_via_s3_stage()` for efficient large DataFrame writes (>10M rows)
- Add `make_batch_predictions_from_snowflake_via_s3_stage()` for batch ML predictions
- Support dev/prod environment switching via `current.is_production`
- Add helper functions for S3 operations and SQL generation
- Add `metaflow_s3/utils.py` with S3 utility functions
- Add comprehensive functional tests
- Integrate with existing Metaflow card system and cost tracking
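As a rough illustration of the read/write helpers named above, a hypothetical call might look like the sketch below. Only the function names appear in this PR; the import path, parameter names, and query are assumptions.

```python
# Sketch only: import path, parameter names, and the query are assumed, not taken from the diff.
import pandas as pd

from ds_platform_utils.metaflow import (
    publish_pandas_via_s3_stage,
    query_pandas_from_snowflake_via_s3_stage,
)

# Unload a large (>10M row) result set through the S3-backed stage instead of fetching it directly.
df: pd.DataFrame = query_pandas_from_snowflake_via_s3_stage(
    "SELECT * FROM {schema}.EVENTS"  # {schema} is resolved from current.is_production
)

# Write a large DataFrame back by staging Parquet files in S3 and COPYing them into Snowflake.
publish_pandas_via_s3_stage(df, table="EVENTS_SNAPSHOT")  # keyword name is a placeholder
```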
Pull request overview
This PR implements S3 stage operations for Snowflake data transfer, providing an alternative to direct Snowflake data loading that's more efficient for large datasets. The feature adds support for exporting data from Snowflake to S3 and importing from S3 to Snowflake using Snowflake's external stage functionality.
Changes:
- Added S3 stage configuration constants for dev and prod environments
- Implemented S3 operations module with functions for reading/writing DataFrames via S3
- Extended `publish_pandas` and `query_pandas_from_snowflake` with a `use_s3_stage` parameter (see the sketch after this list)
- Added a new `batch_inference` function for distributed model inference using S3 staging
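A minimal sketch of the extended call pattern, assuming the existing signatures stay the same; only the `use_s3_stage` flag is named in this PR.

```python
# Sketch only: everything except the use_s3_stage flag is assumed.
from ds_platform_utils.metaflow import publish_pandas, query_pandas_from_snowflake

df = query_pandas_from_snowflake(
    "SELECT * FROM {schema}.TRANSACTIONS",
    use_s3_stage=True,  # route the result through the S3 stage for large result sets
)

publish_pandas(
    df,
    table="TRANSACTIONS_ENRICHED",  # keyword name is a placeholder
    use_s3_stage=True,  # unload the DataFrame to S3, then COPY it into Snowflake
)
```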
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 19 comments.
Summary per file:
| File | Description |
|---|---|
| src/ds_platform_utils/metaflow/_consts.py | Adds S3 bucket and Snowflake stage configuration constants for dev/prod environments |
| src/ds_platform_utils/metaflow/s3.py | New module implementing S3 read/write operations for DataFrames using the Metaflow S3 client |
| src/ds_platform_utils/metaflow/pandas.py | Adds S3 stage support to publish_pandas and query_pandas_from_snowflake, with schema inference |
| src/ds_platform_utils/metaflow/batch_inference.py | New function for parallel batch inference using a Snowflake → S3 → model → S3 → Snowflake pipeline |
| tests/functional_tests/metaflow/test__pandas_s3.py | Test flow for S3 stage operations with and without an explicit schema |
```python
    parallelism: int = 1,
    warehouse: Optional[str] = None,
    ctx: Optional[dict] = None,
):
```
Missing docstring. The `batch_inference` function is a public function (no leading underscore) that lacks a docstring explaining its purpose, parameters, return value, and usage. This is inconsistent with other public functions in the codebase, like `publish_pandas` and `query_pandas_from_snowflake`, which have comprehensive docstrings. Add a docstring documenting all parameters and the function's behavior.
Suggested change:

```python
):
    """
    Run a batch inference workflow from Snowflake via S3 and write predictions back to Snowflake.

    The function executes a Snowflake query, exports the result to an S3-backed stage in
    batches, applies the provided ``model_predictor_function`` to each batch, writes the
    predictions back to S3, and then loads the results into a Snowflake table.

    Parameters
    ----------
    input_query:
        Either a SQL query string or a ``pathlib.Path``/string pointing to a file
        containing a SQL query. The query may contain ``{schema}`` and additional
        placeholders that are populated using the current environment and the
        ``ctx`` mapping.
    output_table_name:
        Name of the target Snowflake table that will receive the prediction results.
        The table is created or overwritten in the active schema, depending on the
        behavior implemented by the underlying Snowflake utilities.
    model_predictor_function:
        A callable that takes a Pandas ``DataFrame`` containing a batch of input rows
        from Snowflake and returns a ``DataFrame`` with the corresponding predictions.
        The returned columns must match the schema expected for the output table.
    output_table_schema:
        Optional explicit schema for the output table as a list of ``(column_name,
        snowflake_type)`` tuples. If omitted, the schema may be inferred from the
        ``DataFrame`` returned by ``model_predictor_function``.
    use_utc:
        Whether to use UTC when generating timestamps or time-based metadata used in
        the batch inference process.
    batch_size_in_mb:
        Approximate size, in megabytes, of each batch of data exported from Snowflake
        to S3 and processed by the predictor function.
    parallelism:
        Number of parallel worker threads used to process batches and compute
        predictions.
    warehouse:
        Optional Snowflake warehouse name to use for executing the query and loading
        results. If ``None``, the default warehouse for the environment is used.
    ctx:
        Optional dictionary of additional values to substitute into the SQL query
        template in addition to the ``schema`` placeholder.

    Returns
    -------
    None
        This function performs side effects (Snowflake queries, S3 uploads/downloads,
        and Metaflow card updates) and does not return a value.
    """
```
```python
    ],
)

self.next(self.test_publish_pandas_with_warehouse)
```
The step flow is broken. Line 48 calls `self.test_publish_pandas_with_warehouse`, but this method doesn't exist; the next test method is named `test_publish_pandas_without_schema`. This should be `self.next(self.test_publish_pandas_without_schema)` to match the method defined at line 51.
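A minimal sketch of the corrected transition; the flow name and step bodies are hypothetical, and only the two method names come from the review comment.

```python
# Hypothetical minimal flow: only the self.next(...) target names come from the review comment.
from metaflow import FlowSpec, step


class PandasS3StageTestFlow(FlowSpec):  # class name and step bodies are placeholders
    @step
    def start(self):
        # ... publish a DataFrame with an explicit schema here ...
        # Before: self.next(self.test_publish_pandas_with_warehouse)  # method does not exist
        self.next(self.test_publish_pandas_without_schema)

    @step
    def test_publish_pandas_without_schema(self):
        # ... publish a DataFrame and let the schema be inferred ...
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PandasS3StageTestFlow()
```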
```python
def batch_inference(  # noqa: PLR0913, PLR0915
    input_query: Union[str, Path],
    output_table_name: str,
    model_predictor_function: Callable[[pd.DataFrame], pd.DataFrame],
    output_table_schema: Optional[List[Tuple[str, str]]] = None,
    use_utc: bool = True,
    batch_size_in_mb: int = 128,
    parallelism: int = 1,
    warehouse: Optional[str] = None,
    ctx: Optional[dict] = None,
):
```
New public function not exported in `__init__.py`. The `batch_inference` function is a new public API function (no leading underscore), but it's not added to the `__all__` list in `src/ds_platform_utils/metaflow/__init__.py`. This means users cannot import it using `from ds_platform_utils.metaflow import batch_inference`. Either add it to the exports or mark it as private with a leading underscore if it's intended for internal use only.
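A minimal sketch of the suggested export; the other entries shown in `__all__` are assumptions based on the files listed in this PR.

```python
# src/ds_platform_utils/metaflow/__init__.py -- sketch; existing exports are assumed, not verified.
from ds_platform_utils.metaflow.batch_inference import batch_inference
from ds_platform_utils.metaflow.pandas import publish_pandas, query_pandas_from_snowflake

__all__ = [
    "batch_inference",
    "publish_pandas",
    "query_pandas_from_snowflake",
]
```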
```python
def _generate_snowflake_to_s3_copy_query(
    query: str,
    snowflake_stage_path: str,
    file_name: str = "data.parquet",
```
The `file_name` parameter is defined but never used in the function body. The `COPY INTO` command generates files with its own naming convention and doesn't use this parameter. Either remove this unused parameter or implement it in the `COPY INTO` command if file-naming control is needed.
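If file-naming control is wanted, one possible approach (a sketch, not the module's actual implementation) is to append `file_name` to the stage path and unload a single file with Snowflake's `SINGLE` option; `snowflake_stage_path` is assumed here to exclude the leading `@`.

```python
def _generate_snowflake_to_s3_copy_query(
    query: str,
    snowflake_stage_path: str,
    file_name: str = "data.parquet",
) -> str:
    """Sketch: unload a query result to a single, explicitly named Parquet file on the stage."""
    # SINGLE = TRUE makes Snowflake honor the full file name in the target path instead of
    # appending its own numeric suffixes; MAX_FILE_SIZE may need raising for large unloads.
    return f"""
    COPY INTO @{snowflake_stage_path}/{file_name}
    FROM ({query})
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE
    SINGLE = TRUE
    """
```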