
feat(NET-92) Handle missing parquet columns as nulls#59

Merged
define-null merged 6 commits into master from defnull/net-92-add-default-null-columns-support on Mar 16, 2026

@define-null define-null commented Mar 10, 2026

Contributes: https://linear.app/sqd-ai/issue/NET-92/correctly-propagate-errors-from-the-query-engine#comment-fb1dc348

What is this PR about?

Graceful handling of missing columns in parquet files. When a query requests a field that doesn't exist in the underlying parquet data, the system now returns null values instead of failing.

How does it work?

  • Scan builder accepts a list of default-null columns via with_default_null_columns(). When a projected column is missing from the parquet file, it is injected as a NullArray into the resulting RecordBatch.
  • Table gains a set_nullable() method for declaring which columns may be absent. Currently all field columns are marked nullable via columns() generated by the item_field_selection! macro.
  • ChunkWithDefaults wrapper implements Chunk and transparently attaches default-null column info to every scan_table() call — covering both direct scans and relation lookups.
  • Based on discussion with @tmcgroul and @kalabukdima, authorization_list is excluded from the nullable column list for now.
  • Added optional tracing support for the query crate.
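The scan-side behavior described above can be sketched roughly as follows. This is a simplified model, not the code from this PR: `Column`, `Batch`, `ScanError`, `with_projection()` and `execute()` are hypothetical stand-ins built on plain std types in place of Arrow's `RecordBatch` and `NullArray`. Only the name `with_default_null_columns()` comes from the description.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical stand-in for an Arrow array: real values or an all-null column.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Values(Vec<i64>),
    Null { len: usize },
}

/// Hypothetical stand-in for a RecordBatch: column name -> column data.
type Batch = BTreeMap<String, Column>;

#[derive(Debug, PartialEq)]
enum ScanError {
    MissingColumn(String),
}

/// Simplified scan builder (assumption: the real builder has more options).
struct Scan<'a> {
    file: &'a Batch,
    num_rows: usize,
    projection: Vec<String>,
    default_null_columns: HashSet<String>,
}

impl<'a> Scan<'a> {
    fn new(file: &'a Batch, num_rows: usize) -> Self {
        Scan { file, num_rows, projection: Vec::new(), default_null_columns: HashSet::new() }
    }

    fn with_projection(mut self, cols: &[&str]) -> Self {
        self.projection = cols.iter().map(|c| c.to_string()).collect();
        self
    }

    /// Declare columns that may be absent from the file.
    fn with_default_null_columns(mut self, cols: &[&str]) -> Self {
        self.default_null_columns = cols.iter().map(|c| c.to_string()).collect();
        self
    }

    /// Read the projected columns, injecting an all-null column for every
    /// missing column that was declared as default-null.
    fn execute(&self) -> Result<Batch, ScanError> {
        let mut out = Batch::new();
        for name in &self.projection {
            match self.file.get(name) {
                Some(col) => {
                    out.insert(name.clone(), col.clone());
                }
                None if self.default_null_columns.contains(name) => {
                    out.insert(name.clone(), Column::Null { len: self.num_rows });
                }
                None => return Err(ScanError::MissingColumn(name.clone())),
            }
        }
        Ok(out)
    }
}
```

In this model, a missing column that was not declared via `with_default_null_columns()` still fails the scan, so the opt-in stays explicit.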

Limitations

There is no reliable source of schema information today. The schema may vary within a single dataset and is commonly different across datasets of the same kind. As a result, we cannot precisely declare which columns are truly nullable — instead, all field columns are currently marked as such. This is a temporary mitigation as discussed with @kalabukdima: queries will return nulls for missing columns rather than error, but the proper fix requires a well-defined schema source.

Update: As discussed with @kalabukdima and @tmcgroul, the nullable-columns feature will be activated once per-dataset schema information is available (it is currently disabled). For now, the missing-column error is converted to a distinct error so that the worker can return a bad request.
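One way to picture the "distinct error" idea is a dedicated error variant that the caller can match on and map to an HTTP status. The sketch below is an assumption for illustration only: `QueryError`, its variants, the message wording and `http_status()` are hypothetical names, not the types used in this PR or in the worker.

```rust
use std::fmt;

/// Hypothetical error type for the query engine.
#[derive(Debug, PartialEq)]
enum QueryError {
    /// The query referenced a column that the parquet file does not contain.
    MissingColumn { table: String, column: String },
    /// Any other failure while executing the query.
    Internal(String),
}

impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QueryError::MissingColumn { table, column } => {
                write!(f, "column '{column}' is not found in table '{table}'")
            }
            QueryError::Internal(msg) => write!(f, "internal error: {msg}"),
        }
    }
}

impl std::error::Error for QueryError {}

/// Map an engine error to an HTTP status code (assumed worker-side policy):
/// a missing column is the client's fault, everything else is a server error.
fn http_status(err: &QueryError) -> u16 {
    match err {
        QueryError::MissingColumn { .. } => 400,
        QueryError::Internal(_) => 500,
    }
}
```

Keeping the missing-column case as its own variant, rather than a string inside a generic error, lets the caller decide on the status code without parsing messages.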

**`TableReader` trait** (`reader.rs`) — Added `default_null_columns: Option<&HashSet<Name>>` parameter to `read()`.

**`Scan`** (`scan.rs`) — Added `default_null_columns` field and `with_default_null_columns()` builder method. Passes it through to `reader.read()`.

**`ParquetFile::read()`** (`parquet/file.rs`) — In Stage 3, columns in `default_null_columns` that are missing from the parquet schema are skipped instead of erroring. After reading (Stage 4), `NullArray` columns are injected for them into every record batch. This handles both projection and predicate columns — predicates will see NullArrays and naturally evaluate to false/null for comparisons.
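The remark about predicates can be illustrated with a small model of null-aware comparison. This is a sketch under the assumption of SQL-style three-valued logic, where comparing a null yields null and a filter keeps only rows whose predicate is definitely true. `eq_nullable()` and `filter_rows()` are hypothetical helpers on plain `Option` values, not the Arrow kernels the query engine uses.

```rust
/// Compare a nullable value against a literal.
/// A null input produces a null (None) result rather than false.
fn eq_nullable(value: Option<i64>, literal: i64) -> Option<bool> {
    value.map(|v| v == literal)
}

/// Return the indices of rows that pass the predicate `column == literal`.
/// Rows where the predicate is null are dropped, the same as rows where it is false.
fn filter_rows(column: &[Option<i64>], literal: i64) -> Vec<usize> {
    column
        .iter()
        .enumerate()
        .filter(|(_, v)| eq_nullable(**v, literal) == Some(true))
        .map(|(i, _)| i)
        .collect()
}
```

Under these assumptions, a predicate over an injected all-null column selects no rows, so a filter on a missing column yields an empty result instead of an error.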

**`SnapshotTableReader::read()`** (`storage/reader.rs`) — Accepts the new parameter (unused for now since storage tables are expected to always have all columns).

**`execute_output`** (`plan.rs`) — Simplified to use `scan.with_default_null_columns()` instead of manual missing-column detection and null-array injection.
…upport default-null columns automatically across all phases
@define-null define-null changed the title from "feat(NET-92) add default null columns support" to "feat(NET-92) Handle missing parquet columns as nulls" on Mar 10, 2026
@define-null define-null requested a review from kalabukdima March 10, 2026 15:58
@define-null define-null requested a review from kalabukdima March 16, 2026 09:33
@define-null define-null merged commit ad62dc2 into master Mar 16, 2026
1 check passed
define-null added a commit to subsquid/worker-rs that referenced this pull request Mar 16, 2026
…encies to use nullable columnds (#53)

Contributes:
https://linear.app/sqd-ai/issue/NET-92/correctly-propagate-errors-from-the-query-engine#comment-fb1dc348

**What is this PR about?**

- ~Graceful handling of missing columns in parquet files. When a query
requests a field that doesn't exist in the underlying parquet data, the
system now returns null values instead of failing.~ Main work is done
here: subsquid/data#59

- Missing chunk tables are now treated as BadRequest, as per
@kalabukdima's request
- Missing columns are now treated as BadRequest