Skip to content

[FEATURE] Add Union command in PPL #5240

Open
srikanthpadakanti wants to merge 4 commits intoopensearch-project:mainfrom
srikanthpadakanti:feature/union-public
Open

[FEATURE] Add Union command in PPL #5240
srikanthpadakanti wants to merge 4 commits intoopensearch-project:mainfrom
srikanthpadakanti:feature/union-public

Conversation

@srikanthpadakanti
Copy link
Copy Markdown
Contributor

Description

Add union command to PPL that implements SQL-style UNION ALL semantics with Calcite-based type coercion. The command supports combining multiple datasets (indices, patterns, aliases, or subsearches) with automatic schema merging and missing fields are filled with NULL, compatible types are coerced to a common supertype (e.g., int+float --> float), and incompatible types fall back to string. Works both as a first command and mid-pipeline, where the upstream result set is implicitly included as the first dataset.

Related Issues

Resolves #5110
#5110

Check List

  • [ X ] New functionality includes testing.
  • [ X ] New functionality has been documented.
  • [ X ] New functionality has javadoc added.
  • [ X ] New functionality has a user manual doc added.
  • [ X ] New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • [ X ] Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 17, 2026

PR Reviewer Guide 🔍

(Review updated until commit e03a298)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: PPL Union AST parsing, anonymization, and grammar support

Relevant files:

  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • core/src/main/java/org/opensearch/sql/ast/tree/Union.java
  • core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/NewAddedCommandsIT.java

Sub-PR theme: Calcite Union execution, schema unification, and integration tests

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLUnionTest.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteUnionCommandIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/explain_union.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_union.yaml

Sub-PR theme: Union command documentation

Relevant files:

  • docs/user/ppl/cmd/union.md
  • docs/user/ppl/index.md
  • docs/category.json

⚡ Recommended focus areas for review

Case-Sensitive Field Matching

The buildUnifiedSchemaForUnion and buildProjectionForUnion methods use field names as keys in a HashMap without case normalization. If two datasets have fields with the same name but different casing (e.g., "Age" vs "age"), they will be treated as distinct fields rather than being unified, potentially producing unexpected schema results.

private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
  List<SchemaField> schema = new ArrayList<>();
  Map<String, RelDataType> seenFields = new HashMap<>();

  for (RelNode node : nodes) {
    for (RelDataTypeField field : node.getRowType().getFieldList()) {
      if (!seenFields.containsKey(field.getName())) {
        schema.add(new SchemaField(field.getName(), field.getType()));
        seenFields.put(field.getName(), field.getType());
      }
    }
  }
  return schema;
}

private static List<RexNode> buildProjectionForUnion(
    RelNode node, List<SchemaField> unifiedSchema, CalcitePlanContext context) {
  Map<String, RelDataTypeField> nodeFieldMap =
      node.getRowType().getFieldList().stream()
          .collect(Collectors.toMap(RelDataTypeField::getName, field -> field));

  List<RexNode> projection = new ArrayList<>();
  for (SchemaField schemaField : unifiedSchema) {
    RelDataTypeField nodeField = nodeFieldMap.get(schemaField.getName());

    if (nodeField != null) {
      RexNode fieldRef = context.rexBuilder.makeInputRef(node, nodeField.getIndex());
      if (!nodeField.getType().equals(schemaField.getType())) {
        projection.add(context.rexBuilder.makeCast(schemaField.getType(), fieldRef));
      } else {
        projection.add(fieldRef);
      }
    } else {
      projection.add(context.rexBuilder.makeNullLiteral(schemaField.getType()));
    }
  }
  return projection;
DECIMAL Rank Ordering

In getNumericTypeRankForUnion, DECIMAL is ranked 5 (below REAL=6, FLOAT=7, DOUBLE=8). This means DECIMAL+DOUBLE would coerce to DOUBLE, potentially losing precision for high-precision decimal values. DECIMAL is a fixed-precision type and may not be safely widened to floating-point types without precision loss.

private static int getNumericTypeRankForUnion(SqlTypeName typeName) {
  return switch (typeName) {
    case TINYINT -> 1;
    case SMALLINT -> 2;
    case INTEGER -> 3;
    case BIGINT -> 4;
    case DECIMAL -> 5;
    case REAL -> 6;
    case FLOAT -> 7;
    case DOUBLE -> 8;
    default -> 0;
  };
}
Regex Keyword List

The anonymizeSubsearchCommand method builds a keyword exclusion regex that includes the commandName parameter (e.g., "union" or "multisearch"). However, the regex pattern is constructed by string concatenation without escaping, so if a command name contains regex special characters, it could break the pattern. Additionally, the keyword list may be incomplete for new commands added in the future.

String keywords =
    "source|fields|where|stats|head|tail|sort|eval|rename|"
        + commandName
        + "|search|table|identifier|\\*\\*\\*";
List<String> anonymizedSubsearches = new ArrayList<>();

for (UnresolvedPlan subsearch : subsearches) {
  String anonymizedSubsearch = anonymizeData(subsearch);
  anonymizedSubsearch = "search " + anonymizedSubsearch;
  anonymizedSubsearch =
      anonymizedSubsearch
          .replaceAll("\\bsource=\\w+", "source=table")
          .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*[<>=!])", "identifier")
          .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*,)", "identifier")
          .replaceAll("fields \\+\\s*\\b(?!" + keywords + ")\\w+", "fields + identifier")
          .replaceAll(
              "fields \\+\\s*identifier,\\s*\\b(?!" + keywords + ")\\w+",
              "fields + identifier,identifier");
  anonymizedSubsearches.add(StringUtils.format("[%s]", anonymizedSubsearch));
}

return StringUtils.format("| %s %s", commandName, String.join(" ", anonymizedSubsearches));
Single Dataset Validation

The check if (inputNodes.size() < 2) in visitUnion happens after building all RelNodes. If a single dataset is provided as a first command (no upstream), the error is thrown at planning time rather than parse time. The comment in AstBuilder.java says "Allow 1+ here; total count (including implicit upstream) validated during planning" — but the error message says "Union command requires at least two datasets. Provided: X" which may be confusing when X=1 because the upstream was empty/null rather than missing.

if (inputNodes.size() < 2) {
  throw new IllegalArgumentException(
      "Union command requires at least two datasets. Provided: " + inputNodes.size());
}
Temporal Coercion Logic

In getWiderTemporalTypeForUnion, TIMESTAMP takes precedence over TIMESTAMP_WITH_LOCAL_TIME_ZONE. This means DATE+TIMESTAMP_WITH_LOCAL_TIME_ZONE coerces to TIMESTAMP_WITH_LOCAL_TIME_ZONE, but TIMESTAMP+TIMESTAMP_WITH_LOCAL_TIME_ZONE coerces to TIMESTAMP (losing timezone info). The ordering may cause unexpected loss of timezone information.

private static SqlTypeName getWiderTemporalTypeForUnion(SqlTypeName type1, SqlTypeName type2) {
  if (type1 == SqlTypeName.TIMESTAMP || type2 == SqlTypeName.TIMESTAMP) {
    return SqlTypeName.TIMESTAMP;
  }
  if (type1 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE
      || type2 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE) {
    return SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE;
  }
  return SqlTypeName.DATE;
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 17, 2026

PR Code Suggestions ✨

Latest suggestions up to e03a298

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Include all command names in anonymizer keyword list

The keywords regex negative lookahead is built by concatenating commandName directly
into the regex pattern. If commandName contains regex special characters, this could
break the pattern. Additionally, when commandName is "union", the keyword list will
include "union" but not "multisearch", and vice versa. This means field names that
happen to match the other command name could be incorrectly anonymized. Both command
names should always be included in the keywords list.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "multisearch|union"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that when commandName is "union", the keyword "multisearch" is excluded from the negative lookahead, and vice versa. This could cause field names matching the other command name to be incorrectly anonymized. Including both command names always is a valid correctness fix.

Low
Simplify constructor annotations for clarity

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only datasets (the final
field), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor will only include
datasets. This means there are two constructors: Union(List) and Union(List,
Integer). The attach method uses new Union(newDatasets, maxout) which calls the
all-args constructor. This is fine, but having both annotations is redundant since
@AllArgsConstructor already covers the single-field case when maxout is null.
Consider removing @RequiredArgsConstructor to avoid confusion, or make maxout final
with a default.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 3

__

Why: Having both @RequiredArgsConstructor and @AllArgsConstructor is redundant but not incorrect since they generate different constructors. The suggestion to consolidate is a minor code quality improvement, but the current code works correctly as-is.

Low
Preserve insertion order in unified schema map

The unified schema is built using only the first occurrence of each field name,
ignoring type differences across inputs. After coerceUnionTypes, all inputs should
have the same type for shared fields, but the schema should reflect the coerced
(target) type rather than the first-seen type. If coercion didn't fully align types,
the projection in buildProjectionForUnion will cast, but the schema type used for
NULL literals for missing fields may be wrong. The schema should be derived from the
coerced inputs' field types to ensure correctness.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        seenFields.put(name, field.getType());
+        schema.add(new SchemaField(name, field.getType()));
       }
+      // After coercion, all nodes should agree on the type; no update needed
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 2

__

Why: The improved_code is functionally identical to the existing_code - it still uses HashMap in the signature description but the actual logic change (using LinkedHashMap) is minor and the comment "After coercion, all nodes should agree on the type; no update needed" doesn't add meaningful behavior change. The suggestion's concern about type correctness is addressed by the prior coerceUnionTypes step, making this a very low-impact change.

Low
Possible issue
Avoid nested build inside push argument

After pushing the LogicalSystemLimit node onto the builder,
context.relBuilder.peek() is called to return the top node. However,
LogicalSystemLimit was pushed directly via context.relBuilder.push(...) and then
context.relBuilder.build() was called inside the push argument, which pops the union
node. The LogicalSystemLimit is pushed but peek() should correctly return it.
However, if maxout is null, context.relBuilder.peek() is called after union(...)
which is correct. Verify that after the union(true, size) call, the builder's stack
has exactly one node before peek() is called, since multiple push calls were made
for each input.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2634-2642]

 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential issue with calling context.relBuilder.build() inside the argument to context.relBuilder.push(), which could cause ordering/evaluation issues. Extracting build() to a separate variable before passing it to push() is cleaner and avoids any side-effect ordering concerns.

Low

Previous suggestions

Suggestions up to commit f64bacb
CategorySuggestion                                                                                                                                    Impact
General
Fix regex keyword list to include all command names

The keywords regex is built by concatenating commandName directly into the
alternation pattern. If commandName contains regex special characters, this could
break the pattern. Additionally, when commandName is "union", the keyword list won't
include "multisearch", and vice versa, potentially causing incorrect anonymization
of the other command name if it appears in a subsearch. Consider using a fixed,
comprehensive keyword list instead.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
-          + "|search|table|identifier|\\*\\*\\*";
+          + "multisearch|union|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the dynamic keywords string won't include "multisearch" when processing a "union" command and vice versa, which could cause incorrect anonymization. Using a fixed comprehensive keyword list is a valid improvement for correctness and maintainability.

Low
Isolate context per union dataset branch

Each dataset is visited using context, which is a shared mutable object. If
EmptySourcePropagateVisitor or the visitor modifies shared state in context (such as
relBuilder stack or other fields), processing one dataset may corrupt the state for
subsequent datasets. Consider creating a fresh or isolated context for each dataset
to avoid cross-contamination between union branches.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2615-2619]

 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
-  prunedDataset.accept(this, context);
-  inputNodes.add(context.relBuilder.build());
+  CalcitePlanContext datasetContext = context.createSubContext();
+  prunedDataset.accept(this, datasetContext);
+  inputNodes.add(datasetContext.relBuilder.build());
 }
Suggestion importance[1-10]: 4

__

Why: The concern about shared mutable context is valid in principle, but the improved_code references a context.createSubContext() method that may not exist in the codebase. The existing code uses context.relBuilder.build() to pop the built node, which is the standard pattern used elsewhere in the visitor, so the risk may be lower than suggested.

Low
Prevent accidental null maxout in constructors

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only final fields (i.e.,
datasets), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor produces a constructor
that doesn't set maxout, leaving it always null when using that constructor. This is
intentional for the no-maxout case, but the attach method always uses the two-arg
constructor, which is correct. Verify that the single-arg constructor usage is
intentional and won't silently ignore a maxout value set elsewhere.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 3

__

Why: The suggestion to make maxout final and use a single canonical constructor is a reasonable design improvement, but the current dual-annotation approach (@RequiredArgsConstructor + @AllArgsConstructor) is intentional and functional. The attach method correctly uses the two-arg constructor, so there's no actual bug here.

Low
Possible issue
Ensure unified schema uses coerced types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the field types in subsequent nodes may have
been coerced to a common type. The schema should reflect the coerced (common) type,
not just the first-seen type. This can cause type mismatches when projecting fields
from nodes where the coerced type differs from the first-seen type, leading to
incorrect CASTs or runtime errors.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type already.
+      // No update needed since coerceUnionTypes already aligned types.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims a type mismatch issue, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. The buildUnifiedSchemaForUnion method correctly picks the first-seen type, which after coercion is the common type. The 'improved_code' is functionally identical to the 'existing_code' with only a comment added, making this a low-impact suggestion.

Low
Suggestions up to commit 811fc25
CategorySuggestion                                                                                                                                    Impact
General
Use complete keyword set in anonymizer regex

The keywords regex is built by concatenating commandName directly into the
alternation pattern. If commandName contains regex special characters (unlikely for
current values like "union" or "multisearch", but possible for future commands),
this could break the regex. Additionally, when visitMultisearch calls this method,
the keyword list will include "multisearch" but not "union", and vice versa —
meaning the other command name could be incorrectly anonymized as an identifier if
it appears as a field name. Consider using a fixed set of all known command keywords
rather than dynamically inserting only the current command name.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
-          + "|search|table|identifier|\\*\\*\\*";
+          + "multisearch|union|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid — using a fixed complete set of command keywords prevents cross-command anonymization issues and avoids potential regex injection. The improved code correctly includes both multisearch and union in the keyword list regardless of which command is being processed.

Low
Make maxout field immutable for consistency

Using both @RequiredArgsConstructor and @AllArgsConstructor together with a mix of
final and non-final fields creates two constructors: one with only datasets
(required) and one with both datasets and maxout. However, maxout is non-final and
mutable, which can lead to inconsistent state if the field is modified after
construction. Consider making maxout final and using only @AllArgsConstructor, or
using a builder pattern to avoid ambiguity.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
-
Suggestion importance[1-10]: 4

__

Why: Making maxout final and removing @RequiredArgsConstructor in favor of only @AllArgsConstructor is a valid immutability improvement. The current design with both annotations and a non-final field is somewhat inconsistent, though it works functionally since attach() creates a new Union instance.

Low
Ensure unified schema uses coerced common types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the field types in subsequent nodes may have
been coerced to a common supertype. The schema should reflect the coerced (common)
type, not just the first-seen type. This can cause type mismatches when projecting
fields in buildProjectionForUnion, leading to unnecessary or incorrect CASTs.

Consider building the unified schema by looking up the target type from the coercion
map (or by comparing all nodes' types for each field) rather than using the
first-seen type.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type;
+      // the first-seen type is the coerced common type.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion claims a bug where the first-seen type is used instead of the coerced type, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. So the first-seen type IS the coerced type. The improved_code is functionally identical to the existing_code, making this suggestion incorrect.

Low
Possible issue
Fix inconsistent relBuilder state after union visit

After calling context.relBuilder.build() inside the if (node.getMaxout() != null)
block, the built node is passed to LogicalSystemLimit.create but then pushed back
via context.relBuilder.push(...). However, context.relBuilder.peek() is called at
the end regardless of whether maxout was applied. If maxout is null,
context.relBuilder.peek() returns the union node, which is correct. But if maxout is
not null, the LogicalSystemLimit node is pushed but never built — peek() returns it
correctly only if push was called. This is fine, but the final return
context.relBuilder.peek() is inconsistent with the pattern used elsewhere (which
typically calls build()). More critically, callers of visitUnion in visitUnion use
context.relBuilder.build() immediately after, so the peek/build mismatch could cause
double-build issues.
Ensure the method consistently returns the top node without
leaving the builder in an ambiguous state, by returning context.relBuilder.build()
and having the caller re-push if needed, or by aligning with the existing visitor
pattern.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2629-2642]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
           context.relBuilder.build(),
           context.relBuilder.literal(node.getMaxout())));
 }
 
-return context.relBuilder.peek();
+return context.relBuilder.build();
Suggestion importance[1-10]: 5

__

Why: The suggestion to use context.relBuilder.build() instead of peek() is a valid concern about consistency with the visitor pattern. Looking at the caller in visitUnion which calls context.relBuilder.build() after prunedDataset.accept(this, context), using peek() vs build() could cause issues if the caller tries to build again. This is a moderate correctness concern.

Low
Suggestions up to commit b6879f9
CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure builder stack is consistently cleaned up

When maxout is null, context.relBuilder.build() is never called after the union, so
the union result remains on the builder stack. Then context.relBuilder.peek()
returns the top of the stack correctly, but the caller also calls
context.relBuilder.build() (as seen in the loop:
inputNodes.add(context.relBuilder.build())). This inconsistency may leave stale
nodes on the builder stack for subsequent operations. The method should consistently
return the built node and ensure the builder stack is clean.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2629-2642]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
           context.relBuilder.build(),
           context.relBuilder.literal(node.getMaxout())));
 }
 
-return context.relBuilder.peek();
+return context.relBuilder.build();
Suggestion importance[1-10]: 7

__

Why: The suggestion raises a valid concern about stack consistency — when maxout is null, context.relBuilder.peek() leaves the union node on the stack, while the maxout branch calls build() which pops it. Using build() consistently at the end ensures the builder stack is always clean after visitUnion returns, preventing potential stale state issues in subsequent operations.

Medium
Use coerced field types in unified schema

The unified schema is built using only the first occurrence of each field name, but
after coerceUnionTypes runs, the field types may have been updated to a common
supertype. The schema should reflect the coerced type (from the target type map),
not just the first node's type. Otherwise, the subsequent projection in
buildProjectionForUnion may cast to the wrong type.
Consider rebuilding the unified
schema after coercion by using the coerced input nodes' types, which already reflect
the common supertype, ensuring the schema accurately represents the unified coerced
types.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
-      }
+      // After coercion, all nodes sharing a field name have the same type;
+      // use the first occurrence (which is already the coerced type).
+      seenFields.putIfAbsent(field.getName(), field.getType());
     }
   }
+  seenFields.forEach((name, type) -> schema.add(new SchemaField(name, type)));
   return schema;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that buildUnifiedSchemaForUnion uses the first occurrence of a field's type, which after coerceUnionTypes should already be the coerced type. The improved code uses LinkedHashMap with putIfAbsent which is functionally equivalent to the existing logic but more idiomatic. The actual concern about type mismatch is valid but minor since coerceUnionTypes runs first and updates the node types before this method is called.

Low
General
Ensure Java version compatibility for list access

The coerceUnionTypes method builds fieldTypeMap by iterating all inputs, but only
adds a type if typeName != null. If a field appears in only one input (missing from
others), its entry in fieldTypeMap will have only one type, and commonType will be
that single type. However, the missing fields in other inputs are not accounted for
here — they will be handled by unifySchemasForUnion with NULL fills. This is correct
behavior, but the targetTypeMap will still contain an entry for that field,
potentially causing unnecessary casts for inputs that do have the field. This is a
minor correctness concern but not a bug. A more critical issue: types.getFirst()
(Java 21+) may not be available in all build environments; verify compatibility.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [249-252]

-SqlTypeName commonType = types.getFirst();
+SqlTypeName commonType = types.get(0);
 for (int i = 1; i < types.size(); i++) {
   commonType = findCommonTypeForUnion(commonType, types.get(i));
 }
Suggestion importance[1-10]: 4

__

Why: types.getFirst() is a Java 21+ API that may not be available in all build environments. Replacing it with types.get(0) is a safe, backward-compatible change. The suggestion is valid but the impact depends on the project's minimum Java version requirement.

Low
Fix misleading error message for empty datasets

The error message says "Provided: " + datasets.size(), but at this point datasets is
always empty (size 0) when the exception is thrown, making the message always say
"Provided: 0". This is misleading but minor. More importantly, the comment says
"Allow 1+ here; total count (including implicit upstream) validated during
planning", but the visitUnion in CalciteRelNodeVisitor throws when inputNodes.size()
< 2 — if the union is used mid-pipeline with one explicit dataset, the upstream is
attached via attach(), making datasets size 1. Ensure the planning-time validation
correctly accounts for the implicit upstream to avoid false errors.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1355-1369]

 for (OpenSearchPPLParser.UnionDatasetContext datasetCtx : ctx.unionDataset()) {
   if (datasetCtx.subSearch() != null) {
     datasets.add(visitSubSearch(datasetCtx.subSearch()));
   } else if (datasetCtx.tableSource() != null) {
     datasets.add(
         new Relation(
             Collections.singletonList(internalVisitExpression(datasetCtx.tableSource()))));
   }
 }
 
 // Allow 1+ here; total count (including implicit upstream) validated during planning
 if (datasets.isEmpty()) {
   throw new SyntaxCheckException(
-      "Union command requires at least one dataset. Provided: " + datasets.size());
+      "Union command requires at least one dataset. Provided: 0");
 }
Suggestion importance[1-10]: 2

__

Why: The error message issue is trivially minor — the size will always be 0 when thrown, so hardcoding "0" vs datasets.size() makes no practical difference. The improved_code is nearly identical to existing_code with only a cosmetic string change, offering negligible improvement.

Low
Suggestions up to commit 23f420d
CategorySuggestion                                                                                                                                    Impact
Possible issue
Avoid consuming builder stack inside push argument

After applying LogicalSystemLimit, the node is pushed onto the builder stack but
context.relBuilder.build() was already called inside the push(...) call, leaving the
limit node on the stack. However, when maxout is null, the union result was already
consumed by context.relBuilder.build() inside the limit block's inner call, but the
outer peek() would still work. The issue is that when maxout is not null, the
LogicalSystemLimit is pushed but never built/returned — peek() returns it correctly,
but the caller may call build() again. This is consistent with the non-maxout path,
so it's fine, but the code should be verified that context.relBuilder.build() inside
the push doesn't leave the stack in an inconsistent state when maxout is null vs
non-null.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2634-2642]

 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that calling context.relBuilder.build() inside the argument to push() could cause stack ordering issues. Extracting the build() call to a separate variable before passing it to LogicalSystemLimit.create() is a cleaner and safer pattern that avoids potential side effects on the builder stack.

Low
General
Include all command names in anonymizer keyword exclusion list

The keywords regex is built by concatenating commandName directly into the negative
lookahead pattern. If commandName contains regex special characters, this could
break the regex. Additionally, when commandName is "union", the keyword list does
not include "multisearch", and vice versa — meaning the other command name could be
incorrectly anonymized as an identifier. Both command names should always be
included in the keywords list.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "multisearch|union"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that when commandName is "union", "multisearch" is missing from the keyword exclusion list and vice versa, which could cause command names to be incorrectly anonymized as identifiers. Hardcoding both command names ensures consistent behavior regardless of which command is being anonymized.

Low
Remove ambiguous constructor annotation combination

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only final fields (i.e.,
datasets), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor produces a constructor
Union(List datasets) which conflicts with the intent. If maxout should always be set
via the two-arg constructor, consider making maxout final or removing
@RequiredArgsConstructor to avoid ambiguity.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
 
   private Integer maxout;
 
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
+
Suggestion importance[1-10]: 4

__

Why: The combination of @RequiredArgsConstructor and @AllArgsConstructor is intentional here — @RequiredArgsConstructor generates Union(List<UnresolvedPlan> datasets) and @AllArgsConstructor generates Union(List<UnresolvedPlan> datasets, Integer maxout), which are both used in the codebase. The suggestion to replace them with an explicit constructor is valid for clarity but the existing code is functionally correct.

Low
Unified schema uses post-coercion types correctly

The unified schema is built using only the type from the first node that introduces
a field, but after coerceUnionTypes runs, subsequent nodes may have had their field
types changed to a common supertype. The schema should reflect the coerced (common)
type, not just the first-seen type. This can cause type mismatches when projecting
fields from later nodes that were coerced to a different type than what the first
node had.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type already
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims a type mismatch issue, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. The buildUnifiedSchemaForUnion method correctly picks the first-seen type, which after coercion is the common supertype. The improved code is functionally identical to the existing code, making this suggestion of minimal value.

Low
Suggestions up to commit 9ec4314
CategorySuggestion                                                                                                                                    Impact
General
Include all command names in anonymizer keyword exclusion list

The keywords regex is built by concatenating commandName directly into a regex
alternation. If commandName contains regex special characters (unlikely for current
values like "union" or "multisearch", but possible for future commands), this could
break the regex. Additionally, the keyword list does not include union when
anonymizing multisearch commands and vice versa, which could cause command names to
be incorrectly replaced with identifier. Both command names should always be in the
exclusion list.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [790-794]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "union|multisearch"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: This is a valid bug — when anonymizing a union command, the keyword multisearch is not in the exclusion list (and vice versa), which could cause command names appearing in subsearch content to be incorrectly replaced with identifier. The fix is straightforward and correct, hardcoding both command names instead of dynamically inserting only the current one.

Low
Validate dataset count before processing inputs

When union is used mid-pipeline (e.g., search source=X | union [...]), the upstream
plan is attached via attach() and becomes part of node.getDatasets(). However, if
only one explicit dataset is provided in the grammar (which is allowed by the
parser), inputNodes will have size 1 after the loop, and the error is thrown after
already consuming the relBuilder state. The size check should happen before
processing datasets, using the total expected count (datasets + possible implicit
upstream), or the grammar should enforce at least 2 datasets at parse time.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2472-2481]

+// Validate before processing to avoid partial relBuilder state corruption
+// When used mid-pipeline, attach() prepends the upstream, so datasets.size() >= 1 is valid
+// but we still need at least 2 total inputs for a union
+if (node.getDatasets().size() < 2) {
+  throw new IllegalArgumentException(
+      "Union command requires at least two datasets. Provided: " + node.getDatasets().size());
+}
+
 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
   prunedDataset.accept(this, context);
   inputNodes.add(context.relBuilder.build());
 }
 
-if (inputNodes.size() < 2) {
-  throw new IllegalArgumentException(
-      "Union command requires at least two datasets. Provided: " + inputNodes.size());
-}
-
Suggestion importance[1-10]: 4

__

Why: Moving the validation before processing avoids partial relBuilder state corruption, which is a valid concern. However, the mid-pipeline case (where attach() prepends the upstream) means node.getDatasets() already contains all datasets including the implicit upstream, so the check at inputNodes.size() < 2 after the loop is functionally equivalent. The suggestion has merit but the impact is limited.

Low
Remove redundant constructor annotation

The class has both @RequiredArgsConstructor and @AllArgsConstructor. Since datasets
is final and maxout is not, @RequiredArgsConstructor generates a constructor with
only datasets, while @AllArgsConstructor generates one with both. However, the
attach method calls new Union(newDatasets, maxout) which uses the all-args
constructor. If maxout is null (the common case), the single-arg constructor new
Union(datasets) would be used elsewhere, but having both annotations is redundant
and can cause confusion. Consider removing @RequiredArgsConstructor since
@AllArgsConstructor covers all cases, and use new Union(newDatasets, maxout)
consistently.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
 
   private Integer maxout;
Suggestion importance[1-10]: 3

__

Why: Having both @RequiredArgsConstructor and @AllArgsConstructor is indeed redundant since @RequiredArgsConstructor generates a constructor with only datasets (the final field) while @AllArgsConstructor covers both fields. Removing @RequiredArgsConstructor simplifies the code, but this is a minor style improvement with low functional impact.

Low
Possible issue
Unified schema should reflect widest coerced type

The unified schema uses the type from the first node that introduces a field, but
after coerceUnionTypes runs, subsequent nodes may have had their types widened to a
common supertype. The schema builder should use the coerced (target) type for each
field rather than the first-seen type, otherwise buildProjectionForUnion may emit
unnecessary CASTs or miss type mismatches. Consider deriving the unified schema from
the coerced target type map instead of the first-seen field type.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        seenFields.put(name, field.getType());
+        schema.add(new SchemaField(name, field.getType()));
+      } else {
+        // Update to the widest type seen (post-coercion nodes may have wider types)
+        RelDataType existing = seenFields.get(name);
+        if (!existing.equals(field.getType())) {
+          // Replace with the current node's type if it is wider (coercion already applied)
+          seenFields.put(name, field.getType());
+          schema.set(schema.indexOf(new SchemaField(name, existing)), new SchemaField(name, field.getType()));
+        }
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 4

__

Why: The concern is valid in theory — after coerceUnionTypes, all nodes should have the same type for shared fields, so the first-seen type should already be the coerced type. The suggested fix using indexOf with a new SchemaField object is fragile and may not work correctly if equals isn't defined properly. The issue is real but the proposed solution has its own problems.

Low

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 5fd5e1c

@srikanthpadakanti srikanthpadakanti force-pushed the feature/union-public branch 2 times, most recently from dcbc60f to 12b2eb5 Compare March 17, 2026 16:59
@anasalkouz anasalkouz added PPL Piped processing language feature labels Mar 17, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 60613ec

@mengweieric
Copy link
Copy Markdown
Collaborator

Please take a look at the CI failure.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f05fbdb

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

Please take a look at the CI failure.

Took care of it. Please review. Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 3782336

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 9ec4314

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

Hello @mengweieric Can you please review this.

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

@anasalkouz @mengweieric Can you please review this.

@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Apr 1, 2026

Please also take a look at the merge conflicts

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 23f420d

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit b6879f9

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

srikanthpadakanti commented Apr 2, 2026

Please also take a look at the merge conflicts

@ahkcs Done. The PR was opened earlier and the branch was behind by 24 commits. It has now been updated. Could you please review it before any new conflicts arise?

@srikanthpadakanti srikanthpadakanti requested a review from ahkcs April 2, 2026 03:16
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 811fc25

Srikanth Padakanti added 3 commits April 1, 2026 22:37
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit f64bacb

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit e03a298

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add Union command in PPL

4 participants