Skip to content

Conversation

@konstantinb
Copy link
Contributor

@konstantinb konstantinb commented Dec 18, 2025

What changes were proposed in this pull request?

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

Why are the changes needed?

These changes prevent severe underestimation of records' statistics, which often lead to query failures on large data sets

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Extensive regression testing in a private fork; new and updated query files in this PR

@sonarqubecloud
Copy link

@deniskuzZ
Copy link
Member

cc @zabetak, @thomasrebele

sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 500000 Data size: 139500000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint), _col2 (type: string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test case on the current master branch and obtained the following result. The main difference is likely the number of rows generated by the ReduceSinkOperator: mine is 3, and yours is 500k. Since the map-side aggregation generates at most 20 keys, I'd say the estimation here should be O(N), where N = 20. Therefore, 3 is likely a more reasonable value to me. I guess I'm overlooking something, and I'd appreciate it if you could validate my assumption.

        Map 1 
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: CASE WHEN (cat BETWEEN 0 AND 4) THEN ('K00') WHEN (cat BETWEEN 5 AND 9) THEN ('K01') WHEN (cat BETWEEN 10 AND 
14) THEN ('K02') WHEN (cat BETWEEN 15 AND 19) THEN ('K03') WHEN (cat BETWEEN 20 AND 24) THEN ('K04') WHEN (cat BETWEEN 25 AND 29) THEN ('K05') 
WHEN (cat BETWEEN 30 AND 34) THEN ('K06') WHEN (cat BETWEEN 35 AND 39) THEN ('K07') WHEN (cat BETWEEN 40 AND 44) THEN ('K08') WHEN (cat BETWEEN
 45 AND 49) THEN ('K09') WHEN (cat BETWEEN 50 AND 54) THEN ('K10') WHEN (cat BETWEEN 55 AND 59) THEN ('K11') WHEN (cat BETWEEN 60 AND 64) THEN 
('K12') WHEN (cat BETWEEN 65 AND 69) THEN ('K13') WHEN (cat BETWEEN 70 AND 74) THEN ('K14') WHEN (cat BETWEEN 75 AND 79) THEN ('K15') WHEN (cat
 BETWEEN 80 AND 84) THEN ('K16') WHEN (cat BETWEEN 85 AND 89) THEN ('K17') WHEN (cat BETWEEN 90 AND 94) THEN ('K18') ELSE ('K19') END (type: st
ring), val (type: bigint), data (type: string)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      aggregations: sum(_col1), max(_col2)
                      keys: _col0 (type: string)
                      minReductionHashAggr: 0.99
                      mode: hash
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: bigint), _col2 (type: string)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okumin thank you very much for your feedback. I got a bit carried away and overlooked so inflated estimation numbers. Trying a fix that calculatesd "honest" NDV of multibranch constant expressions before falling back to the "Pessimistic" combiner

updatedCS.setAvgColLen(Math.max(updatedCS.getAvgColLen(), cs.getAvgColLen()));
updatedCS.setNumNulls(StatsUtils.safeAdd(updatedCS.getNumNulls(), cs.getNumNulls()));
updatedCS.setCountDistint(Math.max(updatedCS.getCountDistint(), cs.getCountDistint()));
if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {
if (updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {

*/
default Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) {
return estimate(argStats);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can satisfy the requirements for the current use case without adding a new method. We may obtain the required information via GenericUDF#initialize if it's been initialized. If not initialized, we will probably need this method. This example materializes a constant at compile-time.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java

}

@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) {
Copy link
Contributor

@okumin okumin Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm clarifying my understanding. Please let me know if I'm overlooking something.
Let's assume the number of distinct values of col_2 is 2, that of col_100 is 100, and that of col_999 is 999.

The true NDV of the following expression is 3. The original implementation returns 1, and this implementation returns 3.

CASE
  WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
  WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
  ELSE 'CODE_ELSE'
END

That of this is 2. The original implementation returns 1, and this implementation returns 2.

CASE
  WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
  WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
  ELSE 'CODE_01'
END

That of this is 100, 101, or 102. The original implementation returns 100, and this implementation returns 100.

CASE
  WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
  WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
  ELSE col_100
END

That of this is 999 ~ 1100. The original implementation returns 999, and this implementation returns 999.

CASE
  WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
  WHEN category BETWEEN 5 AND 9 THEN col_999
  ELSE col_100
END

That of this is 6 ~ 8. The original implementation returns 2, and this implementation returns 2.

CASE
    WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
    WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
    WHEN category BETWEEN 10 AND 14 THEN 'CODE_02'
    WHEN category BETWEEN 15 AND 19 THEN 'CODE_03'
    WHEN category BETWEEN 20 AND 24 THEN 'CODE_04'
    WHEN category BETWEEN 25 AND 29 THEN 'CODE_05'
  ELSE col_2
END

I'd say the current patch does not introduce worse estimation in any case.

}
if (argStats.size() % 2 == 1) {
combiner.add(argStats.get(argStats.size() - 1));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify the implementation and handle a few more general cases? This is an idea I'm not obsessed with.

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentTypeException {
    ...
    numberOfDistinctConstants = ???
  }

    static class WhenStatEstimator implements StatEstimator {

    @Override
    public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
      ...
      var statistics = combiner.getResult();
      if (statistics.getCountDistint() > 0 && numberOfDistinctConstants > statistics.getCountDistint()) {
        statistics.setCountDistinct(numberOfDistinctConstants);
      }
      return statistics;
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okumin could you please take a look at my latest changes? I believe the logic is much more straightforward now

break;
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is probably ok but I want to check it again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okumin I am unsure I fully understand this comment, could you please provide more info?

Copy link
Contributor

@okumin okumin Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@konstantinb Sorry for confusing you. This is a comment for myself. I took a glance at this code, and it seems to be OK, but I have not dug into the entire semantics of JoinStatsRule. I can't merge an OSS pull request with a very optimistic imagination. So, I want to take a deep look again later. I'll write this sort of info in my private note next time. Sorry

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may add more comments after checking the CI results.
Please also follow SonarQube
https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6244&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true

if (numberOfDistinctConstants > 1) {
ColStatistics constantsStat = new ColStatistics("_constants", "string");
constantsStat.setCountDistint(numberOfDistinctConstants);
combiner.add(constantsStat);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a bit more explicit? Let's say Alice will update PessimisticStatCombiner#add in 1 year. It is not easy for her to guess that something might add a dummy ColStatistics instance. I guess we can either implement the logic directly in each UDF or add a utility method to PessimisticStatCombiner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BranchingStatEstimator.estimate() post-processing seems like a natural fit for this, thank you

@@ -45,13 +48,16 @@
public class GenericUDFCoalesce extends GenericUDF implements StatEstimatorProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing we don't need to update this UDF. That's because the number of distinct values of coalesce(col_2, 'a', 'b', 'c', 'd') should be 2 or 3, since the result is either col_2 or 'a'. The original implementation might be more correct.

Copy link
Contributor Author

@konstantinb konstantinb Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the updates to CASE/WHEN/IF do not directly apply to COACECENSCE. applying a specific fix to return MAX(NDV(col1), ... NDV(colN)) + (1 if there's a trailing constant)

}
if (stat.isFilteredColumn()) {
result.setFilterColumn();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change this method? I'm expecting stat = result here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not something I actually changed; the diff shows up because I've removed some pre-existing empty lines per the Quality Gate feedback

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 7, 2026

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed 10% of files, which are likely major. Let me commit the current comments as a checkpoint.

ndv = StatsUtils.safeAdd(ndv, 1);
}
ndvValues.add(ndv);
ndvValues.add(getGroupingColumnNdv(cs, parentStats));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a separate pull request next time you update something with global impact. This PR affects approximately 200 test cases and would make it harder for a reviewer to validate them if it included two or more types of changes.

Copy link
Contributor

@okumin okumin Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking a glance at all test files, I started feeling I would like to separate unrelated changes, like below.

  • HIVE-29368: UDF changes
  • HIVE-XXXXX: cs.setCountDistint(csd.getTimestampStats().getNumDVs()) and similar changes
  • HIVE-XXXXX: getGroupingColumnNdv and related changed

This is because I can review each of them in 30 minutes if they are separated, so I will spend only 90 minutes in total. If all are included, it is not very obvious why each test case has changed. I need more focus, and we can't make a checkpoint because we can't merge it unless all changes are reasonable and all test cases are green (I know some integration tests are still failing and Sonar Cloud is reporting some remaining issues). This proposal is negotiable because it requires your efforts. I should have proposed it at the beginning.

Optional<ColStatistics> result = combiner.getResult();

// If there's a constant after columns, add 1 to NDV for that constant
if (result.isPresent() && firstConstantIndex > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it would be more consistent without this branch. The number of distinct values of coalesce(col_bool, false) is up to 2, and if(col_bool IS NOT NULL, col_bool, false) or an equivalent using CASE is likely to return 2, but this probably returns 3 if I understand correctly.

EXPLAIN SELECT CASE WHEN cond=1 THEN c2 ELSE c100 END x FROM t GROUP BY CASE WHEN cond=1 THEN c2 ELSE c100 END;

-- CASE WHEN: no ELSE clause (NDV=1, implicit NULL ELSE is not a ConstantObjectInspector)
EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END x FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this is not identical to this test case

@konstantinb
Copy link
Contributor Author

@okumin this set of changes, especially for PessimisticStatCombiner, does indeed appear to create more problems that it can solve; I am now trying a more focused fix #6308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants