HIVE-29368: more conservative NDV combining by PessimisticStatCombiner #6244

konstantinb · 2025-12-18T18:50:03Z

What changes were proposed in this pull request?

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

Why are the changes needed?

These changes prevent severe underestimation of record statistics, which often leads to query failures on large data sets.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Extensive regression testing in a private fork; new and updated query files in this PR

sonarqubecloud · 2025-12-30T18:35:55Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

deniskuzZ · 2026-01-15T13:33:17Z

cc @zabetak, @thomasrebele

okumin · 2026-02-04T06:04:33Z

ql/src/test/results/clientpositive/llap/pessimistic_stat_combiner_ndv.q.out

+                        sort order: +
+                        Map-reduce partition columns: _col0 (type: string)
+                        Statistics: Num rows: 500000 Data size: 139500000 Basic stats: COMPLETE Column stats: COMPLETE
+                        value expressions: _col1 (type: bigint), _col2 (type: string)


I ran the test case on the current master branch and obtained the following result. The main difference is likely the number of rows generated by the ReduceSinkOperator: mine is 3, and yours is 500k. Since the map-side aggregation generates at most 20 keys, I'd say the estimation here should be O(N), where N = 20. Therefore, 3 is likely a more reasonable value to me. I guess I'm overlooking something, and I'd appreciate it if you could validate my assumption.

Map 1
    Map Operator Tree:
        TableScan
          alias: t1
          Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: CASE WHEN (cat BETWEEN 0 AND 4) THEN ('K00') WHEN (cat BETWEEN 5 AND 9) THEN ('K01') WHEN (cat BETWEEN 10 AND 14) THEN ('K02') WHEN (cat BETWEEN 15 AND 19) THEN ('K03') WHEN (cat BETWEEN 20 AND 24) THEN ('K04') WHEN (cat BETWEEN 25 AND 29) THEN ('K05') WHEN (cat BETWEEN 30 AND 34) THEN ('K06') WHEN (cat BETWEEN 35 AND 39) THEN ('K07') WHEN (cat BETWEEN 40 AND 44) THEN ('K08') WHEN (cat BETWEEN 45 AND 49) THEN ('K09') WHEN (cat BETWEEN 50 AND 54) THEN ('K10') WHEN (cat BETWEEN 55 AND 59) THEN ('K11') WHEN (cat BETWEEN 60 AND 64) THEN ('K12') WHEN (cat BETWEEN 65 AND 69) THEN ('K13') WHEN (cat BETWEEN 70 AND 74) THEN ('K14') WHEN (cat BETWEEN 75 AND 79) THEN ('K15') WHEN (cat BETWEEN 80 AND 84) THEN ('K16') WHEN (cat BETWEEN 85 AND 89) THEN ('K17') WHEN (cat BETWEEN 90 AND 94) THEN ('K18') ELSE ('K19') END (type: string), val (type: bigint), data (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
            Group By Operator
              aggregations: sum(_col1), max(_col2)
              keys: _col0 (type: string)
              minReductionHashAggr: 0.99
              mode: hash
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                null sort order: z
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
                value expressions: _col1 (type: bigint), _col2 (type: string)
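The O(N) argument in this comment can be illustrated with a small sketch: the number of groups a GROUP BY can produce is bounded by both the input row count and the product of the key NDVs. `GroupByRowBound` and `upperBound` are hypothetical names, not Hive code, and the sketch deliberately ignores parallelism and Hive's actual hash-aggregation formula.

```java
import java.util.List;

/** Hypothetical helper, not part of Hive: caps a GROUP BY row estimate by key cardinality. */
final class GroupByRowBound {
  private GroupByRowBound() {}

  /**
   * Returns min(inputRows, product of key NDVs).
   * An NDV of 0 or less is treated as "unknown", so only the input row count bounds the result.
   */
  static long upperBound(long inputRows, List<Long> keyNdvs) {
    long product = 1L;
    for (long ndv : keyNdvs) {
      if (ndv <= 0) {
        return inputRows; // unknown NDV: no tighter bound is available
      }
      // If product * ndv would exceed inputRows, the row count is the tighter bound.
      // Checking via division also avoids long overflow.
      if (product > inputRows / ndv) {
        return inputRows;
      }
      product *= ndv;
    }
    return Math.min(inputRows, product);
  }

  public static void main(String[] args) {
    // 1,000,000 input rows grouped by a CASE expression with 20 possible outputs.
    System.out.println(upperBound(1_000_000L, List.of(20L)));
  }
}
```

Under these assumptions, an estimate of 500k rows for a key with at most 20 values is far above the bound, while 3 is below it.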

@okumin thank you very much for your feedback. I got a bit carried away and overlooked such inflated estimates. I'm trying a fix that calculates an "honest" NDV for multi-branch constant expressions before falling back to the "Pessimistic" combiner.

okumin · 2026-02-05T03:01:24Z

ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java

            updatedCS.setAvgColLen(Math.max(updatedCS.getAvgColLen(), cs.getAvgColLen()));
            updatedCS.setNumNulls(StatsUtils.safeAdd(updatedCS.getNumNulls(), cs.getNumNulls()));
-            updatedCS.setCountDistint(Math.max(updatedCS.getCountDistint(), cs.getCountDistint()));
+            if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {


Suggested change

if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {

if (updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) {
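For context, the condition in this diff appears to treat an NDV of 0 as "unknown". The following is a minimal sketch of that kind of merge. The diff above is truncated, so the fallback branch is an assumption, and `NdvMerge`, `merge`, and `UNKNOWN` are hypothetical names rather than Hive code.

```java
/** Hypothetical illustration, not Hive code: merging two NDVs where 0 means "unknown". */
final class NdvMerge {
  static final long UNKNOWN = 0L;

  private NdvMerge() {}

  /**
   * When both NDVs are known, keep the larger one (the pre-existing behaviour).
   * When either side is unknown, the result is assumed to stay unknown
   * instead of silently adopting the known side.
   */
  static long merge(long left, long right) {
    if (left > 0 && right > 0) {
      return Math.max(left, right);
    }
    return UNKNOWN; // assumption: the truncated else-branch propagates "unknown"
  }
}
```

By contrast, an unconditional `Math.max` would report 9 for `merge(0, 9)`, hiding the fact that one input had no usable statistics.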

okumin · 2026-02-05T04:40:56Z

ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java

+   */
+  default Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) {
+    return estimate(argStats);
+  }


I guess we can satisfy the requirements for the current use case without adding a new method. We may obtain the required information via GenericUDF#initialize if it's been initialized. If not initialized, we will probably need this method. This example materializes a constant at compile-time.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java

okumin · 2026-02-05T05:01:43Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java

+    }
+
+    @Override
+    public Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) {


I'm clarifying my understanding. Please let me know if I'm overlooking something.
Let's assume the number of distinct values of col_2 is 2, that of col_100 is 100, and that of col_999 is 999.

The true NDV of the following expression is 3. The original implementation returns 1, and this implementation returns 3.

CASE WHEN category BETWEEN 0 AND 4 THEN 'CODE_00' WHEN category BETWEEN 5 AND 9 THEN 'CODE_01' ELSE 'CODE_ELSE' END

The true NDV of this one is 2. The original implementation returns 1, and this implementation returns 2.

CASE WHEN category BETWEEN 0 AND 4 THEN 'CODE_00' WHEN category BETWEEN 5 AND 9 THEN 'CODE_01' ELSE 'CODE_01' END

The true NDV of this one is 100, 101, or 102. The original implementation returns 100, and this implementation returns 100.

CASE WHEN category BETWEEN 0 AND 4 THEN 'CODE_00' WHEN category BETWEEN 5 AND 9 THEN 'CODE_01' ELSE col_100 END

The true NDV of this one is between 999 and 1100. The original implementation returns 999, and this implementation returns 999.

CASE WHEN category BETWEEN 0 AND 4 THEN 'CODE_00' WHEN category BETWEEN 5 AND 9 THEN col_999 ELSE col_100 END

The true NDV of this one is between 6 and 8. The original implementation returns 2, and this implementation returns 2.

CASE WHEN category BETWEEN 0 AND 4 THEN 'CODE_00' WHEN category BETWEEN 5 AND 9 THEN 'CODE_01' WHEN category BETWEEN 10 AND 14 THEN 'CODE_02' WHEN category BETWEEN 15 AND 19 THEN 'CODE_03' WHEN category BETWEEN 20 AND 24 THEN 'CODE_04' WHEN category BETWEEN 25 AND 29 THEN 'CODE_05' ELSE col_2 END

I'd say the current patch does not introduce worse estimation in any case.
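The ranges quoted in this comment can be reproduced with a short sketch. It assumes every branch is reachable and that the branch conditions do not filter the column values, so the lower bound is the larger of the distinct-constant count and the largest column NDV, and the upper bound is their sum. `CaseNdvBounds` is a hypothetical name, not part of the patch.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical illustration, not Hive code: NDV bounds for a CASE expression. */
final class CaseNdvBounds {
  private CaseNdvBounds() {}

  /** Lower bound: max(number of distinct constant branches, largest column NDV). */
  static long lower(List<String> constantBranches, List<Long> columnNdvs) {
    long best = new HashSet<>(constantBranches).size();
    for (long ndv : columnNdvs) {
      best = Math.max(best, ndv);
    }
    return best;
  }

  /** Upper bound: distinct constants plus all column NDVs, assuming no overlapping values. */
  static long upper(List<String> constantBranches, List<Long> columnNdvs) {
    long total = new HashSet<>(constantBranches).size();
    for (long ndv : columnNdvs) {
      total += ndv;
    }
    return total;
  }

  public static void main(String[] args) {
    // Two constants plus col_100: the comment quotes 100, 101, or 102.
    List<String> constants = List.of("CODE_00", "CODE_01");
    List<Long> columns = List.of(100L);
    System.out.println(lower(constants, columns) + " .. " + upper(constants, columns));
  }
}
```

Duplicate constants such as a repeated 'CODE_01' collapse in the `HashSet`, which is why the second example has a true NDV of 2 rather than 3.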

okumin · 2026-02-05T05:07:36Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java

      }
      if (argStats.size() % 2 == 1) {
        combiner.add(argStats.get(argStats.size() - 1));
      }


Can we simplify the implementation and handle a few more general cases? This is an idea I'm not obsessed with.

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentTypeException {
      ...
      numberOfDistinctConstants = ???
    }

    static class WhenStatEstimator implements StatEstimator {
      @Override
      public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
        ...
        var statistics = combiner.getResult();
        if (statistics.getCountDistint() > 0 && numberOfDistinctConstants > statistics.getCountDistint()) {
          statistics.setCountDistint(numberOfDistinctConstants);
        }
        return statistics;
      }
    }
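The post-processing step proposed here can be isolated as a pure function, which makes the rule easy to check: the combined NDV is raised to the number of distinct constants only when the combined NDV is known (greater than 0). `ConstantFloor` and `adjust` are hypothetical names used for illustration only.

```java
/** Hypothetical illustration, not Hive code: raise a known NDV to the distinct-constant count. */
final class ConstantFloor {
  private ConstantFloor() {}

  /**
   * @param combinedNdv NDV produced by the combiner, where 0 means "unknown"
   * @param numberOfDistinctConstants distinct constant branches found at initialization
   * @return the adjusted NDV; an unknown NDV is left untouched
   */
  static long adjust(long combinedNdv, long numberOfDistinctConstants) {
    if (combinedNdv > 0 && numberOfDistinctConstants > combinedNdv) {
      return numberOfDistinctConstants;
    }
    return combinedNdv;
  }
}
```

For example, a CASE with three distinct constant branches and a combined NDV of 1 would be adjusted to 3, while a combined NDV of 100 would stay at 100.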

@okumin could you please take a look at my latest changes? I believe the logic is much more straightforward now

okumin · 2026-02-05T05:28:18Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

+            break;
+          }
+        }
+      }


Note: This is probably ok but I want to check it again

@okumin I am unsure I fully understand this comment, could you please provide more info?

@konstantinb Sorry for confusing you. This is a comment for myself. I took a glance at this code, and it seems to be OK, but I have not dug into the entire semantics of JoinStatsRule. I can't merge an OSS pull request based on an optimistic guess, so I want to take a deeper look later. I'll keep this sort of note private next time. Sorry.

okumin

I may add more comments after checking the CI results.
Please also address the issues reported by SonarQube:
https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6244&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true

okumin · 2026-02-06T01:27:13Z

ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/BranchingStatEstimator.java

+    if (numberOfDistinctConstants > 1) {
+      ColStatistics constantsStat = new ColStatistics("_constants", "string");
+      constantsStat.setCountDistint(numberOfDistinctConstants);
+      combiner.add(constantsStat);


Can we make this a bit more explicit? Let's say Alice will update PessimisticStatCombiner#add in 1 year. It is not easy for her to guess that something might add a dummy ColStatistics instance. I guess we can either implement the logic directly in each UDF or add a utility method to PessimisticStatCombiner.

BranchingStatEstimator.estimate() post-processing seems like a natural fit for this, thank you

okumin · 2026-02-06T01:36:47Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFCoalesce.java

@@ -45,13 +48,16 @@
 public class GenericUDFCoalesce extends GenericUDF implements StatEstimatorProvider {


I'm guessing we don't need to update this UDF. That's because the number of distinct values of coalesce(col_2, 'a', 'b', 'c', 'd') should be 2 or 3, since the result is either col_2 or 'a'. The original implementation might be more correct.

I agree that the updates to CASE/WHEN/IF do not directly apply to COALESCE. I'm applying a specific fix to return MAX(NDV(col1), ... NDV(colN)) + (1 if there's a trailing constant).

okumin · 2026-02-06T03:24:36Z

ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java

+    }
+    if (stat.isFilteredColumn()) {
+      result.setFilterColumn();
+    }


Do we need to change this method? I'm expecting stat = result here

this is not something I actually changed; the diff shows up because I've removed some pre-existing empty lines per the Quality Gate feedback

sonarqubecloud · 2026-02-07T03:20:33Z

Quality Gate passed

Issues
8 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
1.9% Duplication on New Code

See analysis details on SonarQube Cloud

okumin

I've reviewed about 10% of the files, which are likely the major ones. Let me commit the current comments as a checkpoint.

okumin · 2026-02-09T06:04:32Z

ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

-          ndv = StatsUtils.safeAdd(ndv, 1);
-        }
-        ndvValues.add(ndv);
+        ndvValues.add(getGroupingColumnNdv(cs, parentStats));


Please create a separate pull request next time you update something with global impact. This PR affects approximately 200 test cases and would make it harder for a reviewer to validate them if it included two or more types of changes.

After taking a glance at all test files, I started feeling I would like to separate unrelated changes, like below.

HIVE-29368: UDF changes

HIVE-XXXXX: cs.setCountDistint(csd.getTimestampStats().getNumDVs()) and similar changes

HIVE-XXXXX: getGroupingColumnNdv and related changes

This is because I can review each of them in 30 minutes if they are separated, so I will spend only 90 minutes in total. If all are included, it is not very obvious why each test case has changed. I need more focus, and we can't make a checkpoint because we can't merge it unless all changes are reasonable and all test cases are green (I know some integration tests are still failing and Sonar Cloud is reporting some remaining issues). This proposal is negotiable because it requires your efforts. I should have proposed it at the beginning.

okumin · 2026-02-09T06:30:15Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFCoalesce.java

+      Optional<ColStatistics> result = combiner.getResult();
+
+      // If there's a constant after columns, add 1 to NDV for that constant
+      if (result.isPresent() && firstConstantIndex > 0) {


I guess it would be more consistent without this branch. The number of distinct values of coalesce(col_bool, false) is up to 2, and if(col_bool IS NOT NULL, col_bool, false) or an equivalent using CASE is likely to return 2, but this probably returns 3 if I understand correctly.
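The boolean example can be checked by brute force. The sketch below enumerates every value a nullable boolean column can hold, applies a minimal stand-in for COALESCE, and counts the distinct outputs; it also shows what a "column NDV + 1" rule would report. `CoalesceBoolNdv`, `coalesce`, and `plusOneHeuristic` are hypothetical names, not the code under review.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration, not Hive code: distinct outputs of coalesce(col_bool, false). */
final class CoalesceBoolNdv {
  private CoalesceBoolNdv() {}

  /** Minimal stand-in for SQL COALESCE: the first non-null argument, or null if all are null. */
  @SafeVarargs
  static <T> T coalesce(T... args) {
    for (T arg : args) {
      if (arg != null) {
        return arg;
      }
    }
    return null;
  }

  /** Enumerates the whole domain of a nullable boolean column and counts distinct results. */
  static int distinctOutputs() {
    List<Boolean> columnDomain = Arrays.asList(Boolean.TRUE, Boolean.FALSE, null);
    Set<Boolean> outputs = new HashSet<>();
    for (Boolean value : columnDomain) {
      outputs.add(coalesce(value, Boolean.FALSE));
    }
    return outputs.size();
  }

  /** The rule questioned in the comment: add 1 to the column NDV for the trailing constant. */
  static long plusOneHeuristic(long columnNdv) {
    return columnNdv + 1;
  }
}
```

Here `distinctOutputs()` yields 2, while `plusOneHeuristic(2)` yields 3, which is the overestimate the comment describes: the trailing constant may already be one of the column's values.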

okumin · 2026-02-09T06:58:35Z

ql/src/test/queries/clientpositive/branching_expr_ndv.q

+EXPLAIN SELECT CASE WHEN cond=1 THEN c2 ELSE c100 END x FROM t GROUP BY CASE WHEN cond=1 THEN c2 ELSE c100 END;
+
+-- CASE WHEN: no ELSE clause (NDV=1, implicit NULL ELSE is not a ConstantObjectInspector)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END x FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END;


I wonder why this is not identical to this test case

konstantinb · 2026-02-10T00:43:12Z

@okumin this set of changes, especially for PessimisticStatCombiner, does indeed appear to create more problems than it solves; I am now trying a more focused fix in #6308

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

633951c

asf-ci-hive added tests pending tests unstable and removed tests pending labels Dec 18, 2025

konstantinb added 2 commits December 18, 2025 17:19

HIVE-29368: regenerated impacted test results + added an explanation …

199c441

…comment

HIVE-29368: one more test file, modified using explain output only fo…

f0022f7

…r now

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

HIVE-29368: only increment ndv by one in extractNDVGroupingColumns() i…

bd86e3c

…f it is "known"

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

HIVE-29368: further tuning NDV handling, including reading stats for …

75dbdf8

…timestamp/date columns

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

Merge origin/master into HIVE-29368

0ddef8c

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 22, 2025

asf-ci-hive added tests passed and removed tests pending labels Dec 30, 2025

okumin reviewed Feb 4, 2026

konstantinb added 2 commits February 4, 2026 15:37

HIVE-29368: trying a more intelligent NDV estimate for CASE/WHEN clau…

cf4fa0b

…ses before falling back to pessimistic combining

Merge remote-tracking branch 'origin/master' into HIVE-29368

5bd21e6

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels Feb 4, 2026

okumin reviewed Feb 5, 2026

konstantinb added 3 commits February 5, 2026 11:06

HIVE-29368: refactoring constant NDV estimates as per the PR feedback

c18f8cd

HIVE-29368: further refactoring constant NDV estimates and Pessimisti…

bb7c3fd

…cStatCombiner to use more accurate stats while still falling back to "unknown NDV" when identified

HIVE-29368: a misc tweak for empty tables + .out changes

bdc395f

asf-ci-hive added tests pending and removed tests unstable labels Feb 6, 2026

okumin reviewed Feb 6, 2026

asf-ci-hive added tests failed and removed tests pending labels Feb 6, 2026

okumin reviewed Feb 6, 2026

konstantinb added 2 commits February 6, 2026 16:41

HIVE-29368: PR feedback + some SonarQube items

459e85f

HIVE-29368: .out files + a misc tweak for NULL IF NDVs

b59cc9d

asf-ci-hive added tests pending and removed tests failed labels Feb 7, 2026

asf-ci-hive added tests unstable and removed tests pending labels Feb 7, 2026

okumin reviewed Feb 9, 2026
