-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29368: more conservative NDV combining by PessimisticStatCombiner #6244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…timestamp/date columns
|
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 500000 Data size: 139500000 Basic stats: COMPLETE Column stats: COMPLETE | ||
| value expressions: _col1 (type: bigint), _col2 (type: string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the test case on the current master branch and obtained the following result. The main difference is likely the number of rows generated by the ReduceSinkOperator: mine is 3, and yours is 500k. Since the map-side aggregation generates at most 20 keys, I'd say the estimation here should be O(N), where N = 20. Therefore, 3 is likely a more reasonable value to me. I guess I'm overlooking something, and I'd appreciate it if you could validate my assumption.
Map 1
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: CASE WHEN (cat BETWEEN 0 AND 4) THEN ('K00') WHEN (cat BETWEEN 5 AND 9) THEN ('K01') WHEN (cat BETWEEN 10 AND
14) THEN ('K02') WHEN (cat BETWEEN 15 AND 19) THEN ('K03') WHEN (cat BETWEEN 20 AND 24) THEN ('K04') WHEN (cat BETWEEN 25 AND 29) THEN ('K05')
WHEN (cat BETWEEN 30 AND 34) THEN ('K06') WHEN (cat BETWEEN 35 AND 39) THEN ('K07') WHEN (cat BETWEEN 40 AND 44) THEN ('K08') WHEN (cat BETWEEN
45 AND 49) THEN ('K09') WHEN (cat BETWEEN 50 AND 54) THEN ('K10') WHEN (cat BETWEEN 55 AND 59) THEN ('K11') WHEN (cat BETWEEN 60 AND 64) THEN
('K12') WHEN (cat BETWEEN 65 AND 69) THEN ('K13') WHEN (cat BETWEEN 70 AND 74) THEN ('K14') WHEN (cat BETWEEN 75 AND 79) THEN ('K15') WHEN (cat
BETWEEN 80 AND 84) THEN ('K16') WHEN (cat BETWEEN 85 AND 89) THEN ('K17') WHEN (cat BETWEEN 90 AND 94) THEN ('K18') ELSE ('K19') END (type: st
ring), val (type: bigint), data (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1000000 Data size: 596000000 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: sum(_col1), max(_col2)
keys: _col0 (type: string)
minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 3 Data size: 837 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint), _col2 (type: string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okumin thank you very much for your feedback. I got a bit carried away and overlooked so inflated estimation numbers. Trying a fix that calculatesd "honest" NDV of multibranch constant expressions before falling back to the "Pessimistic" combiner
…ses before falling back to pessimistic combining
| updatedCS.setAvgColLen(Math.max(updatedCS.getAvgColLen(), cs.getAvgColLen())); | ||
| updatedCS.setNumNulls(StatsUtils.safeAdd(updatedCS.getNumNulls(), cs.getNumNulls())); | ||
| updatedCS.setCountDistint(Math.max(updatedCS.getCountDistint(), cs.getCountDistint())); | ||
| if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if(updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) { | |
| if (updatedCS.getCountDistint() > 0 && cs.getCountDistint() > 0) { |
| */ | ||
| default Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) { | ||
| return estimate(argStats); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can satisfy the requirements for the current use case without adding a new method. We may obtain the required information via GenericUDF#initialize if it's been initialized. If not initialized, we will probably need this method. This example materializes a constant at compile-time.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java
| } | ||
|
|
||
| @Override | ||
| public Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm clarifying my understanding. Please let me know if I'm overlooking something.
Let's assume the number of distinct values of col_2 is 2, that of col_100 is 100, and that of col_999 is 999.
The true NDV of the following expression is 3. The original implementation returns 1, and this implementation returns 3.
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE 'CODE_ELSE'
ENDThat of this is 2. The original implementation returns 1, and this implementation returns 2.
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE 'CODE_01'
ENDThat of this is 100, 101, or 102. The original implementation returns 100, and this implementation returns 100.
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
ELSE col_100
ENDThat of this is 999 ~ 1100. The original implementation returns 999, and this implementation returns 999.
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN col_999
ELSE col_100
ENDThat of this is 6 ~ 8. The original implementation returns 2, and this implementation returns 2.
CASE
WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
WHEN category BETWEEN 10 AND 14 THEN 'CODE_02'
WHEN category BETWEEN 15 AND 19 THEN 'CODE_03'
WHEN category BETWEEN 20 AND 24 THEN 'CODE_04'
WHEN category BETWEEN 25 AND 29 THEN 'CODE_05'
ELSE col_2
ENDI'd say the current patch does not introduce worse estimation in any case.
| } | ||
| if (argStats.size() % 2 == 1) { | ||
| combiner.add(argStats.get(argStats.size() - 1)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify the implementation and handle a few more general cases? This is an idea I'm not obsessed with.
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentTypeException {
...
numberOfDistinctConstants = ???
}
static class WhenStatEstimator implements StatEstimator {
@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
...
var statistics = combiner.getResult();
if (statistics.getCountDistint() > 0 && numberOfDistinctConstants > statistics.getCountDistint()) {
statistics.setCountDistinct(numberOfDistinctConstants);
}
return statistics;
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okumin could you please take a look at my latest changes? I believe the logic is much more straightforward now
| break; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This is probably ok but I want to check it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okumin I am unsure I fully understand this comment, could you please provide more info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@konstantinb Sorry for confusing you. This is a comment for myself. I took a glance at this code, and it seems to be OK, but I have not dug into the entire semantics of JoinStatsRule. I can't merge an OSS pull request with a very optimistic imagination. So, I want to take a deep look again later. I'll write this sort of info in my private note next time. Sorry
…cStatCombiner to use more accurate stats while still falling back to "unknown NDV" when identified
okumin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may add more comments after checking the CI results.
Please also follow SonarQube
https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6244&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true
| if (numberOfDistinctConstants > 1) { | ||
| ColStatistics constantsStat = new ColStatistics("_constants", "string"); | ||
| constantsStat.setCountDistint(numberOfDistinctConstants); | ||
| combiner.add(constantsStat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a bit more explicit? Let's say Alice will update PessimisticStatCombiner#add in 1 year. It is not easy for her to guess that something might add a dummy ColStatistics instance. I guess we can either implement the logic directly in each UDF or add a utility method to PessimisticStatCombiner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BranchingStatEstimator.estimate() post-processing seems like a natural fit for this, thank you
| @@ -45,13 +48,16 @@ | |||
| public class GenericUDFCoalesce extends GenericUDF implements StatEstimatorProvider { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing we don't need to update this UDF. That's because the number of distinct values of coalesce(col_2, 'a', 'b', 'c', 'd') should be 2 or 3, since the result is either col_2 or 'a'. The original implementation might be more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the updates to CASE/WHEN/IF do not directly apply to COACECENSCE. applying a specific fix to return MAX(NDV(col1), ... NDV(colN)) + (1 if there's a trailing constant)
| } | ||
| if (stat.isFilteredColumn()) { | ||
| result.setFilterColumn(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change this method? I'm expecting stat = result here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not something I actually changed; the diff shows up because I've removed some pre-existing empty lines per the Quality Gate feedback
|
okumin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed 10% of files, which are likely major. Let me commit the current comments as a checkpoint.
| ndv = StatsUtils.safeAdd(ndv, 1); | ||
| } | ||
| ndvValues.add(ndv); | ||
| ndvValues.add(getGroupingColumnNdv(cs, parentStats)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a separate pull request next time you update something with global impact. This PR affects approximately 200 test cases and would make it harder for a reviewer to validate them if it included two or more types of changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking a glance at all test files, I started feeling I would like to separate unrelated changes, like below.
- HIVE-29368: UDF changes
- HIVE-XXXXX:
cs.setCountDistint(csd.getTimestampStats().getNumDVs())and similar changes - HIVE-XXXXX:
getGroupingColumnNdvand related changed
This is because I can review each of them in 30 minutes if they are separated, so I will spend only 90 minutes in total. If all are included, it is not very obvious why each test case has changed. I need more focus, and we can't make a checkpoint because we can't merge it unless all changes are reasonable and all test cases are green (I know some integration tests are still failing and Sonar Cloud is reporting some remaining issues). This proposal is negotiable because it requires your efforts. I should have proposed it at the beginning.
| Optional<ColStatistics> result = combiner.getResult(); | ||
|
|
||
| // If there's a constant after columns, add 1 to NDV for that constant | ||
| if (result.isPresent() && firstConstantIndex > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would be more consistent without this branch. The number of distinct values of coalesce(col_bool, false) is up to 2, and if(col_bool IS NOT NULL, col_bool, false) or an equivalent using CASE is likely to return 2, but this probably returns 3 if I understand correctly.
| EXPLAIN SELECT CASE WHEN cond=1 THEN c2 ELSE c100 END x FROM t GROUP BY CASE WHEN cond=1 THEN c2 ELSE c100 END; | ||
|
|
||
| -- CASE WHEN: no ELSE clause (NDV=1, implicit NULL ELSE is not a ConstantObjectInspector) | ||
| EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END x FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this is not identical to this test case



What changes were proposed in this pull request?
HIVE-29368: more conservative NDV combining by PessimisticStatCombiner
Why are the changes needed?
These changes prevent severe underestimation of records' statistics, which often lead to query failures on large data sets
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Extensive regression testing in a private fork; new and updated query files in this PR