Description
For one of the users in OpenSearch, we observed a sample query like the one below (the full boolean query is redacted; only the must_not part is kept for reference):
{"bool":{"must_not":[{"terms":{"category":["type_a","type_b","type_c","type_d","type_e","type_f","type_g","type_h","type_i","type_j","type_k"]}}]}}
They have an OpenSearch index of ~2B docs across many shards, and after upgrading from Lucene 9.x to Lucene 10.x we observed a significant performance regression on this query pattern.
I see that major changes were made in ReqExclBulkScorer (which handles the exclusion part), where the docIDRunEnd() optimization was added: it excludes a batch of matching docs efficiently rather than going over each matching doc one by one. When a term matches a dense contiguous block of documents, this lets the scorer jump over the entire block in a single step instead of iterating through each excluded doc individually.
How Lucene 9 handled it:
In Lucene 9, multiple MUST_NOT terms were merged using a min-heap (DisiPriorityQueue). The heap always surfaces the smallest excluded doc ID across all term iterators, so each excluded doc costs one heap operation, i.e. O(log K), where K is the number of excluded terms. Simple, predictable, and it scales well regardless of how the excluded docs are distributed.
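To make the O(log K) behavior concrete, here is a minimal sketch of heap-based merging of K sorted exclusion iterators. The class and method names (`HeapMerge`, `Cursor`, `nextExcluded`) are hypothetical, not Lucene's actual DisiPriorityQueue API; this only illustrates the access pattern, where each excluded doc costs one poll/re-insert on the heap.

```java
import java.util.PriorityQueue;

// Hypothetical sketch (not Lucene's real classes): merge K sorted lists of
// excluded doc IDs with a min-heap, the way Lucene 9 conceptually did.
public class HeapMerge {
    // One posting list = a sorted array of doc IDs plus a cursor.
    static final class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
        boolean advance() { return ++pos < docs.length; }
    }

    private final PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));

    public HeapMerge(int[][] postings) {
        for (int[] p : postings) if (p.length > 0) heap.add(new Cursor(p));
    }

    /** Returns the next smallest excluded doc ID, or -1 when exhausted. */
    public int nextExcluded() {
        Cursor top = heap.poll();
        if (top == null) return -1;
        int doc = top.doc();
        if (top.advance()) heap.add(top); // one heap op: O(log K)
        // Several terms may exclude the same doc; drain duplicates.
        while (!heap.isEmpty() && heap.peek().doc() == doc) {
            Cursor dup = heap.poll();
            if (dup.advance()) heap.add(dup);
        }
        return doc;
    }
}
```

Note that the cost per call is independent of how the lists interleave, which is exactly the property lost in the regression described below.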
What changed in Lucene 10:
In Lucene 10, ReqExclBulkScorer now calls docIDRunEnd() on every excluded doc hit to try to skip an entire run at once. For a single MUST_NOT term with a dense posting list this is a genuine win: the scorer can skip thousands of docs in one step. However, when there are many MUST_NOT terms (11 in this case) whose values are interleaved across the index, docIDRunEnd() triggers DisjunctionDISIApproximation.computeTopList(), which linearly scans all K iterators on every excluded doc hit.
With 11 interleaved terms, the lead iterator can rarely claim a run longer than 1 (or a very small value), because the next excluded doc almost always belongs to a different term at a different position. Yet computeTopList() still has to linearly scan all K iterators on every hit just to confirm this, paying O(K) for a result that is no better than the old upTo += 1. This turns the per-doc cost from O(log K) in Lucene 9 into O(K) in Lucene 10 for this pattern.
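A small sketch may help show why run-skipping helps dense single-term exclusions but not interleaved multi-term ones. This is not Lucene's actual docIDRunEnd() code; it is a hypothetical simplification (`RunEnd.docIDRunEnd`) that extends a run of consecutive doc IDs over the union of K lists, scanning up to K lists for each extension attempt, i.e. O(K) per excluded doc when runs are short.

```java
import java.util.Arrays;

// Hypothetical sketch (not Lucene's code): compute the end of the
// contiguous run of doc IDs, starting at `doc`, covered by the union of
// K sorted exclusion lists.
public class RunEnd {
    /** Returns the first doc ID >= doc NOT covered by any list. */
    public static int docIDRunEnd(int[][] postings, int doc) {
        // Each extension step scans up to K lists: cheap if one step
        // skips a long dense block, wasteful if runs have length 1.
        while (coveredByAny(postings, doc)) doc++;
        return doc;
    }

    private static boolean coveredByAny(int[][] postings, int doc) {
        for (int[] p : postings) { // linear scan over the K lists
            if (Arrays.binarySearch(p, doc) >= 0) return true;
        }
        return false;
    }
}
```

For a single dense list like {5,6,7,8}, one call skips the whole block. For two interleaved lists like {0,3,6} and {1,4,7}, the runs are only two docs long, so the scorer keeps paying the per-call scan over all lists for almost no skipping, which is the regression shape reported here.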
In short, the optimization works well with few MUST_NOT terms (1-2) but regresses with many (like 11) whose posting lists are interleaved, which is a common pattern when filtering by multiple category values on a keyword field.
Sample hot thread dump we saw:
```
70.1% (350.2ms out of 500ms) cpu usage by thread 'opensearch[...][search][T#42]'
  3/10 snapshots sharing following 34 elements
    app//org.apache.lucene.search.DisjunctionDISIApproximation.computeTopList(DisjunctionDISIApproximation.java:179)
    app//org.apache.lucene.search.DisjunctionDISIApproximation.topList(DisjunctionDISIApproximation.java:169)
    app//org.apache.lucene.search.DisjunctionDISIApproximation.docIDRunEnd(DisjunctionDISIApproximation.java:194)
    app//org.apache.lucene.search.ReqExclBulkScorer.score(ReqExclBulkScorer.java:62)
    app//org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:367)
    app//org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:282)
```
Suggested fix
Maybe skip this approach when the exclusion is a disjunction of many terms (>2), falling back to the simpler one-doc-at-a-time approach from Lucene 9, which seems more efficient for this pattern.
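The suggested heuristic could be sketched as a simple strategy selection on the number of MUST_NOT sub-iterators. The class and threshold below are hypothetical, not a proposed Lucene patch; the real cut-over point (2 here, taken from the observation above) would need benchmarking.

```java
// Hypothetical sketch of the suggested fix: choose the exclusion
// strategy from the number of MUST_NOT sub-iterators.
public class ExclusionStrategy {
    public static String choose(int numExcludedTerms) {
        // Few terms: runs can be long, docIDRunEnd()-style skipping wins.
        if (numExcludedTerms <= 2) return "run-skipping";
        // Many interleaved terms: heap-based one-doc-at-a-time is cheaper.
        return "heap-merge";
    }
}
```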