Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Reduce FST block size for BlockTreeTermsWriter #12604

Merged
merged 6 commits into from
Oct 4, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Sep 28, 2023

Description

https://blunders.io/jfr-demo/indexing-4kb-2023.09.25.18.03.36/allocations-drill-down

Nightly benchmark shows that FSTCompiler#init allocated most of the memory during indexing. This is because FSTCompiler#init will always allocate 32k bytes as we param bytesPageBits default to 15. I counted the usage of BytesStore (getPosition() when BytesStore#finish called) during the wikimediumall indexing, and the result shows that 99% FST won't even use more than 1k bytes.

BytesStore#finish called: 1000000 times

min: 1
mid: 16
avg: 64.555987
pct75: 28
pct90: 57
pct99: 525
pct999: 4957
pct9999: 29124
max: 631700

This PR proposes to reduce the block size of FST in Lucene90BlockTreeTermsWriter.

closes #12598

@gf2121
Copy link
Contributor Author

gf2121 commented Sep 29, 2023

Here is the young GC statistics and allocation profile after indexingwikimedium10m (without facets and dvs)

  main patch  diff
Time in Young Generation GC 1245 864 -30.60%
Collections 525 321 -38.86%

 

Baseline Allocation Profile

PERCENT       HEAP SAMPLES  STACK
39.26%        125353M       org.apache.lucene.util.fst.BytesStore#writeByte()
5.33%         17008M        org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter#newTermState()
5.07%         16188M        java.util.Arrays#copyOfRange()
4.33%         13832M        java.lang.StringUTF16#compress()
3.97%         12688M        java.util.HashMap#newNode()
3.43%         10936M        org.apache.lucene.util.ByteBlockPool$DirectTrackingAllocator#getByteBlock()
3.28%         10461M        org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray#<init>()
2.70%         8635M         java.util.Arrays#copyOf()
2.60%         8307M         org.apache.lucene.index.ParallelPostingsArray#<init>()
2.43%         7743M         org.apache.lucene.util.BytesRef#<init>()
2.02%         6435M         org.apache.lucene.util.LongHeap#<init>()
1.94%         6202M         java.lang.String#<init>()
1.71%         5451M         org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingTerm#<init>()
1.54%         4905M         org.apache.lucene.codecs.lucene90.Lucene90NormsProducer#getNorms()
1.50%         4779M         org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#write()
1.26%         4025M         java.util.ArrayList#grow()
1.20%         3835M         org.apache.lucene.util.TimSorter#<init>()
0.99%         3163M         java.util.HashMap#resize()
0.97%         3089M         org.apache.lucene.util.fst.FSTCompiler#<init>()
0.77%         2464M         org.apache.lucene.util.ArrayUtil#growExact()
0.72%         2292M         org.apache.lucene.util.fst.FSTCompiler$UnCompiledNode#<init>()
0.72%         2286M         org.apache.lucene.util.fst.FSTEnum#getArc()
0.63%         1998M         org.apache.lucene.util.BytesRefHash#rehash()
0.59%         1887M         java.util.regex.Matcher#<init>()
0.58%         1848M         org.apache.lucene.codecs.lucene90.PForUtil#encode()
0.55%         1758M         java.text.CalendarBuilder#<init>()
0.52%         1674M         java.text.SimpleDateFormat#subParse()
0.42%         1334M         org.apache.lucene.codecs.CompetitiveImpactAccumulator#getCompetitiveFreqNormPairs()
0.39%         1252M         java.lang.StringBuffer#toString()
0.39%         1251M         java.text.DecimalFormat#parse()

Candidate Allocation Profile

PERCENT       HEAP SAMPLES  STACK
8.56%         16863M        org.apache.lucene.codecs.lucene90.Lucene90PostingsWriter#newTermState()
8.27%         16280M        java.util.Arrays#copyOfRange()
6.89%         13577M        java.lang.StringUTF16#compress()
6.31%         12423M        java.util.HashMap#newNode()
5.59%         11007M        org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray#<init>()
5.40%         10633M        org.apache.lucene.util.ByteBlockPool$DirectTrackingAllocator#getByteBlock()
4.39%         8637M         org.apache.lucene.index.ParallelPostingsArray#<init>()
4.13%         8129M         java.util.Arrays#copyOf()
3.61%         7117M         org.apache.lucene.util.BytesRef#<init>()
3.55%         6985M         org.apache.lucene.util.LongHeap#<init>()
3.17%         6236M         java.lang.String#<init>()
2.84%         5584M         org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingTerm#<init>()
2.59%         5101M         org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#write()
2.52%         4970M         org.apache.lucene.codecs.lucene90.Lucene90NormsProducer#getNorms()
2.10%         4131M         org.apache.lucene.util.fst.BytesStore#writeByte()
1.92%         3788M         org.apache.lucene.util.TimSorter#<init>()
1.72%         3382M         java.util.HashMap#resize()
1.54%         3041M         org.apache.lucene.util.fst.FSTCompiler#<init>()
1.31%         2574M         org.apache.lucene.util.ArrayUtil#growExact()
1.21%         2374M         org.apache.lucene.util.fst.FSTCompiler$UnCompiledNode#<init>()
1.04%         2049M         org.apache.lucene.util.fst.FSTEnum#getArc()
0.98%         1929M         java.text.CalendarBuilder#<init>()
0.98%         1926M         java.util.ArrayList#grow()
0.98%         1926M         org.apache.lucene.util.BytesRefHash#rehash()
0.85%         1677M         org.apache.lucene.codecs.lucene90.PForUtil#encode()
0.84%         1647M         java.text.SimpleDateFormat#subParse()
0.78%         1543M         java.util.regex.Matcher#<init>()
0.71%         1402M         java.text.DecimalFormat#parse()
0.70%         1376M         org.apache.lucene.codecs.CompetitiveImpactAccumulator#getCompetitiveFreqNormPairs()
0.68%         1338M         sun.util.locale.provider.DateFormatSymbolsProviderImpl#getInstance()

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 2, 2023

Hi @jpountz ! Would you please take a look at this PR when you have time? Looking forward to getting your suggestions on this topic ~

@gf2121 gf2121 requested a review from jpountz October 2, 2023 09:15
@jpountz
Copy link
Contributor

jpountz commented Oct 2, 2023

Oh, interesting find, it makes sense to me but I'm not the most familiar one with this piece of code. @mikemccand or @s1monw what do you think?

@s1monw
Copy link
Member

s1monw commented Oct 2, 2023

This change makes sense to me. @mikemccand WDYT

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only possible risk I see is that when writing truly massive segments with many terms in a given field, a large FST will now require 32X the number of blocks to hold it. This larger array will add GC cost, slow down writing a bit, etc.

But I think that's an OK tradeoff: writing such large segments is already massively more costly than writing tiny segments, so in proportion that added cost is acceptable.

We are not really at risk of exhausting the max size of a java array -- that'd happen if one tried to build a ~2 TB FST, which is not even really feasible today since the FST is still fully heap resident at write time.

Thanks @gf2121!

@@ -163,6 +163,8 @@ Optimizations
* GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
descending score. (Adrien Grand)

* GITHUB#12604: Reduce block size of FST BytesStore in BlockTreeTermsWriter. (Guo Feng)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some words that the end user could understand -- maybe reducing GC load during indexing or so?

@s1monw
Copy link
Member

s1monw commented Oct 3, 2023

@mikemccand maybe we can tradeoff here between segments we write the first time ie through IW and segments we write caused by a merge? it might mitigate your concerns.

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 3, 2023

Thanks for all review and suggestions here!

@mikemccand maybe we can tradeoff here between segments we write the first time ie through IW and segments we write caused by a merge? it might mitigate your concerns.

Thanks @s1monw , I really like the idea that we can estimate the page size before building the FST!

A tiny concern is that we could probably build a big FST if IW has a large flush buffer, or we could build small FST when tiny segments merge. This commit tries a way to estimate a more accurate size of the FST.

I did the similar count of BytesStore usage for wikimediumall again:

FST built 1000000 times

min ProfileInfo{bytesUsed=1, estimateSize=0, pageBits=6, pageNum=1}
pct50 ProfileInfo{bytesUsed=16, estimateSize=5, pageBits=6, pageNum=1}
pct75 ProfileInfo{bytesUsed=23, estimateSize=17, pageBits=6, pageNum=1}
pct90 ProfileInfo{bytesUsed=43, estimateSize=44, pageBits=6, pageNum=1}
pct99 ProfileInfo{bytesUsed=539, estimateSize=563, pageBits=10, pageNum=1}
pct999 ProfileInfo{bytesUsed=5026, estimateSize=4641, pageBits=13, pageNum=1}
pct9999 ProfileInfo{bytesUsed=32524, estimateSize=31522, pageBits=15, pageNum=1}
max ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}

 
I also get the percentile info of pageNum. It shows that we are using <= 3 page for 99.99% FSTs, and at most using 20 page for the largest FST. We are doing as good as before for large BytesStore now :)

FST built 1000000 times

min ProfileInfo{bytesUsed=10, estimateSize=3, pageBits=6, pageNum=1}
pct50 ProfileInfo{bytesUsed=347, estimateSize=292, pageBits=9, pageNum=1}
pct75 ProfileInfo{bytesUsed=21, estimateSize=4, pageBits=6, pageNum=1}
pct90 ProfileInfo{bytesUsed=22, estimateSize=8, pageBits=6, pageNum=1}
pct99 ProfileInfo{bytesUsed=37, estimateSize=37, pageBits=6, pageNum=1}
pct999 ProfileInfo{bytesUsed=71, estimateSize=61, pageBits=6, pageNum=2}
pct9999 ProfileInfo{bytesUsed=130, estimateSize=62, pageBits=6, pageNum=3}
max ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}

@gf2121 gf2121 requested a review from mikemccand October 3, 2023 16:51
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach! I had forgotten that BlockTree creates little baby FSTs all the way up the prefix tree, so it's creating (and discarding) many FSTs in order to build the final FST, making this change all the more important!

I just left a tiny comment about the CHANGES entry and a few "later" comments -- maybe open follow-on issues for those ideas?

@@ -163,6 +163,9 @@ Optimizations
* GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
descending score. (Adrien Grand)

* GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
to reducing GC load during indexing. (Guo Feng)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reducing -> reduce

@@ -490,10 +491,22 @@ public void compileIndex(
}
}

long estimateSize = prefix.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too bad we don't have a writer that uses tiny (like 8 bytes) block at first, but doubles size for each new block (16 bytes, 32 bytes next, etc.). Then we would naturally use log(size) number of blocks without over-allocating.

But then reading bytes is a bit tricky because we'd need to take discrete log (base 2) of the address. Maybe it wouldn't be so bad -- we could do this with Long.numberOfLeadingZeros maybe? But that's a bigger change ... we can do this separately/later.

@@ -490,10 +491,22 @@ public void compileIndex(
}
}

long estimateSize = prefix.length;
for (PendingBlock block : blocks) {
if (block.subIndices != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also should really explore the TODO above to write vLong in opposite byte order -- this might save quite a bit of storage in the FST since outputs would share more prefixes. Again, separate issue 😀

Copy link
Member

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

@jpountz
Copy link
Contributor

jpountz commented Oct 10, 2023

It looks like there's a bit less Young GC in nightly benchmarks since this change was merged, from 6-8 seconds, to consistently below 6s. I pushed an annotation.

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 10, 2023

@jpountz Thanks for annotating !

I also checked blunders.io for more details:

  • GC pause time: 6.38% -> 5.91%
  • Allocation Rate: 3.7 GiB/s -> 2.6 GiB/s
  • much more less FST#init in allocation flame graph :)

Before Patch https://blunders.io/jfr-demo/indexing-4kb-2023.10.03.18.03.47/allocations-drill-down
After Patch https://blunders.io/jfr-demo/indexing-4kb-2023.10.04.18.03.40/allocations-drill-down

s1monw pushed a commit to s1monw/lucene that referenced this pull request Oct 10, 2023
@gf2121 gf2121 added this to the 9.9.0 milestone Oct 18, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FST#Compiler allocates too much memory
4 participants