lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6383) MemoryPostings fst encoding can be surprisingly inefficient (especially in tests, with payloads)
Date Wed, 01 Apr 2015 07:25:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390135#comment-14390135 ]

Adrien Grand commented on LUCENE-6383:
--------------------------------------

bq. We should also look into why Adrien Grand's test for "things getting bigger on merge"
(BaseIndexFileFormatTestCase.testMergeStability) doesn't find this.

From your description of the problem, it looks to me like payloads are encoded inefficiently,
not that they wrongly accumulate upon merging (which is what the test checks). We added
this test when we found a bug in a codec that kept copying the codec footer when merging,
so that after N merges some segment files would have N codec footers (with only the last
one containing the right checksum). The issue looks different here?
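
To make the distinction concrete, here is a toy model of the footer bug and of the
kind of size check a merge-stability test performs. It is only a sketch: the class
name, the 8-byte CRC32 "footer" and the helper methods are hypothetical stand-ins,
not Lucene's codec footer format or the actual test code.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public class MergeStabilitySketch {
    // Hypothetical toy footer: the CRC32 of the preceding bytes, stored as 8 bytes.
    static final int FOOTER_LEN = 8;

    // Returns a copy of content with a toy checksum footer appended.
    static byte[] withFooter(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content, 0, content.length);
        long value = crc.getValue();
        byte[] out = Arrays.copyOf(content, content.length + FOOTER_LEN);
        for (int i = 0; i < FOOTER_LEN; i++) {
            out[content.length + i] = (byte) (value >>> (8 * (FOOTER_LEN - 1 - i)));
        }
        return out;
    }

    // Buggy merge: copies the whole file, old footer included, then adds a new footer.
    static byte[] buggyMerge(byte[] file) {
        return withFooter(file);
    }

    // Correct merge: drops the old footer before writing the new one.
    static byte[] correctMerge(byte[] file) {
        return withFooter(Arrays.copyOf(file, file.length - FOOTER_LEN));
    }

    // Size of a 4-byte toy segment file after the given number of merges.
    static int sizeAfterMerges(int merges, boolean buggy) {
        byte[] file = withFooter(new byte[] {1, 2, 3, 4});
        for (int i = 0; i < merges; i++) {
            file = buggy ? buggyMerge(file) : correctMerge(file);
        }
        return file.length;
    }

    public static void main(String[] args) {
        System.out.println("correct merge, 3 rounds: " + sizeAfterMerges(3, false) + " bytes");
        System.out.println("buggy merge, 3 rounds:   " + sizeAfterMerges(3, true) + " bytes");
    }
}
```

In this model each buggy merge adds one more footer, so the file keeps growing across
merges. An encoding that is merely inefficient would not show that pattern, and a check
of this kind would not flag it.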

> MemoryPostings fst encoding can be surprisingly inefficient (especially in tests, with payloads)
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6383
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6383
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> I just worked around this in 2 nightly OOM fails.
> One was TestDuelingCodecs, the other was TestIndexWriterForceMerge's space usage test.
> In general the trend is the same, it seems the more documents you merge, you just get
> bigger and bigger FST outputs and the size of this PF in ram and on disk grows in a way you
> don't expect. E.g. merging 300KB of segments resulted in 450KB single segment, and memory
> usage gets absurdly high.
> The issue seems especially aggravated in tests, when MockAnalyzer adds lots of payloads.
> Maybe it should encode the postings data in a more efficient way? Can it just use a Long
> output pointing into a RAMFile or something? Or maybe there is just a crazy bug?
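
One way to read the "Long output pointing into a RAMFile" idea in the description above
is sketched below. It is an illustration only and uses no Lucene classes: a TreeMap
stands in for the FST, a ByteArrayOutputStream stands in for the RAMFile, and the
class name, method names and length-prefixed layout are assumptions made for the
example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.TreeMap;

public class OffsetOutputSketch {
    // Stand-in for the RAMFile: all postings bytes live in one side buffer.
    private final ByteArrayOutputStream blob = new ByteArrayOutputStream();
    // Stand-in for the FST: each term maps to a single long offset into the buffer.
    private final TreeMap<String, Long> offsets = new TreeMap<>();

    // Appends a length-prefixed postings blob and records where it starts.
    public void add(String term, byte[] postings) {
        long offset = blob.size();
        byte[] lengthPrefix = ByteBuffer.allocate(4).putInt(postings.length).array();
        blob.write(lengthPrefix, 0, lengthPrefix.length);
        blob.write(postings, 0, postings.length);
        offsets.put(term, offset);
    }

    // Returns the recorded offset for a term, or -1 if the term is unknown.
    public long offsetOf(String term) {
        Long offset = offsets.get(term);
        return offset == null ? -1L : offset;
    }

    // Follows the offset into the side buffer and reads the postings bytes back.
    public byte[] get(String term) {
        long offset = offsetOf(term);
        if (offset < 0) {
            return null;
        }
        byte[] all = blob.toByteArray();
        int start = Math.toIntExact(offset);
        int length = ByteBuffer.wrap(all).getInt(start);
        return Arrays.copyOfRange(all, start + 4, start + 4 + length);
    }

    public static void main(String[] args) {
        OffsetOutputSketch sketch = new OffsetOutputSketch();
        sketch.add("apple", new byte[] {1, 2, 3});
        sketch.add("banana", new byte[] {9});
        System.out.println("offset of banana: " + sketch.offsetOf("banana"));
        System.out.println("postings of banana: " + Arrays.toString(sketch.get("banana")));
    }
}
```

The point of the layout is that the per-term output stays a fixed-size number no matter
how large the postings and payloads are. Whether this would actually help
MemoryPostings is not established here.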


