lucene-java-user mailing list archives

From Vincent Sevel
Subject large set of memory consumed by array init
Date Thu, 15 Dec 2016 17:14:16 GMT

I have seen unexpected behavior when setting a very high limit in a search.
I index log files in my system. Each week I create a new index. At the end of the week the
index is around 35 GB.
When I do a search with no date, I create a MultiReader built out of the readers from
the weekly indexes (around 15 indexes).
I sort with Sort.INDEXORDER by default.
When running a search with a high limit (e.g. 1 million), I sometimes ran out of
memory because of the empty data structures that were initialized.
I looked at the objects, and in the CollectorManager.reduce(Collection<TopFieldCollector>
collectors) of IndexSearcher.searchAfter(FieldDoc after, Query query, int numHits, Sort sort,
boolean doDocScores, boolean doMaxScore), I ended up with 137 collectors, each defining a
HitQueue with:

- oneComparator.docIDs[1 million]
- heap[1 million]

So that is 8 bytes * 1 million = around 8 MB per collector.
Since all 137 collectors had been initialized the same way (with arrays of 1 million elements),
I ended up with about 1 GB of RAM used for that search.
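
To make the arithmetic above explicit, here is a small sketch in plain Java. The collector
count, the limit and the 8 bytes per slot are the figures quoted in this message; the class
and method names are made up for illustration and are not part of Lucene.

```java
public class CollectorMemoryEstimate {

    // Rough estimate of the bytes preallocated across all collectors for one search.
    // bytesPerSlot is an assumption taken from the message (8 bytes per queue slot).
    static long estimateBytes(int collectors, int numHits, int bytesPerSlot) {
        // Cast first so the multiplication is done in long and cannot overflow int.
        return (long) collectors * numHits * bytesPerSlot;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(137, 1_000_000, 8);
        System.out.println("Estimated preallocation: " + bytes + " bytes");
    }
}
```

The estimate ignores object headers and anything else the collectors hold, so it is a lower
bound on what the search actually retains.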
What is strange to me is that no collector had 1 million hits, because the 5 million log events
that matched were spread across the different weekly indexes.
So it seemed quite a waste of space to initialize all of these arrays with slots that would
not be used.
So the only thing I could think of was to get rid of the MultiReader, manage the search myself
on the different subreaders, and adjust the limit to be the minimum of the query's hit count
on each index and the limit passed by the user (similar to the way cappedNumHits gets
calculated). That way, a user passing a very big limit would not be able to consume so much
memory.
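
The capping idea described above could be sketched as follows. This is plain Java with no
Lucene dependency: the match counts are passed in as numbers, on the assumption that they
would come from a per-index count of the query (for example via IndexSearcher.count(Query)).
The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CappedLimits {

    // Limit to use for one index: never more than the user asked for,
    // never more than the index can return, and at least 1 so that a
    // collector can still be created for an index with no matches.
    static int cappedLimit(int userLimit, int matchCount) {
        return Math.max(1, Math.min(userLimit, matchCount));
    }

    // Applies the cap to each weekly index in turn.
    static List<Integer> cappedLimits(int userLimit, int[] matchCountsPerIndex) {
        List<Integer> limits = new ArrayList<>();
        for (int matchCount : matchCountsPerIndex) {
            limits.add(cappedLimit(userLimit, matchCount));
        }
        return limits;
    }

    public static void main(String[] args) {
        // Hypothetical per-index match counts, for illustration only.
        int[] counts = {300_000, 0, 2_500_000};
        System.out.println(cappedLimits(1_000_000, counts));
    }
}
```

The cost of this approach is an extra counting pass per index before the real search, which
has to be weighed against the memory saved.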
I guess it makes sense to preallocate data structures to be more efficient on garbage
collection by avoiding growing arrays and lists, but I must admit that I did not expect that the
limit parameter could have such an impact on memory. In this situation, I would rather have
those arrays start small and grow based on need.
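
To illustrate what "start small and grow based on need" could look like, here is a minimal
sketch of a doc ID buffer that begins with a small array and doubles it until it reaches the
limit. It is not how Lucene's collectors or priority queues are implemented; it only covers
the simple case where the first hits in index order are kept, and all names are hypothetical.

```java
import java.util.Arrays;

public class GrowableDocIdBuffer {
    private int[] docIds;
    private int size;
    private final int limit;

    GrowableDocIdBuffer(int limit, int initialCapacity) {
        if (limit < 1 || initialCapacity < 1) {
            throw new IllegalArgumentException("limit and initialCapacity must be >= 1");
        }
        this.limit = limit;
        // Never allocate more than the limit, even for the first array.
        this.docIds = new int[Math.min(limit, initialCapacity)];
    }

    // Stores a doc ID; returns false once the limit has been reached.
    boolean add(int docId) {
        if (size == limit) {
            return false;
        }
        if (size == docIds.length) {
            // Double the capacity, but never beyond the limit.
            int newCapacity = (int) Math.min((long) docIds.length * 2, limit);
            docIds = Arrays.copyOf(docIds, newCapacity);
        }
        docIds[size++] = docId;
        return true;
    }

    int size() {
        return size;
    }

    int capacity() {
        return docIds.length;
    }
}
```

With a limit of 1 million and an initial capacity of 16, a buffer that only receives 100 hits
never grows beyond a few hundred bytes, at the price of a handful of array copies.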
As a side effect I would have to get rid of the MultiReader, which is a nice abstraction
and simplifies my code. I would rather not, but I want to be very careful about memory consumption,
and it always looks bad when a user can cause an OOM on a server with just a query, even
if they are passing an abnormally high limit.

What are your recommendations?
I am using Lucene 6.2.1.
