lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Memory Leak in 7.3 to 7.4
Date Thu, 02 Aug 2018 15:53:42 GMT
Thomas:

You've obviously done a lot of work to track this, but maybe you can
do even more ;).

Here's a link to a program that uses Tika to parse docs _on the client_:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

If you take out all the DB and Solr parts, you're left with something
that just parses docs with Tika. My idea here is to feed it your docs
and see whether the memory differences you noticed show up between the
versions of Tika. If they do, a heap dump would help the Tika folks
enormously in tracking this down.
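
A minimal sketch of what such a stripped-down harness might look like.
The class name TikaHeapProbe, the DocParser hook, and the sampling
interval are all made up for illustration and are not taken from the
linked article. The placeholder parser only reads bytes so the sketch runs
without Tika on the classpath; the comment in main() shows roughly where
a real Tika call would go.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TikaHeapProbe {

    /** Hypothetical hook: whatever parses a single file. */
    interface DocParser {
        void parse(Path file) throws Exception;
    }

    /** Used heap in bytes after requesting a collection (System.gc() is only a hint). */
    static long usedHeapAfterGc() {
        System.gc();
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Parses every regular file under root and samples heap use every sampleEvery files. */
    static List<Long> run(Path root, DocParser parser, int sampleEvery) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Long> samples = new ArrayList<>();
        int done = 0;
        for (Path file : files) {
            try {
                parser.parse(file);
            } catch (Exception e) {
                // A bad document should not end the run; log it and move on.
                System.err.println("skipped " + file + ": " + e);
            }
            done++;
            if (done % sampleEvery == 0) {
                long used = usedHeapAfterGc();
                samples.add(used);
                System.out.printf("%d docs, %d MB used after GC%n", done, used / (1024 * 1024));
            }
        }
        return samples;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder parser so the sketch runs without Tika: it only reads the bytes.
        // With the Tika parsers on the classpath, the body would instead be roughly:
        //   try (InputStream in = Files.newInputStream(file)) {
        //       new AutoDetectParser().parse(in, new BodyContentHandler(-1),
        //                                    new Metadata(), new ParseContext());
        //   }
        DocParser parser = file -> Files.readAllBytes(file);
        run(root, parser, 100);
    }
}
```

Run it once per Tika version against the same directory, with the same
-Xmx setting, and compare the printed numbers. If the used-after-GC figure
keeps climbing with one version and stays flat with the other, something
like jmap -dump:live,format=b,file=heap.hprof <pid> against the running
process is one way to capture a heap dump.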

And if there's no memory creep, that points toward the glue code in Solr.

I also have to add that this kind of thing is one of the reasons we
generally recommend that production systems do not use
ExtractingRequestHandler. There are other reasons outlined in the link
above.

Best,
Erick

On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
<thomas.scheffler@uni-jena.de> wrote:
> Hi,
>
> My final verdict is that the upgrade to Tika 1.17 is the cause. If I
> downgrade just the Tika libraries back to 1.16 and keep the rest of
> Solr 7.4.0, the heap usage after about 85% of the indexing run,
> following a manually triggered garbage collection, is about 60-70 MB
> (that low!).
>
> My problem now is that we have several setups that trigger this
> reliably, but there is no simple test case that "fails" if Tika 1.17 or
> 1.18 is used. I also do not know whether the error is inside Tika or
> inside the glue code that makes Tika usable in Solr.
>
> Should I file an issue for this?
>
> kind regards,
>
> Thomas
>
>
>> On 02.08.2018 at 12:06, Thomas Scheffler <thomas.scheffler@uni-jena.de> wrote:
>>
>> Hi,
>>
>> We noticed a memory leak in a rather small setup: 40,000 metadata
>> documents and nearly as many files that carry "literal.*" fields.
>> While 7.2.1 brought some Tika issues (due to a beta version), the real
>> problems started to appear with version 7.3.0 and are still unresolved
>> in 7.4.0. Memory consumption is through the roof. Where previously a
>> 512 MB heap was enough, now 6 GB isn't enough to index all files.
>> I am now at a point where I can track this down to the libraries in
>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the
>> libraries shipped with 7.2.1, the problem disappears. As most files are
>> PDF documents, I tried updating pdfbox to 2.0.11 and tika to 1.18, which
>> did not solve the problem. Next I will try to downgrade these
>> individual libraries back to 2.0.6 and 1.16 to see whether they are the
>> source of the memory leak.
>>
>> In the meantime, I would like to know whether anybody else has experienced the same problems.
>>
>> kind regards,
>>
>> Thomas
>
>
