lucene-solr-user mailing list archives

From Vincenzo D'Amore <v.dam...@gmail.com>
Subject Re: Memory Leak in 7.3 to 7.4
Date Thu, 02 Aug 2018 17:21:07 GMT
Does this script also save a memory dump of the JVM?

Ciao,
Vincenzo

--
mobile: 3498513251
skype: free.dev

> On 2 Aug 2018, at 17:53, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Thomas:
> 
> You've obviously done a lot of work to track this, but maybe you can
> do even more ;).
> 
> Here's a link to a program that uses Tika to parse docs _on the client_:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> If you take out all the DB and Solr parts, you're left with something
> that just parses docs with Tika. My idea here is to feed it your docs
> and see if there are noticeable memory differences between the
> versions of Tika. If there are, a heap dump would help the Tika folks
> enormously in tracking this down.
> 
> And if there's no memory creep, that points toward the glue code in Solr.
> 
> I also have to add that this kind of thing is one of the reasons we
> generally recommend that production systems do not use
> ExtractingRequestHandler. There are other reasons outlined in the link
> above....
> 
> Best,
> Erick
> 
> On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
> <thomas.scheffler@uni-jena.de> wrote:
>> Hi,
>> 
>> my final verdict is the upgrade to Tika 1.17. If I downgrade just the
>> Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
>> usage after about 85% of the indexing process and a manual trigger of
>> the garbage collector is about 60-70 MB (that low!).
>> 
>> My problem now is that we have several setups that trigger this
>> reliably, but there is no simple test case that "fails" if Tika 1.17 or
>> 1.18 is used. I also do not know whether the error is inside Tika or
>> inside the glue code that makes Tika usable in Solr.
>> 
>> Should I file an issue for this?
>> 
>> kind regards,
>> 
>> Thomas
>> 
>> 
>>> On 2 Aug 2018, at 12:06, Thomas Scheffler <thomas.scheffler@uni-jena.de> wrote:
>>> 
>>> Hi,
>>> 
>>> we noticed a memory leak in a rather small setup: 40,000 metadata
>>> documents with nearly as many files that have "literal.*" fields with
>>> them. While 7.2.1 brought some Tika issues (due to a beta version), the
>>> real problems started to appear with version 7.3.0 and are still
>>> unresolved in 7.4.0. Memory consumption is through the roof: where a
>>> 512 MB heap was previously enough, now 6 GB isn't enough to index all
>>> files.
>>> I am now at a point where I can track this down to the libraries in
>>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the
>>> libraries shipped with 7.2.1, the problem disappears. As most files are
>>> PDF documents, I tried updating pdfbox to 2.0.11 and tika to 1.18, which
>>> did not solve the problem. Next I will try to downgrade these individual
>>> libraries back to 2.0.6 and 1.16 to see if they are the source of the
>>> memory leak.
>>> 
>>> In the meantime I would like to know if anybody else has experienced the same problem.
>>> 
>>> kind regards,
>>> 
>>> Thomas
>> 
>> 
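
The experiment discussed in this thread (parse the documents outside Solr, trigger the garbage collector, and compare heap usage between library versions) could be sketched roughly as below. This is only an illustration under stated assumptions, not the program from the linked article: the class HeapCreepProbe and its methods are hypothetical names, the parse step is a pluggable placeholder where a Tika call would go, and the heap dump relies on the HotSpot-specific diagnostic MXBean, so it assumes a HotSpot-based JVM.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.sun.management.HotSpotDiagnosticMXBean;

/** Hypothetical harness; all names here are illustrative assumptions. */
final class HeapCreepProbe {

    /** Used heap in bytes after requesting a GC (System.gc() is only a hint). */
    static long usedHeapAfterGc() {
        System.gc();
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /**
     * Passes every regular file under root to parseOne and samples the heap
     * once before the run, after each full batch, and once at the end.
     * parseOne is the placeholder for the parser under test.
     */
    static List<Long> run(Path root, Consumer<Path> parseOne, int batchSize)
            throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted()
                        .collect(Collectors.toList());
        }
        List<Long> samples = new ArrayList<>();
        samples.add(usedHeapAfterGc()); // baseline before any parsing
        int seen = 0;
        for (Path file : files) {
            parseOne.accept(file);
            if (++seen % batchSize == 0) {
                samples.add(usedHeapAfterGc());
            }
        }
        samples.add(usedHeapAfterGc()); // final sample after the last file
        return samples;
    }

    /** Writes a heap dump of live objects; the target file must not exist yet. */
    static void dumpHeap(Path target) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toAbsolutePath().toString(), true);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: HeapCreepProbe <directory-with-documents>");
            return;
        }
        List<Long> samples = run(Paths.get(args[0]), file -> {
            try {
                Files.readAllBytes(file); // stand-in only; call the parser under test here
            } catch (IOException e) {
                System.err.println("skipped " + file + ": " + e.getMessage());
            }
        }, 100);
        for (long bytes : samples) {
            System.out.println(bytes / (1024 * 1024) + " MB used after GC");
        }
        dumpHeap(Paths.get("after-run.hprof")); // hypothetical output file name
    }
}
```

Running the same harness once per library version and comparing the sampled values would show whether heap usage keeps climbing with one version but stays flat with the other. The resulting .hprof file can then be opened in a heap analyzer.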
