nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <>
Subject [jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
Date Mon, 21 Aug 2006 22:06:15 GMT
Chris A. Mattmann commented on NUTCH-356:

-1 for closing this issue.

If there is a demonstrable memory leak in the plugin system, then I think it should be remedied.
I haven't run your test code, Enrico, nor experienced your problem before, but it would seem
that this issue is worth investigating. 

> Plugin repository cache can lead to memory leak
> -----------------------------------------------
>                 Key: NUTCH-356
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Enrico Triolo
>         Attachments:, patch.txt
> While I was trying to solve a problem I reported a while ago (see Nutch-314), I found
out that actually the problem was related to the plugin cache used in class
> As I said in Nutch-314, I think I am somehow 'forcing' the way nutch is meant to work, since
I need to frequently submit new urls and append their contents to the index; I don't (and
I can't) have an urls.txt file with all urls I'm going to fetch, but I recreate it each time
a new url is submitted.
> Thus, I think that in most cases you won't have problems using nutch as-is, since
the problem I found occurs only if nutch is used in a way similar to the one I use.
> To simplify your test I'm attaching a class that performs something similar to what I
need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample
url list empty, so you should modify the source code and add some urls.
> Then you only have to run it and watch your memory consumption with top. In my experience
I get an OutOfMemoryError after a couple of minutes, but it clearly depends on your heap
settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it never gets
released. It seems that some class maintains a reference to it and this class is never released
since it is cached somewhere in the configuration.
> So I modified the PluginRepository's 'get' method so that it never uses the cache and
always returns a new instance (see the attached patch). This way the memory
consumption is always stable and I get no OOM anymore.
> Clearly this is not the solution, since I guess there are many performance issues involved,
but for the moment it works.
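
To make the pattern in the quoted description easier to discuss, here is a minimal sketch of how a per-configuration cache can grow without bound when a new configuration object is created for every submitted url. It is not the Nutch code: Config, Repository, StrongCache, WeakCache and getUncached are hypothetical stand-ins invented for illustration, and the weak-key variant is only one possible direction, not the attached patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class RepositoryCacheSketch {

    /** Hypothetical stand-in for a per-job configuration object (identity-based equality). */
    static final class Config {
        final String name;
        Config(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for an expensive-to-build plugin repository. */
    static final class Repository {
        // Simulates the memory held by loaded plugins.
        final byte[] payload = new byte[1024];
        // Deliberately keeps only the name, not the Config itself: a value that
        // strongly references its key would keep a weak-keyed entry alive forever.
        final String owner;
        Repository(String owner) { this.owner = owner; }
    }

    /** Strong keys: every Config ever passed in stays reachable through the map. */
    static final class StrongCache {
        private final Map<Config, Repository> cache = new HashMap<>();
        synchronized Repository get(Config conf) {
            return cache.computeIfAbsent(conf, c -> new Repository(c.name));
        }
        synchronized int size() { return cache.size(); }
    }

    /** Weak keys: an entry becomes collectable once its Config is otherwise unreachable. */
    static final class WeakCache {
        private final Map<Config, Repository> cache = new WeakHashMap<>();
        synchronized Repository get(Config conf) {
            return cache.computeIfAbsent(conf, c -> new Repository(c.name));
        }
    }

    /** The workaround described in the issue: bypass the cache and build a new instance. */
    static Repository getUncached(Config conf) {
        return new Repository(conf.name);
    }

    public static void main(String[] args) {
        StrongCache strong = new StrongCache();
        for (int i = 0; i < 1000; i++) {
            // A fresh Config per submitted url, as in the usage pattern described above.
            strong.get(new Config("job-" + i));
        }
        // prints: strong cache entries: 1000
        System.out.println("strong cache entries: " + strong.size());
    }
}
```

In this sketch the strong cache retains one Repository per Config even after the caller has dropped the Config, which matches the growth Enrico describes. The uncached variant avoids that at the cost of rebuilding the repository on every call. The weak-key variant keeps the cache hit for a live Config while allowing unused entries to be collected, provided nothing reachable from the cached value points back at the key.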
