nutch-dev mailing list archives

From "Greg Padiasek (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1746) OutOfMemoryError in Mappers
Date Wed, 21 May 2014 02:56:41 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Padiasek updated NUTCH-1746:
---------------------------------

    Attachment: ObjectCache.patch

Indeed, after investigating further I found that the problem is in ObjectCache or, strictly speaking,
in how it is being used. It turns out that ObjectCache.get() is called with multiple copies
of Configuration. Each copy gets its own cache entry, which results in creating multiple copies
of the filters.

I was able to avoid the OutOfMemoryError in all mappers by changing ObjectCache to use
Configuration.toString() as the CACHE key instead of the Configuration object itself. Changing CACHE
into a single instance of ObjectCache (shared by all Configuration objects) also works, but in this
case the weak references are eliminated and the CACHE is never cleared. For that reason the first
approach might be better.

Further investigation might reveal why multiple Configuration instances are being passed to
ObjectCache, but for the time being I am using a modified ObjectCache (patch attached).
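
To illustrate the difference between the two cache keys, here is a minimal, self-contained sketch.
It is not the attached patch and does not use Nutch or Hadoop classes: DemoConf and CacheKeyDemo
are hypothetical names, and the sketch assumes a configuration type that does not override
equals()/hashCode() and whose copies produce the same toString() output. It also uses a plain
HashMap instead of a weak-reference map so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a configuration object (NOT the Hadoop class).
// It does not override equals()/hashCode(), so maps compare it by identity.
final class DemoConf {
    private final List<String> resources;

    DemoConf(List<String> resources) {
        this.resources = new ArrayList<>(resources);
    }

    // Returns a copy with the same content but a different identity.
    DemoConf copy() {
        return new DemoConf(resources);
    }

    @Override
    public String toString() {
        return "DemoConf: " + String.join(", ", resources);
    }
}

final class CacheKeyDemo {
    // Keyed by the object itself: every copy is a new key, so every copy
    // adds a new entry (a stand-in for a separately loaded filter set).
    static int entriesWithObjectKey(int copies) {
        Map<DemoConf, Object> cache = new HashMap<>();
        DemoConf base = new DemoConf(Arrays.asList("default.xml", "site.xml"));
        for (int i = 0; i < copies; i++) {
            cache.computeIfAbsent(base.copy(), k -> new Object());
        }
        return cache.size();
    }

    // Keyed by toString(): copies with equal content map to the same key,
    // so only the first lookup creates an entry.
    static int entriesWithStringKey(int copies) {
        Map<String, Object> cache = new HashMap<>();
        DemoConf base = new DemoConf(Arrays.asList("default.xml", "site.xml"));
        for (int i = 0; i < copies; i++) {
            cache.computeIfAbsent(base.copy().toString(), k -> new Object());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesWithObjectKey(3)); // prints 3
        System.out.println(entriesWithStringKey(3)); // prints 1
    }
}
```

One trade-off of the string key in this sketch: two configurations that differ in content but
produce the same string would share a cache entry.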


> OutOfMemoryError in Mappers
> ---------------------------
>
>                 Key: NUTCH-1746
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1746
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator, injector
>    Affects Versions: 1.7
>         Environment: Nutch running in local mode with 4M+ domains in domain-urlfilter.txt
>            Reporter: Greg Padiasek
>         Attachments: Generator.patch, Injector.patch, ObjectCache.patch, domain-urlfilter-aa,
>                      domain-urlfilter-ab, domain-urlfilter-ac
>
>
> Initially I found that Generator was throwing OutOfMemoryError no matter how much RAM I
> allocated to the JVM. I fixed the problem by moving URLFilters, URLNormalizers and
> ScoringFilters to the top-level class as singletons and re-using them in all Generator
> mapper instances.
> Then I found the same problem in Injector and applied an analogous fix.
> Now it seems that this issue may be common to all Nutch Mapper implementations.
> I was wondering if it would be possible to integrate this kind of change
> in the upstream code base and potentially update all vulnerable Mapper classes.
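
The singleton approach described in the quoted report could look roughly like the sketch below.
It is only an illustration under assumptions, not the attached Generator.patch or Injector.patch:
DemoFilters and DemoMapper are hypothetical names, and a real filter set would also need the
job configuration to be constructed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive filter set
// (for example, one built from a very large rule file).
final class DemoFilters {
    // Counts how many times the filter set was actually constructed.
    static final AtomicInteger LOADS = new AtomicInteger();

    private DemoFilters() {
        LOADS.incrementAndGet(); // the expensive load would happen here
    }

    // Initialization-on-demand holder: the JVM creates INSTANCE once,
    // on first access, in a thread-safe way.
    private static final class Holder {
        static final DemoFilters INSTANCE = new DemoFilters();
    }

    static DemoFilters shared() {
        return Holder.INSTANCE;
    }
}

// Hypothetical mapper: every instance reuses the shared filter set
// instead of building its own copy.
final class DemoMapper {
    private final DemoFilters filters = DemoFilters.shared();

    DemoFilters filters() {
        return filters;
    }
}
```

With this layout, creating any number of DemoMapper instances in one JVM constructs DemoFilters
only once.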



--
This message was sent by Atlassian JIRA
(v6.2#6252)
