nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Resolved] (NUTCH-314) Multiple language identifier instances
Date Sat, 12 Jan 2013 19:40:12 GMT


Lewis John McGibbney resolved NUTCH-314.

    Resolution: Won't Fix

Closing legacy issue.
> Multiple language identifier instances
> --------------------------------------
>                 Key: NUTCH-314
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS: Linux RHEL 4, JDK: 1.5_07
>            Reporter: Enrico Triolo
> In my application I often need to run the inject -> generate -> .. -> index loop multiple times, since users can 'suggest' new web pages to be crawled and indexed.
> I also need to enable the language identifier plugin.
> Everything seems to work correctly, but after some time I get an OutOfMemoryError. The elapsed time doesn't actually matter: I noticed that the problem arises when users submit many URLs (~100). As I said, a new loop is performed for each submitted URL (similar to the one in the Crawl.main method).
> Using a profiler (specifically, the NetBeans profiler) I found that a new LanguageIdentifier instance is created for each submitted URL and never released. With the memory inspector tool I can see as many instances of LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched pages, each of them occupying about 180 KB. Forcing garbage collection doesn't release much memory.
> Maybe we should cache its instance in the conf, as we do for many other objects in Nutch.
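
The caching idea suggested in the report can be sketched roughly as follows. This is only an illustration, not Nutch code: Config, ExpensiveIdentifier and IdentifierCache are hypothetical stand-ins for the configuration object, the LanguageIdentifier, and a per-configuration cache. The sketch assumes one shared instance per configuration object is enough, and it uses a WeakHashMap so that a cached instance can be collected once its configuration is no longer referenced.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for a configuration object (not Nutch's Configuration).
final class Config {}

// Hypothetical stand-in for an object that is expensive to build,
// such as a language identifier holding n-gram profiles.
// It deliberately keeps no reference to its Config, so the weak key can be collected.
final class ExpensiveIdentifier {
    static int created = 0; // counts constructions, for illustration only

    ExpensiveIdentifier() {
        created++;
    }
}

// Hypothetical cache: at most one ExpensiveIdentifier per Config instance.
final class IdentifierCache {
    // Weak keys: an entry becomes collectable when its Config is unreachable.
    private static final Map<Config, ExpensiveIdentifier> CACHE = new WeakHashMap<>();

    private IdentifierCache() {}

    // Returns the cached instance for conf, building it on first use.
    static synchronized ExpensiveIdentifier get(Config conf) {
        return CACHE.computeIfAbsent(conf, c -> new ExpensiveIdentifier());
    }
}

class CacheDemo {
    public static void main(String[] args) {
        Config conf = new Config();
        // Simulate many crawl loops that share one configuration.
        for (int i = 0; i < 100; i++) {
            IdentifierCache.get(conf);
        }
        System.out.println("instances created: " + ExpensiveIdentifier.created);
    }
}
```

With a cache of this kind, repeated loops that share one configuration would reuse a single instance instead of allocating a new one per submitted URL. Whether that fits the actual plugin lifecycle in Nutch is an assumption this sketch does not verify.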

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators