nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (NUTCH-708) NutchBean: OOM due to searcher.max.hits and dedup.
Date Fri, 01 Apr 2011 14:37:07 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-708.
-------------------------------

    Resolution: Won't Fix

> NutchBean: OOM due to searcher.max.hits and dedup.
> --------------------------------------------------
>
>                 Key: NUTCH-708
>                 URL: https://issues.apache.org/jira/browse/NUTCH-708
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Linux, Java 5.
>            Reporter: Aaron Binns
>
> When searching an index we built for the National Archives (this one in particular: http://webharvest.gov/collections/congress110th/), we ran into an interesting situation.
> We were using searcher.max.hits=1000 in order to get faster searches.  Since our index is sorted, the "best" documents are "at the front", and setting searcher.max.hits=1000 would give us a nice trade-off of search quality vs. response time.
> What I discovered was that with dedup (on site) enabled, we would get into this loop where searcher.max.hits would limit the raw hits to 1000 and the deduplication code would get to the end of those 1000 results and still need more, as it hadn't found enough de-dup'd results to satisfy the query.
> The first 6 pages of results would be fine, but when we got to page 7, the NutchBean would need more than 1000 raw results in order to get 60 de-duped results.
> The code:
>     for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
>       // get the next raw hit
>       if (rawHitNum >= hits.getLength()) {
>         // optimize query by prohibiting more matches on some excluded values
>         Query optQuery = (Query)query.clone();
>         for (int i = 0; i < excludedValues.size(); i++) {
>           if (i == MAX_PROHIBITED_TERMS)
>             break;
>           optQuery.addProhibitedTerm((String)excludedValues.get(i), dedupField);
>         }
>         numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
>         }
>         hits = searcher.search(optQuery, numHitsRaw, dedupField, sortField, reverse);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("found "+hits.getTotal()+" raw hits");
>         }
>         rawHitNum = -1;
>         continue;
>       }
> The loop's exit condition was never satisfied, as rawHitNum and hits.getLength() are capped by searcher.max.hits (1000).  Meanwhile numHitsRaw keeps increasing by a factor of 2 (rawHitsFactor) until it gets to 2^31 or so; deep down in the search library code an array is allocated using that value as the size, and you get an OOM.
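> To make the failure mode concrete, here is a minimal standalone sketch of the runaway doubling (a hypothetical class, not the actual Nutch internals), assuming the initial request of 1000 and the rawHitsFactor of 2 described above:
>     public class RunawayDoubling {
>       public static void main(String[] args) {
>         int numHitsRaw = 1000;            // initial request, equal to searcher.max.hits
>         final float rawHitsFactor = 2.0f; // growth factor per re-search
>         for (int pass = 1; pass <= 25; pass++) {
>           // The cap means each re-search still returns at most 1000 hits,
>           // so the request size keeps doubling instead of the loop terminating:
>           numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>           System.out.println("pass " + pass + ": requesting " + numHitsRaw + " raw hits");
>         }
>         // Around pass 22 the float-to-int cast saturates at Integer.MAX_VALUE
>         // (2^31 - 1); allocating an array of that size deep in the search
>         // library is what produces the OutOfMemoryError.
>       }
>     }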
> We worked around the problem by abandoning the use of searcher.max.hits.  I suppose we could have increased the value, but the index was small enough (~10GB) that disabling searcher.max.hits didn't degrade the response time too much.
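> For reference, a sketch of what that nutch-site.xml override would look like, assuming the nutch-default.xml convention that a non-positive value (-1) disables the limit:
>     <property>
>       <name>searcher.max.hits</name>
>       <value>-1</value>
>       <!-- -1 (the nutch-default.xml default) means no cap: every raw hit
>            can be scanned, so the dedup re-search doubling never starts. -->
>     </property>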

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
