nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Nutch 6.1 running issu
Date Sat, 10 Sep 2005 20:15:29 GMT
Michael Ji wrote:
> "
> FetchListEntry value = new FetchListEntry();
> Page page = (Page)value.getPage().clone();
> "
> 
> It seems value is an empty FetchListEntry instance. Will
> that cause the clone of getPage() to fail because it is null?

Please try to replace this logic with the following:

                 FetchListEntry value = new FetchListEntry();
                 while (topN > 0 && reader.next(key, value)) {
                   Page page = value.getPage();
                   if (page != null) {
                     Page p = new Page();
                     p.set(page);
                     page = p;
                   }
                   if (forceRefetch) {
                     Page p = value.getPage();
                     // reset fetchTime and MD5, so that the content will
                     // always be new and unique.
                     p.setNextFetchTime(0L);
                     p.setMD5(MD5Hash.digest(p.getURL().toString()));
                   }
                   tables.append(value);
                   topN--;
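
For anyone who wants to try the null-guarded copy without a Nutch checkout, here is a minimal sketch with stand-in classes. `DemoPage`, `DemoEntry` and `PageCopyDemo` are hypothetical names invented for illustration, not the real `Page` and `FetchListEntry`. The sketch also assumes that a freshly constructed entry returns null from `getPage()`, as the question above suggests; that behaviour is not confirmed here.

```java
// Hypothetical stand-ins; NOT the real Nutch Page / FetchListEntry classes.
class DemoPage {
    String url;
    long nextFetchTime;

    // Copy all fields from another page (mirrors the p.set(page) idea).
    void set(DemoPage other) {
        this.url = other.url;
        this.nextFetchTime = other.nextFetchTime;
    }
}

class DemoEntry {
    private DemoPage page; // assumed to stay null until the entry is populated

    DemoPage getPage() { return page; }
    void setPage(DemoPage page) { this.page = page; }
}

class PageCopyDemo {
    // Null-guarded copy: returns an independent copy, or null if the entry has no page.
    static DemoPage copyPage(DemoEntry entry) {
        DemoPage page = entry.getPage();
        if (page == null) {
            return null;
        }
        DemoPage copy = new DemoPage();
        copy.set(page);
        return copy;
    }

    public static void main(String[] args) {
        DemoEntry empty = new DemoEntry();
        try {
            // Dereferencing getPage() on an unpopulated entry fails.
            empty.getPage().set(new DemoPage());
        } catch (NullPointerException e) {
            System.out.println("unguarded access threw NullPointerException");
        }
        System.out.println("guarded copy of empty entry: " + copyPage(empty));

        DemoPage original = new DemoPage();
        original.url = "http://example.com/";
        original.nextFetchTime = 42L;
        DemoEntry filled = new DemoEntry();
        filled.setPage(original);

        DemoPage copy = copyPage(filled);
        copy.nextFetchTime = 0L; // changing the copy leaves the original untouched
        System.out.println("original nextFetchTime: " + original.nextFetchTime);
    }
}
```

The guard matters only for entries whose page has not been read yet; once `reader.next(key, value)` has filled the entry, the copy keeps later modifications from leaking back into the reused instance.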


This patchset still needs a lot of thought and work. Even the part that
avoids re-fetching unmodified content needs more thought: it is easy to
end up in a state where Nutch cannot be forced to re-fetch a page,
because on every attempt the page is reported as unmodified, even though
you need to re-fetch the actual data, e.g. because you lost that
segment's data.
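
The stuck state can be illustrated with a toy model. Everything in this sketch is hypothetical: `RefetchDemo`, its maps and its `fetch` method are invented for illustration and are not Nutch APIs. The "forget the stored signature" step is a simplified stand-in for resetting the fetch time and MD5 in the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "skip unmodified content"; none of these names are Nutch APIs.
class RefetchDemo {
    // url -> signature of the content seen at the last fetch (stand-in for the DB record)
    final Map<String, String> lastSignature = new HashMap<>();
    // url -> content actually stored (stand-in for segment data, which can be lost)
    final Map<String, String> segmentData = new HashMap<>();

    // Fetch one URL; returns true if content was stored, false if skipped as unmodified.
    boolean fetch(String url, String remoteContent, boolean forceRefetch) {
        if (forceRefetch) {
            // Forget the stored signature so the content is treated as new.
            lastSignature.remove(url);
        }
        String signature = Integer.toHexString(remoteContent.hashCode());
        if (signature.equals(lastSignature.get(url))) {
            return false; // unmodified: nothing is written to the segment
        }
        lastSignature.put(url, signature);
        segmentData.put(url, remoteContent);
        return true;
    }

    public static void main(String[] args) {
        RefetchDemo demo = new RefetchDemo();
        String url = "http://example.com/";
        demo.fetch(url, "hello", false);  // first fetch stores the content
        demo.segmentData.clear();         // simulate losing the segment
        System.out.println("plain refetch stored data: " + demo.fetch(url, "hello", false));
        System.out.println("forced refetch stored data: " + demo.fetch(url, "hello", true));
    }
}
```

In this model the plain refetch keeps skipping the page because the remembered signature still matches, so the lost data never comes back until the remembered state is reset.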

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

