nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: Nutch 6.1 running issue
Date Mon, 12 Sep 2005 06:41:46 GMT
Michael Ji wrote:
> Hi Andrzej,
> 
> Thanks for your correction. The patch compiled
> successfully and runs well in Nutch 0.7.
> 
> Just a question out of curiosity:
> 
> As stated in nutch 61:
> "...if content is unmodified it doesn't have to be
> fetched and processed..."
> 
> I tested refetching a page whose content had not been
> modified, and Nutch 6.1 DID parse this page into
> content/, parse_data/, and parse_text/.
> 

Are you sure the plugin retrieved the page content once again from the 
server? Because I use "If-Modified-Since", which means that if the 
content is unmodified the server should NOT send the page once again, 
just a status 304.
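
As a minimal sketch of what such a conditional GET looks like from the client side, assuming plain java.net.HttpURLConnection rather than the Nutch protocol plugin (the class name, method names, and the lastFetchMillis parameter are hypothetical):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration; this is not the Nutch protocol plugin code.
final class ConditionalGetSketch {

    /** True when the server reports the page as unchanged (HTTP 304). */
    public static boolean isNotModified(int httpStatus) {
        return httpStatus == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    /**
     * Issues a GET carrying an If-Modified-Since header and returns the
     * HTTP status code. lastFetchMillis is assumed to be the time of the
     * previous successful fetch, in milliseconds since the epoch.
     */
    public static int conditionalGet(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // Adds the If-Modified-Since request header.
        conn.setIfModifiedSince(lastFetchMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

If the server honours the header and the page is unchanged, conditionalGet() should return 304 with no body; a server that ignores the header will simply answer 200 with the full content.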

> I took a look at the code:
> 
> In Fetcher.java, 
> "
> ProtocolOutput output =
> protocol.getProtocolOutput(fle);
> ProtocolStatus pstat = output.getStatus();
> :
> switch ( pstat ) {
> :
> :
>     case ProtocolStatus.NOTMODIFIED:
>         handleFetch(fle, output);
>         break;
> :
> :
> }
> "
> 
> Should we just do nothing in case of NOTMODIFIED,
> which is the flag set when content.MD5 = page.MD5 in
> protocol.http.java?
> 

We can't do nothing - we need to report the status. Even when we report 
an error, an additional record is written to segments...

> handleFetch() actually parses the page and writes the
> output data structures to segments/.

Yes, that's correct - this was a conscious decision. The reason is that 
the server may return other interesting information in headers, which 
some of the parsing plugins or FetchSchedule implementations may need.
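
A minimal sketch of that decision, assuming hypothetical stand-in types (SegmentSketch, FetchRecord, Status) that are not the actual Nutch classes: a record is written for every outcome, and for NOTMODIFIED it carries the status and the response headers. Storing an empty body for anything other than SUCCESS is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; not the actual Nutch classes.
final class SegmentSketch {

    enum Status { SUCCESS, NOTMODIFIED, FAILED }

    /** One record per fetch attempt: status, headers, and (possibly empty) body. */
    static final class FetchRecord {
        final String url;
        final Status status;
        final Map<String, String> headers;
        final byte[] content;

        FetchRecord(String url, Status status, Map<String, String> headers, byte[] content) {
            this.url = url;
            this.status = status;
            this.headers = new HashMap<>(headers); // defensive copy
            this.content = content;
        }
    }

    private final List<FetchRecord> records = new ArrayList<>();

    /**
     * Always writes a record, so the status is reported even when nothing
     * new was downloaded. Headers are kept for every status because later
     * stages may want them; the body is kept only for SUCCESS (assumption).
     */
    public void handleFetch(String url, Status status,
                            Map<String, String> headers, byte[] content) {
        byte[] body = (status == Status.SUCCESS && content != null) ? content : new byte[0];
        records.add(new FetchRecord(url, status, headers, body));
    }

    public List<FetchRecord> records() {
        return records;
    }
}
```

With this shape, a NOTMODIFIED fetch still produces exactly one record whose headers (for example Last-Modified) remain available to whatever consumes the segment.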

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

