nutch-dev mailing list archives

From Doug Cutting
Subject Re: servlet
Date Wed, 23 Mar 2005 17:22:57 GMT
Andrzej Bialecki wrote:
> However, what I think is ultimately needed to match the features of 
> other search engines is not the ability to return the cached non-html 
> content (there might even be copyright issues with this function...), 
> but an html rendering of non-html content, a la Google's "View as HTML" 
> function.

Why are copyright issues different for HTML than for other formats?

I suspect that the original reason that Google did things this way was 
not for copyright or usability, but rather to take advantage of their 
HTML-related technology (e.g., boosting scores for headings, etc.) and 
to minimize storage requirements.  If it were primarily a usability issue, 
they could convert to HTML on the fly.  Rather, it appears that 
Google decided to convert everything to a common-denominator format 
early in their pipeline, before the cache is written.  Nutch keeps a 
higher-fidelity cache, which permits it to show the original content, as 
well as any lower-fidelity renderings.
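
To make the distinction concrete, here is a minimal sketch of a cache that 
stores the original bytes and derives an HTML view only when asked.  The 
class and method names (OriginalContentCache, store, original, asHtml) are 
hypothetical and are not Nutch's actual API; only plain text is converted 
here, and any other format would need its own converter.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a high-fidelity cache; not Nutch's real classes. */
class OriginalContentCache {

    /** One cached entry: the content type plus the bytes exactly as fetched. */
    private static final class Entry {
        final String contentType;
        final byte[] raw;

        Entry(String contentType, byte[] raw) {
            this.contentType = contentType;
            this.raw = raw;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Stores the fetched bytes unchanged, so no fidelity is lost at write time. */
    void store(String url, String contentType, byte[] raw) {
        entries.put(url, new Entry(contentType, raw.clone()));
    }

    /** Returns a copy of the original content, or null if the URL is not cached. */
    byte[] original(String url) {
        Entry e = entries.get(url);
        return e == null ? null : e.raw.clone();
    }

    /**
     * Derives a lower-fidelity HTML rendering on the fly.
     * Returns null when the URL is unknown or no converter exists for its type.
     * Assumes UTF-8 text purely for illustration.
     */
    String asHtml(String url) {
        Entry e = entries.get(url);
        if (e == null) {
            return null;
        }
        String text = new String(e.raw, StandardCharsets.UTF_8);
        if (e.contentType.startsWith("text/html")) {
            return text; // already HTML, nothing to convert
        }
        if (e.contentType.startsWith("text/plain")) {
            return "<pre>" + escape(text) + "</pre>";
        }
        return null; // other formats would need a format-specific converter
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because the stored bytes are never rewritten, an HTML rendering can be added, 
changed, or dropped later without refetching.  Converting before the cache is 
written would save space but make the original unrecoverable.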

If we someday index, e.g., headings and bolded text specially, then we 
may find it useful to have a common-denominator intermediate format, 
like HTML, that all content types are converted to.  But until we do, I 
don't see much point in caching an HTML representation.

