nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: servlet
Date Wed, 23 Mar 2005 21:58:55 GMT
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> However, what I think is ultimately needed to match the features of 
>> other search engines is not the ability to return the cached non-html 
>> content (there might even be copyright issues with this function...), 
>> but an html rendering of non-html content, a la Google's "View as 
>> HTML" function.
> Why are copyright issues different for HTML than for other formats?

Because it is much less common to encounter a restrictive license on 
HTML than on other formats.

> I suspect that the original reason that Google did things this way was 
> not for copyright or usability, but rather to take advantage of their 
> HTML-related technology (e.g., boosting scores for headings, etc.) and 
> to minimize storage requirements.  If it was primarily a usability issue 
> then they could convert to html on the fly.  Rather it appears that 
> Google decided to convert everything to a common-denominator format 
> early in their pipeline, before the cache is written.  Nutch keeps a 
> higher-fidelity cache, which permits it to show the original content, as 
> well as any lower-fidelity renderings.

This is technically true. However, my point was that someone could treat 
this high-fidelity caching as unauthorized re-distribution of content 
covered by more restrictive licenses than HTML. Think, e.g., of mp3, 
avi, and high-quality images: they can technically be downloaded, but 
their re-distribution is legally encumbered. If Nutch 
keeps a lower-quality copy in its cache, then it's easy to defend against the 
accusations of abuse. However, if you can download basically the same 
content from Nutch's cache as from the original site, you could run into 

Google steers nicely around this legal problem by always providing 
lower-resolution content, and by clearly "stamping" the content so that 
it cannot be mistaken for content coming from the original site.

That said, I think this functionality is good to have anyway, even if 
individual Nutch operators may decide not to display such content on 
their public sites.

Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
