nutch-dev mailing list archives

From Doug Cutting
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 07 Feb 2007 19:11:17 GMT
Chris Mattmann wrote:
>  Sorry to be so thick-headed, but could someone explain to me in really
> simple language what this change is requesting that is different from the
> current Nutch API? I still don't get it, sorry...

A Content would no longer generate a single Parse.  Instead, a Content 
could potentially generate many Parses.  For most types of content, 
e.g., HTML, each Content would still generate a single Parse.  But for 
RSS, a Content might generate multiple Parses, each indexed separately 
and each with a distinct URL.
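
A rough sketch of the idea, not the actual Nutch API: the class name, the
parse() method, and the tab-separated toy "feed" format below are all
hypothetical, chosen only to show one Content yielding one Parse for HTML
and several Parses, each keyed by its own URL, for RSS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; these are not Nutch's real classes or signatures.
final class MultiParseSketch {

    // Returns one entry per Parse, keyed by the URL it would be indexed under.
    static Map<String, String> parse(String url, String contentType, String body) {
        Map<String, String> parses = new LinkedHashMap<>();
        if (contentType.equals("application/rss+xml")) {
            // Toy stand-in for a real RSS parser (assumption):
            // each line of the body is "itemUrl<TAB>itemText".
            for (String line : body.split("\n")) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    parses.put(parts[0], parts[1]); // each item gets its own URL
                }
            }
        } else {
            // Most content types, e.g. HTML: a single Parse under the content's own URL.
            parses.put(url, body);
        }
        return parses;
    }

    public static void main(String[] args) {
        Map<String, String> html =
            parse("http://example.com/page.html", "text/html", "Hello");
        Map<String, String> rss =
            parse("http://example.com/feed.xml", "application/rss+xml",
                  "http://example.com/post/1\tFirst post\n"
                + "http://example.com/post/2\tSecond post");
        System.out.println("HTML parses: " + html.keySet());
        System.out.println("RSS parses:  " + rss.keySet());
    }
}
```

The point of returning a map rather than a single value is that the indexer
can then treat each entry as its own document, without caring whether the
entries came from one fetched Content or from many.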

Another potential application is processing archives: the parser 
could unpack the archive, and each item in it could be indexed separately 
rather than indexing the archive as a whole.  This only makes sense if 
each item has a distinct URL, which it does in RSS but might not in an 
archive.  However, some archive file formats do contain URLs, such as 
the one used by the Internet Archive.

Does that help?

