nutch-dev mailing list archives

From sdeck
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Thu, 08 Feb 2007 23:00:11 GMT

So, here is what I do for RSS Feeds.

I parse the RSS, and for each outlink I create the outlink object and set
its anchor text to a well-formed XML string containing the pub date,
description, etc. I only do this because I was hacking the outlink to reuse
its anchor text; you could just as well create a new MetaData object to go
with an outlink. Then, the next time that URL comes up and goes to the HTML
parser, you can look at the outlink's metadata and say, hey, you came from
an RSS feed. So I can either use the stored metadata and not parse the HTML
at all, or I can combine that metadata with what comes from the HTML, etc.
I have found that to be the best solution.
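
A minimal sketch of that flow, written in plain Python rather than against the Nutch API. The `Outlink` dataclass, the `rssmeta` wrapper element, and both function names are assumptions made up for illustration. It packs each item's fields into the anchor text as XML, then recovers them later when the linked URL is processed.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Outlink:
    # Hypothetical stand-in for a crawler outlink; not the Nutch class.
    url: str
    anchor: str


def outlinks_from_rss(rss_text):
    """Build one Outlink per <item>, packing item fields into the anchor as XML."""
    outlinks = []
    for item in ET.fromstring(rss_text).iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link gives us nothing to fetch
        meta = ET.Element("rssmeta")  # assumed wrapper element name
        for field in ("title", "description", "pubDate"):
            value = item.findtext(field)
            if value is not None:
                ET.SubElement(meta, field).text = value.strip()
        outlinks.append(Outlink(link, ET.tostring(meta, encoding="unicode")))
    return outlinks


def metadata_from_anchor(anchor):
    """Return the packed fields as a dict, or None if the anchor is ordinary text."""
    try:
        root = ET.fromstring(anchor)
    except ET.ParseError:
        return None
    if root.tag != "rssmeta":
        return None
    return {child.tag: child.text or "" for child in root}


FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>http://example.com/1</link>
<description>Hello &amp; welcome</description>
<pubDate>Thu, 08 Feb 2007 23:00:11 GMT</pubDate></item>
</channel></rss>"""

links = outlinks_from_rss(FEED)
meta = metadata_from_anchor(links[0].anchor)
```

When `metadata_from_anchor` returns a dict, the HTML parser can either index those fields directly or merge them with whatever it extracts from the page; when it returns `None`, the link did not come from a feed and is parsed as usual.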

Also, when I parse the RSS feed, I set a meta tag called "noindex", so in my
basic indexer, if that tag is present, I do not include the RSS feed page in
the Lucene index.
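
The "noindex" check could look roughly like this. It is again plain Python with hypothetical names (`parse_feed_page`, `pages_to_index`, and the dict layout are assumptions, not Nutch or Lucene calls): the feed parser flags the feed page, and the indexer drops anything carrying the flag.

```python
def parse_feed_page(url):
    """Hypothetical parse result for the feed page itself, flagged with "noindex"."""
    return {"url": url, "meta": {"noindex": "true"}}


def pages_to_index(parsed_pages):
    """Keep only the pages whose metadata does not carry the "noindex" flag."""
    return [page for page in parsed_pages if "noindex" not in page["meta"]]


pages = [
    parse_feed_page("http://example.com/feed.rss"),
    {"url": "http://example.com/1", "meta": {"title": "First post"}},
]
indexable = pages_to_index(pages)
```

Only the item page survives the filter, so the feed page is still fetched and parsed for its outlinks but never shows up in search results.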


Doug Cutting wrote:
> Chris Mattmann wrote:
>> Got it. So, the logic behind this is, why bother waiting until the
>> following fetch to parse (and create ParseData objects from) the RSS
>> items out of the feed. Okay, I get it, assuming that the RSS feed has
>> *all* of the RSS metadata in it. However, it's perfectly acceptable to
>> have feeds that simply have a title, description, and link in it.
> Almost.  The feed may have less than the referenced page, but it's also 
> a lot easier to parse, since the link could be an anchor within a large 
> page, or could be a page that has lots of navigation links, spam 
> comments, etc.  So feed entries are generally much more precise than the 
> pages they reference, and may make for a higher-quality search experience.
>> I guess this is still valuable metadata information to have, however,
>> the only caveat is that the implication of the proposed change is:
>> 1. We won't have cached copies, or fetched copies of the Content
>> represented by the item links. Therefore, in this model, we won't be
>> able to pull up a Nutch cache of the page corresponding to the RSS item,
>> because we are circumventing the fetch step
> Good point.  We indeed wouldn't have these URLs in the cache.
>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>> single type of content, RSS. Even if there are more content types that
>> follow this model, as Doug and Renaud both pointed out, there aren't a
>> multitude of them (perhaps archive files, but can you think of any
>> others)?
> Also true.  On the other hand, Nutch provides 98% of an RSS search 
> engine.  It'd be a shame to have to re-invent everything else and it 
> would be great if Nutch could evolve to support RSS well.
> Might image search also benefit from this?  One could generate a 
> Parse for each image on a page whose text was from the page.  Product 
> search too, perhaps.
>> The other main thing that comes to mind about this for me is it prevents
>> the fetched Content for the RSS items from being able to provide useful
>> metadata, in the sense that it doesn't explicitly fetch the content. What
>> if we wanted to apply some super cool metadata extractor X that used
>> word-stemming, HTML design analysis, and other techniques to extract
>> metadata from the content pointed to by an RSS item link? In the proposed
>> model, we assume that the RSS xml item tag already contains all necessary
>> metadata for indexing, which in my mind, limits the model. Does what I am
>> saying make sense? I'm not shooting down the issue, I'm just trying to
>> brainstorm a bit here about the issue.
> Sure, the RSS feed may contain less than the page it references, but 
> that might be all that one wishes to index.  Otherwise, if, e.g., a blog 
> includes titles from other recent posts you're going to get lots of 
> false positives.  Ideally Nutch should support various options: 
> searching the feed only, searching the referenced page only, or perhaps 
> searching both.
> Doug
