nutch-dev mailing list archives

From pdec...@yahoo.com
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 31 Jan 2007 03:00:32 GMT
Chris,

I saw your name associated with the RSS parser in Nutch.  My understanding is that Nutch
uses feedparser.  I have two questions:

1.  Have you looked at VTD as an RSS parser?
2.  Do you have a view on asynchronous communication as the underlying protocol?  I do not
believe that feedparser uses it at this point.

Thanks
  

-----Original Message-----
From: Chris Mattmann <chris.mattmann@jpl.nasa.gov>
Date: Tue, 30 Jan 2007 18:16:44 
To:<nutch-dev@lucene.apache.org>
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

 I can most likely help if you give me some more information. For instance,
I'm wondering whether the use case you describe below is already supported
by the current RSS parse plugin?

 The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?
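
 As a rough illustration of that idea only (this is not the parse-rss code,
which is a Java plugin), the Python sketch below collects each item's link
from an RSS 2.0 document as a candidate URL to fetch. The function name
extract_item_links and the sample feed are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, used only to exercise the function below.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First item</title>
      <link>http://example.com/a.html</link>
    </item>
    <item>
      <title>Second item</title>
      <link> http://example.com/b.html </link>
    </item>
    <item>
      <title>Item without a link</title>
    </item>
  </channel>
</rss>"""


def extract_item_links(feed_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in order."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        # Only the item's own <link> child is read; the channel-level
        # <link> is not a child of any <item> and is therefore skipped.
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


if __name__ == "__main__":
    for url in extract_item_links(SAMPLE_FEED):
        print(url)
```

 In a crawler, each returned URL would then go onto the fetch queue like any
other outlink, so every item ends up fetched and indexed under its own URL.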

Cheers,
  Chris



On 1/30/07 6:01 PM, "kauu" <babatu@gmail.com> wrote:

> Hi folks,
> 
>   What I want to do is split an RSS file into several pages.
> 
>   As discussed before, I want to fetch an RSS page and index it as
> separate documents in the index, so that a searcher can get each
> item's info as an individual hit.
> 
>   My idea is to create a protocol that fetches the RSS page and stores it as
> several pages, each containing just one ITEM tag. But the unique key is the
> URL, so how can I store them with the ITEM's link tag as the unique key for
> each document?
> 
>   So my question is how to implement this in nutch-0.8.x.
> 
>   I've checked the code of the protocol-http plug-in, but I can't
> find where a page is stored as a document. I want to split the RSS page
> into several documents before it is stored, instead of storing it as one.
> 
>   Can anyone give me some hints?
> 
> Any reply will be appreciated!
> 
> ITEM structure:
> 
> <item>
>     <title>Belated European snowstorm strikes, causing flight delays and
> traffic chaos (photos)</title>
>     <description>Snowstorm sweeps across Europe, causing numerous flight
> delays. On January 24, several airliners wait at Stuttgart Airport in
> Germany to have ice and snow removed from their fuselages. On January 24,
> workers clear snow from a runway at Munich Airport in southern Germany.
> According to reports, the belated snowstorm swept for two consecutive days
> across ...
>     </description>
>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>     <category>Sohu Focus Photo News</category>
>     <author>cms@sohu.com</author>
>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> </item>
> 
>  
> 
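
One way to picture the per-item split asked about above: parse the feed once
and emit one record per item, keyed by the item's link rather than by the feed
URL. This is a minimal Python sketch that is independent of Nutch. The name
split_feed_by_link, the FIELDS tuple and the sample feed are hypothetical, and
wiring such records into Nutch's parse and index steps is not shown.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; the first link is the one from the quoted item.
FEED = """<rss version="2.0">
  <channel>
    <title>Sample channel</title>
    <item>
      <title>Storm delays flights</title>
      <description>Snow closes runways.</description>
      <link>http://news.sohu.com/20070125/n247833568.shtml</link>
      <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    </item>
    <item>
      <title>Another story</title>
      <link>http://example.com/story2.html</link>
    </item>
  </channel>
</rss>"""

# Item children copied into each record (an assumption for this sketch).
FIELDS = ("title", "description", "category", "author", "pubDate")


def split_feed_by_link(feed_xml):
    """Map each item's link to a dict holding that item's text fields."""
    root = ET.fromstring(feed_xml)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # without a link there is nothing to key the record on
        # Missing fields become empty strings so every record has the same keys.
        docs[link] = {
            name: (item.findtext(name) or "").strip() for name in FIELDS
        }
    return docs


if __name__ == "__main__":
    for key, fields in split_feed_by_link(FEED).items():
        print(key, "->", fields["title"])
```

Keying on the link means two items that point at the same URL collapse into
one record, with the later item overwriting the earlier one.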

