nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 31 Jan 2007 03:34:49 GMT
Hi there,

On 1/30/07 7:00 PM, "pdecrem@yahoo.com" <pdecrem@yahoo.com> wrote:

> Chris,
> 
> I saw your name associated with the rss parser in nutch.  My understanding is
> that nutch is using feedparser.  I had two questions:
> 
> 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

> 2.  Any view on asynchronous communication as the underlying protocol?  I do
> not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication when parsing rss feeds
affords you: what type of communications are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's rss parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?
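
To illustrate what a callback framework for feed parsing looks like in general, here is a minimal sketch. It is written in Python with the standard-library xml.sax module as a stand-in; it is not the commons-feedparser API and not the parse-rss plugin code. The names ItemCallbackHandler, parse_feed and on_item are hypothetical, and the sample feed is made up.

```python
import xml.sax


class ItemCallbackHandler(xml.sax.ContentHandler):
    """Hypothetical handler: invokes on_item(dict) once per <item> element."""

    def __init__(self, on_item):
        super().__init__()
        self.on_item = on_item  # user-supplied callback
        self.item = None        # fields of the <item> being read, or None
        self.field = None       # name of the child element being read
        self.text = []          # text chunks collected for that child

    def startElement(self, name, attrs):
        if name == "item":
            self.item = {}
        elif self.item is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "item":
            self.on_item(self.item)  # hand the finished item to the caller
            self.item = None
        elif self.item is not None and name == self.field:
            self.item[name] = "".join(self.text).strip()
            self.field = None


def parse_feed(xml_bytes, on_item):
    """Parse already-fetched feed bytes; fetching is a separate concern."""
    xml.sax.parseString(xml_bytes, ItemCallbackHandler(on_item))


# Made-up sample feed, for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>demo</title>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

seen = []
parse_feed(SAMPLE, seen.append)
```

Note that the parser only ever sees bytes that were already fetched, which is why the question of synchronous versus asynchronous fetching sits in the protocol layer and not in the parser.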

Thanks!

Cheers,
  Chris


> 
> Thanks
>   
> 
> -----Original Message-----
> From: Chris Mattmann <chris.mattmann@jpl.nasa.gov>
> Date: Tue, 30 Jan 2007 18:16:44
> To:<nutch-dev@lucene.apache.org>
> Subject: Re: RSS-fecter and index individul-how can i realize this function
> 
> Hi there,
> 
>  I could most likely be of assistance, if you gave me some more information.
> For instance: I'm wondering if the use case you describe below is already
> supported by the current RSS parse plugin?
> 
>  The current RSS parser, parse-rss, does in fact index individual items that
> are pointed to by an RSS document. The items are added as Nutch Outlinks,
> and added to the overall queue of URLs to fetch. Doesn't this satisfy what
> you mention below? Or am I missing something?
> 
> Cheers,
>   Chris
> 
> 
> 
> On 1/30/07 6:01 PM, "kauu" <babatu@gmail.com> wrote:
> 
>> Hi folks,
>> 
>>    What I want to do is split an RSS file into several pages.
>> 
>>   As has been discussed before, I want to fetch an RSS page and index
>> it as separate documents in the index, so that a searcher can find each
>> item's info as an individual hit.
>> 
>>  My idea is to create a protocol that fetches the RSS page and stores it as
>> several pages, each containing just one ITEM tag. But the unique key is the
>> URL, so how can I store them with each ITEM's link tag as the unique key for
>> a document?
>> 
>>   So my question is how to implement this in nutch-0.8.x.
>> 
>>   I've checked the code of the protocol-http plug-in, but I can't
>> find where a page is stored as a document. I want to split the
>> RSS page before it is stored, so that it becomes several documents, not one.
>> 
>>   Can anyone give me some hints?
>> 
>> Any reply will be appreciated!
>> 
>>  
>> 
>>  
>> 
>>   ITEM's structure:
>> 
>>  <item>
>>     <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
>>     <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...</description>
>>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>>     <category>搜狐焦点图新闻</category>
>>     <author>cms@sohu.com</author>
>>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
>>  </item>
>> 
>>  
>> 
> 
> 
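
For reference, the idea discussed in the quoted thread (treating each item of an RSS document as its own record, keyed by the item's link) can be sketched as follows. This is a minimal Python illustration using only the standard library, under the assumption that every item carries a link element; it is not how parse-rss or any other Nutch plugin is implemented, and split_items and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_items(rss_xml):
    """Return {link: fields} with one entry per <item> in the feed.

    The item's <link> plays the role of the unique key; items
    without a link are skipped (an assumption of this sketch).
    """
    root = ET.fromstring(rss_xml)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        # Copy every direct child of <item> as a plain text field.
        docs[link] = {child.tag: (child.text or "").strip() for child in item}
    return docs


# Made-up sample feed, for illustration only.
SAMPLE = """<rss version="2.0"><channel><title>demo</title>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>No link here</title></item>
<item><title>Second</title><link> http://example.com/2 </link></item>
</channel></rss>"""

docs = split_items(SAMPLE)
outlinks = list(docs)  # the per-item URLs, in feed order
```

Here the keys of docs correspond to the per-item URLs that would be queued for fetching, and each value holds the fields that could be indexed for that item.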


