nutch-dev mailing list archives

From "kauu" <bab...@gmail.com>
Subject RSS fetcher and indexing items individually: how can I realize this function
Date Wed, 31 Jan 2007 02:01:36 GMT
Hi folks,

   What I want to do is split an RSS file into several pages.

  As has been discussed before, I want to fetch an RSS page and index it as
separate documents in the index, so that a searcher can find each ITEM's
info as an individual hit.

 My idea is to create a protocol plug-in that fetches the RSS page and stores
it as several pages, each containing just one ITEM tag. But the unique key is
the URL of the fetched page, so how can I store the pages with each ITEM's
link tag as the unique key for a document?
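
 To make the idea concrete, here is a minimal sketch of the splitting step
only, in plain Java using the JDK XML parser and no Nutch classes. It reads
a feed and produces one entry per ITEM, keyed by the text of that ITEM's
link tag instead of the feed URL. The names RssItemSplitter and splitByLink
are hypothetical, chosen just for illustration, and the sketch assumes every
ITEM has a non-empty link. How these per-item entries would be handed to the
Nutch 0.8.x storage code is the open question and is not shown.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: splits one RSS feed into per-item records.
final class RssItemSplitter {

    /** Returns one field map per <item>, keyed by the trimmed text of its <link>. */
    static Map<String, Map<String, String>> splitByLink(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document feed = builder.parse(new InputSource(new StringReader(rssXml)));

        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        NodeList items = feed.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            // Collect the child elements of this item (title, description, link, ...).
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = items.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            // Use the item's own link, not the feed URL, as the unique key.
            String link = fields.get("link");
            if (link != null && !link.isEmpty()) {
                docs.put(link, fields);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss><channel>"
            + "<item><title>First</title><link>http://example.com/1.html</link></item>"
            + "<item><title>Second</title><link>http://example.com/2.html</link></item>"
            + "</channel></rss>";
        for (Map.Entry<String, Map<String, String>> e : splitByLink(feed).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().get("title"));
        }
    }
}
```

 Items without a link are skipped in this sketch, since they would have no
key to be stored under.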

  So my question is how to realize this function in nutch-0.8.x.

  I've checked the code of the protocol-http plug-in, but I can't find the
code where a page is stored as a document. I want to split the RSS page into
several pages before it is stored, so that it ends up as several documents
rather than one.

  So can anyone give me some hints?

Any reply will be appreciated!

 

 

  ITEM's structure:

 <item>
    <title>Late-arriving blizzard hits Europe, causing flight delays and traffic chaos (photos)</title>
    <description>A blizzard swept across Europe, causing many flight delays.
On January 24, several airliners waited at Stuttgart Airport in Germany to
have ice and snow removed from their fuselages. On January 24, workers
cleared snow from the runway at Munich Airport in southern Germany.
According to reports, the late-arriving blizzard swept for two days in a
row across ...
    </description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>Sohu Focus Picture News</category>
    <author>cms@sohu.com</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
 </item>

 

