nutch-dev mailing list archives

From "kauu" <>
Subject RSS fetcher and indexing items individually: how can I realize this function
Date Wed, 31 Jan 2007 02:01:36 GMT
Hi folks,

   What I want to do is split an RSS file into several pages.

  As has been discussed before, I want to fetch an RSS page and index
it as several separate documents, so that a searcher gets each
item's info as an individual hit.

 My idea is to create a protocol plug-in that fetches the RSS page and stores
it as several pages, each containing just one ITEM tag. But the unique key is
the URL, so how can I store them with each ITEM's link tag as the unique key?
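
 To illustrate only the splitting step, here is a minimal sketch in plain
Java using the JDK's XML classes. It is not Nutch code: RssItemSplitter and
split are hypothetical names, the feed in main is made up, and how the
per-item results would be handed to Nutch for storage and indexing is exactly
what I am asking about.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Nutch): splits one RSS feed into per-item fragments. */
public class RssItemSplitter {

    /** Maps each item's link text (the proposed unique key) to that item's XML. */
    public static Map<String, String> split(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document feed = builder.parse(new InputSource(new StringReader(rssXml)));

        // Serializer used to turn each <item> element back into text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        Map<String, String> itemsByLink = new LinkedHashMap<>();
        NodeList items = feed.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList links = item.getElementsByTagName("link");
            if (links.getLength() == 0) {
                continue; // no link, so there is no key to store this item under
            }
            String key = links.item(0).getTextContent().trim();
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(item), new StreamResult(out));
            itemsByLink.put(key, out.toString());
        }
        return itemsByLink;
    }

    public static void main(String[] args) throws Exception {
        // Made-up two-item feed, for illustration only.
        String feed = "<rss><channel>"
                + "<item><title>First</title><link>http://example.com/a.shtml</link></item>"
                + "<item><title>Second</title><link>http://example.com/b.shtml</link></item>"
                + "</channel></rss>";
        for (Map.Entry<String, String> e : split(feed).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

 Each map entry would then correspond to one document, keyed by the item's
link instead of the feed's URL.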

  So my question is how to realize this function in nutch-0.8.x.

  I've checked the code of the protocol-http plug-in, but I can't
find where a page is stored as a document. I want to split the RSS
page before it is stored, so that it is stored not as one document but as several.

  Can anyone give me some hints?

Any reply will be appreciated!



  ITEM's structure

  <item>
    <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
    <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...</description>
    <link>.../n247833568.shtml</link>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  </item>
