nutch-dev mailing list archives

From kauu <bab...@gmail.com>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 31 Jan 2007 12:37:35 GMT
Hi,

Thanks anyway, but I don't think I explained it clearly enough.

What I want is for Nutch to fetch only the RSS seeds, to a depth of 1. So Nutch
should fetch just a few XML pages. I don't want it to fetch the pages behind the
items' outlinks, because there is too much spam in those pages.

So I only need to parse the RSS file. Then, when I search for words that appear
in the description tag of one of the XML file's items, the returned hit should
look like this:

title   == that item's title
summary == that item's description
link    == that item's outlink

So, does the parse-rss plugin provide this function?
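
A minimal sketch of the behaviour described above, assuming the feed is plain RSS 2.0. It uses only Python's standard library and is not Nutch or parse-rss code; `SAMPLE_RSS`, `split_items`, and `search` are hypothetical names made up for illustration. Each item becomes one hit with a title, summary, and link, and the item's outlink is never fetched.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the first item mirrors the one quoted in this thread.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>example channel</title>
    <link>http://example.org/feed</link>
    <item>
      <title>nutch--open source</title>
      <description>nutch nutch nutch</description>
      <link>http://lucene.apache.org/nutch</link>
    </item>
    <item>
      <title>second item</title>
      <description>another description</description>
      <link>http://example.org/second</link>
    </item>
  </channel>
</rss>"""


def split_items(rss_text):
    """Return one dict per <item>; the linked pages are never fetched."""
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        hits.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return hits


def search(hits, word):
    """Naive case-insensitive substring match on each item's description."""
    return [h for h in hits if word.lower() in h["summary"].lower()]
```

Calling `search(split_items(SAMPLE_RSS), "another")` would return only the second item, with its own title, description, and link, which is the hit shape asked for above.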

On 1/31/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
>
> Hi there,
>
>   With the explanation that you give below, it seems like parse-rss as it
> exists would address what you are trying to do. parse-rss parses an RSS
> channel as a set of items, and indexes overall metadata about the RSS file,
> including parse text, and index data, but it also adds each item (in the
> channel)'s URL as an Outlink, so that Nutch will process those pieces of
> content as well. The only thing that you suggest below that parse-rss
> currently doesn't do, is to allow you to associate the metadata fields
> category:, and author: with the item Outlink...
>
> Cheers,
>   Chris
>
>
>
> On 1/30/07 7:30 PM, "kauu" <babatu@gmail.com> wrote:
>
> > Thanks for your reply.
> > Maybe I didn't explain it clearly.
> > I want to index each item as an individual page. Then, when I search for
> > something, for example "nutch-open source", Nutch returns a hit which
> > contains:
> >
> >    title       : nutch-open source
> >    description : nutch nutch nutch ....nutch  nutch
> >    url         : http://lucene.apache.org/nutch
> >    category    : news
> >    author      : kauu
> >
> > So, can the parse-rss plugin satisfy what I need?
> >
> > <item>
> >     <title>nutch--open source</title>
> >     <description>
> >         nutch nutch nutch ....nutch nutch
> >     </description>
> >     <link>http://lucene.apache.org/nutch</link>
> >     <category>news</category>
> >     <author>kauu</author>
>
>
>
> On 1/31/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
> >
> > Hi there,
> >
> > I could most likely be of assistance, if you gave me some more
> > information. For instance: I'm wondering if the use case you describe
> > below is already supported by the current RSS parse plugin?
> >
> > The current RSS parser, parse-rss, does in fact index individual items
> > that are pointed to by an RSS document. The items are added as Nutch
> > Outlinks, and added to the overall queue of URLs to fetch. Doesn't this
> > satisfy what you mention below? Or am I missing something?
> >
> > Cheers,
> >   Chris
> >
> >
> >
> > On 1/30/07 6:01 PM, "kauu" <babatu@gmail.com> wrote:
> >
> > > Hi folks:
> > >
> > >   What I want to do is to separate an RSS file into several pages.
> > >
> > >   Just as has been discussed before, I want to fetch an RSS page and
> > > index it as different documents in the index, so the searcher can find
> > > each item's info as an individual hit.
> > >
> > >   My idea is to create a protocol that fetches the RSS page and stores it
> > > as several pages, each containing just one ITEM tag. But the unique key
> > > is the URL, so how can I store them with the ITEM's link tag as the
> > > unique key for a document?
> > >
> > >   So my question is how to realize this function in nutch-0.8.x.
> > >
> > >   I've checked the code of the protocol-http plug-in, but I can't find
> > > the code that stores a page as a document. I want to separate the RSS
> > > page into several pages before storing it, so that it is stored not as
> > > one document but as several.
> > >
> > >   So can anyone give me some hints?
> > >
> > > Any reply will be appreciated!
> > >
> > >
> > >
> > >
> >
> > >
> > >   ITEM's structure
> > >
> > >  <item>
> > >     <title>Late-striking European snowstorm causes flight delays and
> > >     traffic chaos (photos)</title>
> > >     <description>Snowstorm sweeps Europe, causing numerous flight
> > >     delays. On January 24, several airliners waited at Stuttgart Airport
> > >     in Germany to have ice and snow removed from their fuselages. On
> > >     January 24, workers cleared snow from the runways at Munich Airport
> > >     in southern Germany. According to reports, the belated snowstorm
> > >     swept for two consecutive days across...
> > >     </description>
> > >     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> > >     <category>Sohu Focus Photo News</category>
> > >     <author>cms@sohu.com</author>
> > >     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> > >     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> > >  </item>
> > >
> > >
> >
> > >
> >
> >
> >
>
>
> --
> www.babatu.com
>
>
>
>
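
The question in the quoted thread about using each ITEM's link as the unique key, together with the category and author fields, could be sketched roughly as follows. This is an illustration in plain Python under the assumption of a well-formed RSS 2.0 feed; it is not how Nutch or parse-rss stores documents, and `index_by_link`, `FIELDS`, and `ITEM_XML` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Fields copied from each item into its record (an assumption for this sketch).
FIELDS = ("title", "description", "category", "author")

# Hypothetical one-item feed built from the example item quoted in the thread.
ITEM_XML = """<rss version="2.0"><channel>
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
</channel></rss>"""


def index_by_link(rss_text):
    """Map each item's <link> to a record of its metadata fields."""
    root = ET.fromstring(rss_text)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link has no usable key
        # A later item with the same link overwrites the earlier one.
        docs[link] = {f: (item.findtext(f) or "").strip() for f in FIELDS}
    return docs
```

Keying on the link means two items that point at the same URL collapse into one record, which is one possible reading of "unique key" here.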


-- 
www.babatu.com