nutch-dev mailing list archives

From "nadav hashimshony" <nad...@gmail.com>
Subject Re: read crawldb.
Date Sun, 03 Feb 2008 08:43:05 GMT
Thank you all for your help.

Nadav.


On Jan 31, 2008 10:04 AM, Siddhartha Reddy <sids@grok.in> wrote:

> Thank you for correcting me.
>
> I was not suggesting that one could retrieve the content from the
> crawldb. I was trying to suggest that to retrieve the content from the
> segment's 'content' directory, one could do something similar to what
> CrawlDbReader does. I guess I did not express this very clearly.
>
> I was not aware this could be done with SegmentReader; I should have
> looked harder.
>
> Best,
> Siddhartha
>
> On Jan 31, 2008 1:22 PM, Andrzej Bialecki <ab@getopt.org> wrote:
>
> > Siddhartha Reddy wrote:
> > > On looking further I think it might be possible to get the content
> > > given a URL, but there is no existing class in Nutch that can do
> > > this. Have a look at the CrawlDbReader code (particularly the
> > > 'readUrl' function); you will want to do something similar.
> >
> > This is not the case. First of all, the CrawlDb doesn't keep the
> > content of pages; it's the segments that do. The CrawlDb keeps only
> > each page's status and the information needed to schedule its
> > crawling. You can use CrawlDbReader to get this information for any
> > particular URL in the database.
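> >
> > On the command line this is exposed as the readdb tool, e.g. to print
> > the record for a single URL:
> >
> >   bin/nutch readdb <crawldb> -url <url>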
> >
> > Also, you can use the existing class SegmentReader to retrieve the
> > full content of a page from the specific segment where it was crawled:
> >
> >   bin/nutch readseg -get <segment> <url>
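> > For example (the segment name here is made up):
> >
> >   bin/nutch readseg -get crawl/segments/20080131123456 http://www.example.com/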
> >
> > >
> > > If you want to write a mapred job that will give you access to the
> > > content of the page, I think the Indexer class is a good starting
> > > point. http://nutch.grok.in/wiki/images/ProcessContent.java is a
> > > simple map-reduce job I have written (based on Indexer) that makes
> > > the content as well as the metadata of each downloaded page
> > > available in the reduce task.
> >
> > This is not needed unless the above method doesn't give you the data
> > in a format that you want. Even then, a much quicker way to retrieve
> > individual records is the method implemented in SegmentReader.get() -
> > you can use this API in your code too.
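> >
> > As an illustration, here is a minimal sketch of doing by hand what
> > SegmentReader.get() does for the raw content: a lookup in the
> > segment's 'content' MapFile. The class name is made up, and it
> > assumes the segment was written by a single reducer (part-00000
> > only) - check the details against your Nutch version:
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.io.MapFile;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.nutch.protocol.Content;
> >   import org.apache.nutch.util.NutchConfiguration;
> >
> >   public class GetContent {
> >     public static void main(String[] args) throws Exception {
> >       // args[0] = segment dir, args[1] = url
> >       Configuration conf = NutchConfiguration.create();
> >       FileSystem fs = FileSystem.get(conf);
> >       // the 'content' part of a segment is a MapFile of url -> Content
> >       Path part = new Path(new Path(args[0], Content.DIR_NAME),
> >                            "part-00000");
> >       MapFile.Reader reader = new MapFile.Reader(fs, part.toString(),
> >                                                  conf);
> >       Content content = new Content();
> >       if (reader.get(new Text(args[1]), content) != null) {
> >         System.out.write(content.getContent());
> >         System.out.flush();
> >       } else {
> >         System.err.println("Not found: " + args[1]);
> >       }
> >       reader.close();
> >     }
> >   }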
> >
> > A map-reduce job would be preferable if you wanted to calculate some
> > aggregate values from the segment, or if you wanted to process many
> > records; see the sketch below.
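> >
> > For instance, a rough sketch of such an aggregate job - the class is
> > hypothetical, written against the old mapred API of that time - that
> > counts fetched pages per content type in a segment:
> >
> >   import java.io.IOException;
> >   import java.util.Iterator;
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.io.LongWritable;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.hadoop.mapred.*;
> >   import org.apache.nutch.protocol.Content;
> >   import org.apache.nutch.util.NutchConfiguration;
> >
> >   public class ContentTypeCounter {
> >     public static class Map extends MapReduceBase
> >         implements Mapper<Text, Content, Text, LongWritable> {
> >       public void map(Text url, Content content,
> >           OutputCollector<Text, LongWritable> out, Reporter reporter)
> >           throws IOException {
> >         // one count per fetched page, keyed by its content type
> >         out.collect(new Text(content.getContentType()),
> >                     new LongWritable(1));
> >       }
> >     }
> >
> >     public static class Reduce extends MapReduceBase
> >         implements Reducer<Text, LongWritable, Text, LongWritable> {
> >       public void reduce(Text key, Iterator<LongWritable> values,
> >           OutputCollector<Text, LongWritable> out, Reporter reporter)
> >           throws IOException {
> >         long sum = 0;
> >         while (values.hasNext()) sum += values.next().get();
> >         out.collect(key, new LongWritable(sum));
> >       }
> >     }
> >
> >     public static void main(String[] args) throws IOException {
> >       // args[0] = segment dir, args[1] = output dir
> >       JobConf job = new JobConf(NutchConfiguration.create(),
> >                                 ContentTypeCounter.class);
> >       FileInputFormat.addInputPath(job,
> >           new Path(args[0], Content.DIR_NAME));
> >       // SequenceFileInputFormat also reads the data files of MapFiles
> >       job.setInputFormat(SequenceFileInputFormat.class);
> >       FileOutputFormat.setOutputPath(job, new Path(args[1]));
> >       job.setMapperClass(Map.class);
> >       job.setReducerClass(Reduce.class);
> >       job.setOutputKeyClass(Text.class);
> >       job.setOutputValueClass(LongWritable.class);
> >       JobClient.runJob(job);
> >     }
> >   }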
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
>
>
> --
> http://sids.in
> "If you are not having fun, you are not doing it right."
>
