nutch-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Nutch 2.X question
Date Fri, 07 Nov 2014 02:37:24 GMT
Hi Amit,

On Thu, Nov 6, 2014 at 1:54 PM, <dev-digest-help@nutch.apache.org> wrote:

> I have a small question about Nutch 2.X source code, i hope this is the
> right mailing list for
> that. i was unable to locate the following pieces from the code:
>
> a) where does the linkdb get generated, which java file contains the code
> for that
>

Nutch 2.X currently has no independent linkdb data structure like the opaque
object generated within Nutch 1.X.


>
> b) i see the WebPage class being utilized for remembering the pages that
>   were gathered.


Each URL is essentially a WebPage in Nutch 2.X. There is therefore one
WebPage for every document which is fetched by Nutch.


> It looks like the crawldb is a repository of these pages.


I feel you are using Nutch 1.X and 2.X terminology interchangeably here, and
it is quite confusing. Nutch 2.X does not in fact have a crawldb either. It
delegates all such data structures to Gora, which is an object-to-datastore
mapping framework. Objects in Gora are associated with an object store. In
Nutch, both the WebPage and Host stores are initialized within StorageUtils:
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java

So, I would say that the WebPage store is a datastore containing
collections of Nutch WebPage and/or Host objects.


> If that is the case then:
>
>   -- it looks like WebPage remembers the contents of the page together
>     with the rest of the information about the page. How do we delete
>     content which is old and not changed for a while
>

You do not need to do this. If a WebPage is refetched after some duration
of time the content will be updated based on the new version.
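
To illustrate the point about refetching, here is a minimal toy sketch in
plain Java. It is not the Nutch or Gora API: ToyWebStore, Page, fetch and
contentOf are hypothetical names made up for this example, and a HashMap
stands in for whatever datastore you have configured. It only assumes that a
page is stored under its URL, so a refetch replaces the earlier record rather
than adding a second copy.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a URL-keyed page store; NOT the Nutch or Gora API. */
class ToyWebStore {

    /** Hypothetical, simplified record; the real WebPage holds far more. */
    static final class Page {
        final String content;
        final long fetchTime;

        Page(String content, long fetchTime) {
            this.content = content;
            this.fetchTime = fetchTime;
        }
    }

    // One entry per URL (assumption: the URL is the key of the record).
    private final Map<String, Page> rows = new HashMap<>();

    /** Storing a fetch result replaces whatever was held for that URL. */
    void fetch(String url, String content, long fetchTime) {
        rows.put(url, new Page(content, fetchTime));
    }

    /** Returns the stored content for a URL, or null if never fetched. */
    String contentOf(String url) {
        Page p = rows.get(url);
        return p == null ? null : p.content;
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        ToyWebStore store = new ToyWebStore();
        store.fetch("http://example.org/", "old body", 1L);
        store.fetch("http://example.org/", "new body", 2L); // refetch
        System.out.println(store.size());                   // prints 1
        System.out.println(store.contentOf("http://example.org/")); // prints "new body"
    }
}
```

In this model the store grows with the number of distinct URLs, not with the
number of fetches, which is why there is no old copy of the content left to
clean up.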


>
> -- it does not appear that Nutch 2.X has any concept of segments.


Correct


> How do we delete stuff that is older than 1 month so that we dont blow
> out the disk space ?
>

Well, seeing as you have no segments, you don't need to delete anything. All
your data is flushed down into the datastore of your choice. Nutch 2.X does
not rely upon the opaque Hadoop sequence file data structures that are used
within Nutch 1.X. Not having to maintain segments is, I suppose, one feature
of Nutch 2.X.


>    It seemed that Nutch 1.x had segments, and older segments were removable
>

Yes, that is correct and also highly advised. Keeping an eye on your older
segments is something which everyone should do IMHO.
hth
Lewis
