nutch-dev mailing list archives

From amit sehas <>
Subject Nutch 2.X question
Date Tue, 04 Nov 2014 18:26:23 GMT

I have a small question about the Nutch 2.X source code; I hope this is the right mailing list
for that. I was unable to locate the following pieces in the code:

a) Where does the linkdb get generated? Which Java file contains the code for that?

b) I see the WebPage class being utilized for remembering the pages that were
  gathered. It looks like the crawldb is a repository of these pages. If that is
  the case then:

  -- It looks like WebPage remembers the contents of the page together with the
    rest of the information about the page. How do we delete content which is
    old and has not changed for a while?

  -- It does not appear that Nutch 2.X has any concept of segments. How do we
    delete stuff that is older than 1 month so that we don't blow out the disk space?

