nutch-dev mailing list archives

From "AJ Chen (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-87) Efficient site-specific crawling for a large number of sites
Date Fri, 02 Sep 2005 20:24:12 GMT
Efficient site-specific crawling for a large number of sites
------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher  
 Environment: cross-platform
 Reporter: AJ Chen


There is a gap between whole-web crawling and crawling a single site (or a handful of sites). Many applications
actually fall into this gap: they typically need to crawl a large number of selected sites,
say 100,000 domains. The current CrawlTool is designed for a handful of sites. This request
therefore calls for a new feature in, or an improvement to, CrawlTool so that the "nutch crawl" command can efficiently
handle a large number of sites. One requirement is to add or change as little code as possible
so that this feature can be implemented sooner rather than later.

There has been a discussion about adding a URLFilter to implement this feature; see the
following thread:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
The idea is to use a hashtable in the URLFilter to look up the regex for any given domain. A hashtable
lookup is expected to be much faster than the list implementation currently used in RegexURLFilter, because
only the patterns registered for a URL's own domain need to be tested. Fortunately,
Matt Kangas has already implemented this idea for his own application and is willing to make
it available for adaptation to Nutch. I'll be happy to help him in this regard.
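
To make the proposal concrete, here is a minimal sketch of a per-domain hashtable lookup. It is not Matt Kangas's code and it does not implement the Nutch URLFilter plugin interface; the class name DomainRegexFilter, the addRule method, and the choice to return null for rejected URLs are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch only: names and behavior are assumptions, not Nutch APIs.
class DomainRegexFilter {
    // Maps a lower-cased host name to the patterns accepted for that host.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    // Registers one accept pattern for a host (hypothetical helper).
    void addRule(String host, String regex) {
        rulesByHost.computeIfAbsent(host.toLowerCase(), k -> new ArrayList<>())
                   .add(Pattern.compile(regex));
    }

    // Returns the URL if it is accepted, or null if it is rejected.
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null; // no host component: reject
        }
        // One hash lookup instead of scanning every pattern for every site.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase());
        if (patterns == null) {
            return null; // host is not in the selected set: reject
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null; // host is known, but no pattern matched
    }

    public static void main(String[] args) {
        DomainRegexFilter f = new DomainRegexFilter();
        f.addRule("example.org", "^http://example\\.org/docs/");
        System.out.println(f.filter("http://example.org/docs/a.html"));
        System.out.println(f.filter("http://unlisted.example.com/"));
    }
}
```

With this layout the cost per URL depends on the number of patterns for that one host rather than on the total number of sites, at the price of keeping the whole table in memory. The sketch keys on the exact host name, so mapping subdomains to a parent domain would need extra handling.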

But before we do this, we would like to hear more discussion and comments about this approach
or alternative approaches. In particular, let us know of any potential downsides of hashtable
lookup in a new URLFilter plugin.

AJ Chen



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

