nutch-dev mailing list archives

From "Gal Nitzan (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-100) New plugin urlfilter-db
Date Thu, 29 Sep 2005 20:38:47 GMT
New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
 Environment: MapRed
    Reporter: Gal Nitzan
    Priority: Trivial


Hi,

I have written a small new plugin, urlfilter-db, based on the URLFilter interface.

The purpose of this plugin is to filter by domain: I would like to crawl the whole web but
fetch only from certain domains.

The plugin uses a caching system (SwarmCache, which is easier to deploy than JCS) backed by
a database.

for each url
   call filter(url)
end for

filter(url)
   get the domain name from the url
   look the domain up in the cache (cache.get)
   if it is in the cache, return it
   if it is not in the cache, try the database
   if it is in the database, cache it and return it
   otherwise return null
end filter
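
The lookup order described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the class and interface names are hypothetical, an access-ordered LinkedHashMap stands in for SwarmCache, a Predicate stands in for the JDBC lookup, the URL's host is taken as the "domain", and the filter is assumed to return the URL to keep it and null to drop it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for the filter contract: return the URL to keep it, null to drop it. */
interface UrlFilterSketch {
    String filter(String url);
}

/** Hypothetical domain filter: a small LRU cache in front of a slower allow-list lookup. */
class DomainDbFilterSketch implements UrlFilterSketch {
    private final Map<String, Boolean> cache;   // stand-in for SwarmCache
    private final Predicate<String> database;   // stand-in for the JDBC lookup
    int databaseLookups = 0;                    // exposed only so the sketch can be checked

    DomainDbFilterSketch(final int cacheSize, Predicate<String> database) {
        this.database = database;
        // Access-ordered map that evicts the least recently used domain when full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    @Override
    public String filter(String url) {
        String domain = hostOf(url);
        if (domain == null) {
            return null;                        // unparsable URL or no host: drop it
        }
        if (cache.get(domain) != null) {
            return url;                         // cache hit: domain is known to be allowed
        }
        databaseLookups++;
        if (database.test(domain)) {
            cache.put(domain, Boolean.TRUE);    // remember allowed domains only
            return url;
        }
        return null;                            // not in cache, not in database
    }

    /** Assumption: the lower-cased host of the URL is used as the domain name. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

As in the steps above, only domains found in the database are cached, so every URL from a rejected domain still costs a database lookup. Caching negative answers as well would avoid that, at the price of stale rejections when the table changes.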


The plugin reads the cache size, JDBC driver, connection string, table name, and domain field
from nutch-site.xml.
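
To make the five settings concrete, here is a rough Java sketch of a holder for them. Everything in it is an assumption for illustration: the property keys (urlfilter.db.*) and the class name are hypothetical, java.util.Properties stands in for whatever configuration object Nutch hands to the plugin, and the generated query is only one plausible way to use the table and domain field.

```java
import java.util.Properties;

/** Hypothetical holder for the five settings the plugin is described as reading. */
class DbFilterSettingsSketch {
    final int cacheSize;
    final String jdbcDriver;
    final String connectionString;
    final String table;
    final String domainField;

    DbFilterSettingsSketch(Properties p) {
        // All keys below are made-up names, not the plugin's real property names.
        cacheSize = Integer.parseInt(require(p, "urlfilter.db.cache.size"));
        jdbcDriver = require(p, "urlfilter.db.jdbc.driver");
        connectionString = require(p, "urlfilter.db.connection");
        table = require(p, "urlfilter.db.table");
        domainField = require(p, "urlfilter.db.domain.field");
    }

    /** Fails fast when a setting is missing or blank. */
    private static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing setting: " + key);
        }
        return value.trim();
    }

    /** Table and field come from configuration; the domain is a bind parameter. */
    String lookupSql() {
        return "SELECT 1 FROM " + table + " WHERE " + domainField + " = ?";
    }
}
```

Because the table and field names are concatenated into the statement, they should come only from trusted configuration. The domain itself is passed as a bind parameter.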


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

