nutch-dev mailing list archives

From "Susam Pal (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl
Date Fri, 15 Feb 2008 19:50:07 GMT
URL filtering is always disabled in Generator when invoked by Crawl
-------------------------------------------------------------------

                 Key: NUTCH-612
                 URL: https://issues.apache.org/jira/browse/NUTCH-612
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0
            Reporter: Susam Pal
             Fix For: 1.0.0


When a crawl is run with the 'bin/nutch crawl' command, Generator performs no URL filtering,
even if 'crawl.generate.filter' is set to true in the configuration file.

The problem is that the Generator's generate method unconditionally overwrites the job's
filter setting with the value of the 'filter' argument passed to it:

{code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}

The code in Crawl.java always passes false for this argument, so the configured value is never used.

This has been fixed by exposing an overloaded generate method that takes only the 5 arguments
Crawl needs to set. The overload reads 'crawl.generate.filter' from the configuration and sets
the filter value accordingly.
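
The behaviour and the fix described above can be illustrated with a small, self-contained sketch. This is not the actual Nutch or Hadoop code: the class GeneratorFilterSketch, both method names, and the use of a plain Map in place of the job configuration are hypothetical, and the default of true for an unset 'crawl.generate.filter' is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Generator; a plain Map replaces the Hadoop job configuration.
class GeneratorFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Models the reported behaviour: the caller's flag always overwrites
    // whatever the configuration file says.
    static boolean effectiveFilterWithExplicitFlag(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        job.put(CRAWL_GENERATE_FILTER, Boolean.toString(filter));
        return Boolean.parseBoolean(job.get(CRAWL_GENERATE_FILTER));
    }

    // Models the described fix: an overload without the flag reads the
    // configured value and passes it on. The default of "true" is an assumption.
    static boolean effectiveFilterFromConf(Map<String, String> conf) {
        boolean filter = Boolean.parseBoolean(conf.getOrDefault(CRAWL_GENERATE_FILTER, "true"));
        return effectiveFilterWithExplicitFlag(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // A caller that hard-codes false (as Crawl is reported to do) disables filtering.
        System.out.println("explicit false: " + effectiveFilterWithExplicitFlag(conf, false));

        // A caller that uses the overload gets the configured value.
        System.out.println("from conf:      " + effectiveFilterFromConf(conf));
    }
}
```

In this sketch the configured value only takes effect when the caller uses the overload that omits the flag, which is the change the report describes for Crawl.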

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

