nutch-dev mailing list archives

From "Susam Pal (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl
Date Fri, 15 Feb 2008 19:54:08 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susam Pal updated NUTCH-612:
----------------------------

    Attachment: NUTCH-612v0.1.patch

Attached patch to fix the bug. This modifies Crawl.java and Generator.java.

> URL filtering is always disabled in Generator when invoked by Crawl
> -------------------------------------------------------------------
>
>                 Key: NUTCH-612
>                 URL: https://issues.apache.org/jira/browse/NUTCH-612
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-612v0.1.patch
>
>
> When a crawl is run with the 'bin/nutch crawl' command, Generator does no URL filtering,
> even if 'crawl.generate.filter' is set to true in the configuration file.
> The problem is that Generator's generate method unconditionally sets the job's filter
> value to whatever is passed to it:
> {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}
> Crawl.java always passes false for this argument.
> The fix exposes an overloaded generate method that takes only the 5 arguments that
> Crawl needs to set. This overloaded method reads the configuration and sets the filter
> value appropriately.
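
To illustrate the behaviour described in the issue, here is a minimal, self-contained sketch. It is not the attached patch and does not use the Nutch or Hadoop classes: a plain Map stands in for the job configuration, the class name GenerateFilterSketch and the helper getBoolean are hypothetical, the real generate methods take more arguments than shown, and the default of true used when the property is unset is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Generator logic described in NUTCH-612.
public class GenerateFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Hypothetical helper: read a boolean property, falling back to a default.
    static boolean getBoolean(Map<String, String> conf, String key, boolean def) {
        String value = conf.get(key);
        return value == null ? def : Boolean.parseBoolean(value);
    }

    // Mirrors the reported problem: the job's filter value is overwritten
    // with the argument, whatever the configuration says.
    static Map<String, String> generate(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        job.put(CRAWL_GENERATE_FILTER, Boolean.toString(filter));
        return job;
    }

    // Sketch of the overload idea: the caller does not pass a filter flag,
    // so the value is read from the configuration instead.
    // Assumption: filtering defaults to true when the property is unset.
    static Map<String, String> generate(Map<String, String> conf) {
        boolean filter = getBoolean(conf, CRAWL_GENERATE_FILTER, true);
        return generate(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // A caller that hard-codes false silently disables filtering.
        System.out.println(generate(conf, false).get(CRAWL_GENERATE_FILTER)); // prints "false"

        // The overload honours the configured value.
        System.out.println(generate(conf).get(CRAWL_GENERATE_FILTER)); // prints "true"
    }
}
```

With an overload of this kind, a caller such as Crawl no longer has to supply a filter flag at all, so it cannot override the configured value by accident.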

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

