nutch-dev mailing list archives

From "Marko Bauhardt (JIRA)" <>
Subject [jira] Updated: (NUTCH-249) black- white list url filtering
Date Wed, 29 Jul 2009 09:04:14 GMT


Marko Bauhardt updated NUTCH-249:

    Attachment: bw.patch

I have updated the patch for the Nutch source code release-1.0 (
I hope the license header and code formatting are OK.

Here is an example of how to use the black/white list filter.

Create a start URL file for the crawldb, e.g. in /tmp/urls/start/urls.txt, for example

Create a limit URL file for the bwdb, e.g. in /tmp/urls/limit/urls.txt, for example

Create an exclude URL file for the bwdb, e.g. in /tmp/urls/exclude/urls.txt, for example
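
The three files above could be created with a short script like the following. This is only a sketch: the example.com URLs are hypothetical placeholders, not the values from the original message, and URL_ROOT is a variable introduced here for convenience. It assumes one URL per line, as in a normal Nutch seed file.

```shell
#!/bin/sh
# Sketch only: all URLs below are hypothetical placeholders.
# URL_ROOT is an assumed helper variable; it defaults to the paths used above.
URL_ROOT="${URL_ROOT:-/tmp/urls}"

mkdir -p "$URL_ROOT/start" "$URL_ROOT/limit" "$URL_ROOT/exclude"

# Start URL, injected into the crawldb.
printf '%s\n' 'http://www.example.com/' > "$URL_ROOT/start/urls.txt"

# Limit (white list) URL for the bwdb.
printf '%s\n' 'http://www.example.com/docs' > "$URL_ROOT/limit/urls.txt"

# Exclude (black list) URL for the bwdb.
printf '%s\n' 'http://www.example.com/docs/private.html' > "$URL_ROOT/exclude/urls.txt"
```

Setting URL_ROOT before running the script writes the files somewhere other than /tmp/urls.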

bin/nutch inject crawldb /tmp/urls/start/
bin/nutch bwdb /tmp/urls/limit -white
bin/nutch bwdb /tmp/urls/exclude -black
bin/nutch generate crawldb segments
bin/nutch fetch segments/20090729103233/
bin/nutch crawldb/ bwdb/ segments/20090729103233/ true false
(Usage: <crawldb> <bwdb> <segment> <normalize> <filter>)

Check your crawldb. It should contain only URLs starting with "",
but not the URL "".
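
One way to automate this check is a small helper that reads a list of URLs, one per line, and reports any that fall outside the limit prefix or match the excluded URL. This is a sketch, not part of the patch: check_urls is a hypothetical function, and it assumes the expected result is a plain prefix match for the white list and an exact match for the black list, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or the patch).
# Reads one URL per line on stdin.
#   $1 = limit prefix (white list), $2 = excluded URL (black list)
# Prints each offending URL and returns non-zero if any is found.
check_urls() {
  limit_prefix=$1
  excluded_url=$2
  status=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip empty lines
    case "$url" in
      "$excluded_url")                 # exact match against the black list
        echo "excluded URL present: $url"
        status=1
        ;;
      "$limit_prefix"*)                # starts with the white list prefix
        ;;
      *)
        echo "outside limit: $url"
        status=1
        ;;
    esac
  done
  return "$status"
}
```

The URL list could come from a crawldb dump (bin/nutch readdb crawldb -dump <out_dir>), with the URLs extracted from the dump files into one-per-line form first; that extraction step is left out here.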

> black- white list url filtering
> -------------------------------
>                 Key: NUTCH-249
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Assignee: Dennis Kubes
>            Priority: Trivial
>             Fix For: 1.1
>         Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch
> Existing URL filter mechanisms need to process each URL against each filter pattern.
> For very large filter sets this may not scale very well.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
