nutch-dev mailing list archives

From "Matt MacDonald (JIRA)" <>
Subject [jira] [Created] (NUTCH-1468) Redirects that are external links not adhering to db.ignore.external.links
Date Sun, 09 Sep 2012 11:37:07 GMT
Matt MacDonald created NUTCH-1468:

             Summary: Redirects that are external links not adhering to db.ignore.external.links
                 Key: NUTCH-1468
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 2.1
            Reporter: Matt MacDonald
         Attachments: redirects-to-external.patch

Patch attached for this.


This is likely a question for Ferdy, but input from anyone else would be
great. When running a crawl that I would expect to stay within a single
domain, I'm seeing the crawler jump out to other domains. I'm using the
trunk of Nutch 2.x, which includes the following:

The goal is to perform a focused crawl against a single domain and
restrict the crawler from expanding beyond that domain. I've set the
db.ignore.external.links property to true. I do not want to add a
regex to regex-urlfilter.txt as I will be adding several thousand
urls. The domain that I am crawling has documents with outlinks that
are still within the domain but then redirect to external domains.
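The behavior described suggests that `db.ignore.external.links` is applied to outlinks discovered at parse time, but not to the target URL of a redirect followed during fetching. A minimal sketch of the missing check is below; the class and method names are hypothetical for illustration, not the contents of the attached patch:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch only: when db.ignore.external.links is true, a redirect target
// should be dropped unless its host matches the host of the URL that was
// being fetched. Names here are illustrative, not from the actual patch.
public class RedirectHostFilter {

    /** Returns the redirect target if it stays on the same host, else null. */
    public static String filterRedirect(String fromUrl, String toUrl) {
        try {
            String fromHost = new URL(fromUrl).getHost().toLowerCase();
            String toHost = new URL(toUrl).getHost().toLowerCase();
            return fromHost.equals(toHost) ? toUrl : null;
        } catch (MalformedURLException e) {
            // Unparsable URLs are dropped as well.
            return null;
        }
    }

    public static void main(String[] args) {
        // Internal redirect: kept.
        System.out.println(filterRedirect("http://example.com/a",
                                          "http://example.com/b"));
        // External redirect: dropped, prints null.
        System.out.println(filterRedirect("http://example.com/a",
                                          "http://other.org/b"));
    }
}
```

With a check like this wired into the fetcher's redirect handling, a redirect from an in-domain page to an external host would be discarded instead of entering the fetch queue.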

cat urls/seed.txt

cat conf/nutch-site.xml
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>

    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
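The first description above belongs to `db.ignore.external.links`, which the report says is set to true. A complete property block in nutch-site.xml would look roughly like this (the value comes from the report; the description is the stock one):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
```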

bin/nutch crawl urls -depth 8 -topN 100000

results in the crawl eventually fetching and parsing documents on
domains external to the only link in the seed.txt file.

I would not expect to see URLs like the following in my logs and in
the HBase webpage table:


I'm reviewing the code changes but am still getting up to speed on the
code base. Any ideas while I continue to dig around? Configuration
issue or code?


This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
