nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled
Date Thu, 06 Apr 2017 09:05:41 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958608#comment-15958608 ]

Sebastian Nagel commented on NUTCH-2365:
----------------------------------------

See also [thread on user mailing list|https://lists.apache.org/thread.html/7d244daeb29dbbfcbb6e792a0d6cf861fd2829940b873d2acd695862@%3Cuser.nutch.apache.org%3E]

> HTTP Redirects to SubDomains don't get crawled
> ----------------------------------------------
>
>                 Key: NUTCH-2365
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2365
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.12
>         Environment: Fedora 25
>            Reporter: Sriram Nookala
>             Fix For: 1.14
>
>
> Crawling http://www.mercenarytrader.com, which redirects to https://members.mercenarytrader.com,
> fails: Nutch does not follow the redirect even though 'db.ignore.external.links' is set to 'true'
> and 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by domain:
>       String origHost = new URL(urlString).getHost().toLowerCase();
>       String newHost = new URL(newUrl).getHost().toLowerCase();
>       if (ignoreExternalLinks) {
>         if (!origHost.equals(newHost)) {
>           if (LOG.isDebugEnabled()) {
>             LOG.debug(" - ignoring redirect " + redirType + " from "
>                 + urlString + " to " + newUrl
>                 + " because external links are ignored");
>           }
>           return null;
>         }
>       }
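
To illustrate the difference between the host comparison quoted above and a by-domain comparison, here is a minimal, self-contained sketch. It is not the Nutch fix and does not use Nutch classes: the class RedirectDomainCheck and the helpers naiveDomain, sameHost and sameDomain are hypothetical names, and naiveDomain simply keeps the last two labels of the host name, so it is an assumption that only holds for simple names such as the ones in this report (it gets multi-label suffixes like "co.uk" and IP addresses wrong).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch.
class RedirectDomainCheck {

  // Simplified "domain": the last two dot-separated labels of the host.
  // Assumption: no public-suffix handling, no IP address handling.
  static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    int last = h.lastIndexOf('.');
    if (last <= 0) {
      return h; // single-label host such as "localhost"
    }
    int prev = h.lastIndexOf('.', last - 1);
    return prev < 0 ? h : h.substring(prev + 1);
  }

  // Mirrors the comparison in the quoted snippet: full host names must match.
  static boolean sameHost(String from, String to) throws MalformedURLException {
    return new URL(from).getHost().equalsIgnoreCase(new URL(to).getHost());
  }

  // By-domain comparison: only the simplified domain part must match.
  static boolean sameDomain(String from, String to) throws MalformedURLException {
    String a = naiveDomain(new URL(from).getHost());
    String b = naiveDomain(new URL(to).getHost());
    return a.equals(b);
  }

  public static void main(String[] args) throws MalformedURLException {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("same host:   " + sameHost(from, to));
    System.out.println("same domain: " + sameDomain(from, to));
  }
}
```

For the two URLs from the report, the host comparison rejects the redirect while the by-domain comparison accepts it, which is the behaviour one would expect from 'byDomain' mode.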



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
