nutch-dev mailing list archives

From "Yossi Tamari (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2456) Redirected documents are not indexed
Date Mon, 06 Nov 2017 17:54:00 GMT
Yossi Tamari created NUTCH-2456:
-----------------------------------

             Summary: Redirected documents are not indexed
                 Key: NUTCH-2456
                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.13
            Reporter: Yossi Tamari
            Priority: Critical


If http.redirect.max is set to a positive value, the Fetcher follows redirects itself, creating
a new CrawlDatum for the redirect target.
If the redirect target is fetched and parsed, indexing it hits a special case: dbDatum is null.
As a result, the check at [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
skips the document on the assumption that it only has inlinks, when in fact it has everything
except dbDatum.
I'm not sure what the correct fix is. It seems to me the condition should use AND instead
of OR in any case, but I may not understand the original intent. What is clear is that the
condition is too strict as written.
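
To make the OR versus AND question concrete, here is a minimal, self-contained sketch. It is
not the Nutch source: the class RedirectIndexSketch, both method names, and the parameter names
other than dbDatum are assumptions made for illustration, and plain Objects stand in for the
real datum and parse types. It only models a null check over four per-URL inputs, which is how
the report describes the condition.

```java
public class RedirectIndexSketch {

    // Hypothetical model of the strict check described in the report:
    // the document is skipped if ANY of the four inputs is missing.
    static boolean skipWithOr(Object fetchDatum, Object dbDatum,
                              Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null;
    }

    // Hypothetical AND variant floated in the report:
    // the document is skipped only if ALL four inputs are missing.
    static boolean skipWithAnd(Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        return fetchDatum == null && dbDatum == null
                && parseText == null && parseData == null;
    }

    public static void main(String[] args) {
        Object present = new Object();

        // Redirect target: fetched and parsed, but no dbDatum.
        System.out.println("redirect target, OR check skips:  "
                + skipWithOr(present, null, present, present));
        System.out.println("redirect target, AND check skips: "
                + skipWithAnd(present, null, present, present));

        // URL known only through inlinks: none of the four inputs exist.
        System.out.println("inlinks only, OR check skips:     "
                + skipWithOr(null, null, null, null));
        System.out.println("inlinks only, AND check skips:    "
                + skipWithAnd(null, null, null, null));
    }
}
```

In this model the OR check skips the redirect target while the AND check lets it through, and
both still skip a URL that is known only through inlinks. Whether a pure AND is the right fix
is left open here, as in the report: it would also let through records that lack parse data,
which the indexer may not be able to handle.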



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
