nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-2456) Redirected documents are not indexed
Date Tue, 07 Nov 2017 16:03:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242247#comment-16242247 ]

Sebastian Nagel edited comment on NUTCH-2456 at 11/7/17 4:02 PM:
-----------------------------------------------------------------

For every item in a redirect chain URL -> target_1 -> target_2 -> ... -> target_n, a new
CrawlDatum is created and stored in the segment's crawl_fetch directory. After running "updatedb",
these CrawlDatum objects are added to the CrawlDb, and an index job will get them as input.
The behavior you describe should only occur if the CrawlDb isn't updated (or is updated with
-noAdditions) before indexing. Is this a possible reason? If in doubt, are you able to share
more details?


was (Author: wastl-nagel):
For every item in a redirect chain  URL -> target_1 -> target_2 -> target_n, a new
CrawlDatum is created and stored in the segment.  After running "updatedb" these CrawlDatum's
are added to the CrawlDb, and an index job will get them as input. Only if the CrawlDb isn't
updated (or this is done with -noAdditions) before indexing. Is this a possible reason?

> Redirected documents are not indexed
> ------------------------------------
>
>                 Key: NUTCH-2456
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow redirects, creating
> a new CrawlDatum.
> If the redirected URL is fetched and parsed, during indexing for it we have a special
> case: dbDatum is null. This means that in
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
> the document is not indexed, as it is assumed it only has inlinks (actually it has everything
> but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition should use AND
> instead of OR anyway, but I may not understand the original intent. It is clear that it is
> too strict as is.
> However, the code following that line assumes all 4 objects are not null, so a patch
> would need to change more than just the condition.
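
To illustrate the OR-versus-AND question raised in the description, here is a minimal,
self-contained sketch. It is not the Nutch source: the class RedirectIndexSketch and both
methods are hypothetical, and apart from dbDatum the parameter names (fetchDatum, parseText,
parseData) are assumptions standing in for the "4 objects" mentioned above.

```java
// Hypothetical sketch only; not taken from IndexerMapReduce.
class RedirectIndexSketch {

    // OR variant, as the report describes the current check:
    // the document is skipped if ANY of the four inputs is missing.
    static boolean skipWithOr(Object fetchDatum, Object dbDatum,
                              Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null
            || parseText == null || parseData == null;
    }

    // AND variant, as suggested in the report:
    // the document is skipped only if ALL four inputs are missing.
    static boolean skipWithAnd(Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        return fetchDatum == null && dbDatum == null
            && parseText == null && parseData == null;
    }

    public static void main(String[] args) {
        // Redirect target as described: fetched and parsed, but no dbDatum.
        Object fetchDatum = new Object();
        Object dbDatum = null;
        Object parseText = "parsed text";
        Object parseData = new Object();

        System.out.println("OR  skips document: "
            + skipWithOr(fetchDatum, dbDatum, parseText, parseData));
        System.out.println("AND skips document: "
            + skipWithAnd(fetchDatum, dbDatum, parseText, parseData));
    }
}
```

With the AND variant the redirect target would no longer be skipped, but any code after the
check would then have to handle a null dbDatum itself, which matches the remark that a patch
needs to change more than the condition.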



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
