nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb
Date Wed, 08 Nov 2017 16:27:00 GMT


ASF GitHub Bot commented on NUTCH-2456:

sebastian-nagel commented on a change in pull request #240: NUTCH-2456 - Redirected documents
are not indexed

 File path: src/java/org/apache/nutch/indexer/
 @@ -238,38 +238,37 @@ public void reduce(Text key, Iterator<NutchWritable> values,
     // Whether to delete GONE or REDIRECTS
-    if (delete && fetchDatum != null && dbDatum != null) {
-      if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE
-          || dbDatum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
+    if (delete) {
 Review comment:
   Yes, but the indexer also takes the full CrawlDb (and optionally the LinkDb) as input. In a
crawl with many cycles, the CrawlDb may become significantly bigger than the currently processed
segment(s). There is no way to filter out CrawlDb items without counterparts in the segments
beforehand, so all of them end up in the reducer.
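
To illustrate the point, here is a minimal sketch of how a reducer over merged inputs can recognize a key that came only from the CrawlDb. Everything in it is hypothetical: `ReducerInputSketch`, `Source`, `Value` and `isCrawlDbOnly` are stand-ins invented for this example and are not Nutch classes or methods. It only assumes that each value reaching the reducer can be attributed to either the CrawlDb or a segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Nutch classes.
class ReducerInputSketch {

  // Where a value grouped under one URL key came from (assumed labels).
  enum Source { CRAWL_DB, SEGMENT_FETCH, SEGMENT_PARSE }

  static final class Value {
    final Source source;
    Value(Source source) { this.source = source; }
  }

  // True when the key has a CrawlDb entry but no counterpart in any segment.
  // Such keys cannot be dropped before the reduce step in this sketch,
  // so the check has to happen here.
  static boolean isCrawlDbOnly(List<Value> values) {
    boolean sawCrawlDb = false;
    boolean sawSegment = false;
    for (Value v : values) {
      if (v.source == Source.CRAWL_DB) {
        sawCrawlDb = true;
      } else {
        sawSegment = true;
      }
    }
    return sawCrawlDb && !sawSegment;
  }

  public static void main(String[] args) {
    // URL known from an earlier cycle, not part of the current segment.
    List<Value> oldUrl = Arrays.asList(new Value(Source.CRAWL_DB));
    // URL that is in the CrawlDb and was fetched and parsed in this segment.
    List<Value> currentUrl = Arrays.asList(
        new Value(Source.CRAWL_DB),
        new Value(Source.SEGMENT_FETCH),
        new Value(Source.SEGMENT_PARSE));
    // Redirect target: segment data only, no CrawlDb entry.
    List<Value> redirectTarget = new ArrayList<>(Arrays.asList(
        new Value(Source.SEGMENT_FETCH),
        new Value(Source.SEGMENT_PARSE)));

    System.out.println("old URL is CrawlDb-only: " + isCrawlDbOnly(oldUrl));
    System.out.println("current URL is CrawlDb-only: " + isCrawlDbOnly(currentUrl));
    System.out.println("redirect target is CrawlDb-only: " + isCrawlDbOnly(redirectTarget));
  }
}
```

In a crawl with many cycles, most keys would fall into the first case, which is why the cost of any extra work done for them in the reducer matters.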

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Allow to index pages/URLs not contained in CrawlDb
> --------------------------------------------------
>                 Key: NUTCH-2456
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
> If http.redirect.max is set to a positive value, the Fetcher will follow redirects, creating
> a new CrawlDatum.
> If the redirected URL is fetched and parsed, during indexing for it we have a special
> case: dbDatum is null. This means that in []
> the document is not indexed, as it is assumed it only has inlinks (actually it has everything
> but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition should use AND
> instead of OR anyway, but I may not understand the original intent. It is clear that it is
> too strict as is.
> However, the code following that line assumes all 4 objects are not null, so a patch
> would need to change more than just the condition.
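
The difference between the OR and AND variants discussed in the description can be sketched as follows. This is an assumption-laden illustration, not the code from the patch: the class `NullCheckSketch` and both methods are hypothetical, and the names `parseText` and `parseData` for the two remaining objects are assumed here (the message only names `fetchDatum` and `dbDatum` and speaks of "all 4 objects").

```java
// Hypothetical illustration; not the actual Nutch indexer code.
class NullCheckSketch {

  // OR variant: skip the document as soon as ANY of the four objects is missing.
  static boolean skipWithOr(Object fetchDatum, Object dbDatum,
                            Object parseText, Object parseData) {
    return fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null;
  }

  // AND variant: skip the document only when ALL four objects are missing.
  static boolean skipWithAnd(Object fetchDatum, Object dbDatum,
                             Object parseText, Object parseData) {
    return fetchDatum == null && dbDatum == null
        && parseText == null && parseData == null;
  }

  public static void main(String[] args) {
    Object present = new Object();

    // Redirect target as described in the issue: everything but dbDatum.
    boolean orSkips = skipWithOr(present, null, present, present);
    boolean andSkips = skipWithAnd(present, null, present, present);

    System.out.println("OR variant skips redirect target: " + orSkips);
    System.out.println("AND variant skips redirect target: " + andSkips);
  }
}
```

With the AND variant the redirect target is no longer skipped, but any later code that dereferences `dbDatum` without a null check would then fail, which matches the reporter's remark that a patch has to change more than the condition.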

This message was sent by Atlassian JIRA
