nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb
Date Wed, 08 Nov 2017 22:24:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244849#comment-16244849 ]

Sebastian Nagel commented on NUTCH-2456:
----------------------------------------

{quote}What will this patch achieve then? Just the case of ignoring dbDatum i presume?{quote}
No, the dbDatum is never ignored. If it is present, it is used. But if it's not there (because
the CrawlDb wasn't updated or because the indexer is called without a CrawlDb), indexing
proceeds without it.
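
To make the difference concrete, here is a minimal sketch of the two checks. It is not the code from IndexerMapReduce or from the pull request: the class and method names are hypothetical, and plain Object parameters stand in for the fetchDatum, dbDatum, parse text and parse data.

```java
/**
 * Hypothetical sketch, not Nutch code: contrasts a strict check that
 * requires dbDatum with a relaxed one that treats dbDatum as optional.
 */
final class IndexDecisionSketch {

  /** Strict variant: a missing dbDatum (or any other part) skips the document. */
  static boolean shouldIndexStrict(Object fetchDatum, Object dbDatum,
      Object parseText, Object parseData) {
    return fetchDatum != null && dbDatum != null
        && parseText != null && parseData != null;
  }

  /** Relaxed variant: dbDatum is deliberately left out of the check. */
  static boolean shouldIndexRelaxed(Object fetchDatum, Object dbDatum,
      Object parseText, Object parseData) {
    return fetchDatum != null && parseText != null && parseData != null;
  }

  public static void main(String[] args) {
    // A redirect target that was fetched and parsed but has no CrawlDb entry yet.
    Object fetchDatum = "fetchDatum";
    Object dbDatum = null;
    Object parseText = "text";
    Object parseData = "data";

    System.out.println("strict:  "
        + shouldIndexStrict(fetchDatum, dbDatum, parseText, parseData));
    System.out.println("relaxed: "
        + shouldIndexRelaxed(fetchDatum, dbDatum, parseText, parseData));
  }
}
```

In the relaxed variant the dbDatum parameter is kept only to mirror the strict signature; any code that runs afterwards still has to guard its own accesses to dbDatum.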

{quote}How about index.*.md? ...{quote}
Indexing filters still get the fetchDatum, not the dbDatum. Only reprUrl and signature are
copied from dbDatum to fetchDatum.
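
As an illustration only, the copy step could look like the sketch below. SimpleDatum is a hypothetical stand-in for CrawlDatum, with reprUrl modelled as a plain field, and the null guards are an assumption of this sketch rather than something taken from the patch.

```java
/**
 * Hypothetical sketch, not Nutch code: the fetchDatum stays the datum
 * passed on to indexing filters; only two values come from the dbDatum.
 */
final class DatumMergeSketch {

  /** Simplified stand-in for a crawl datum (assumed fields, not the real class). */
  static final class SimpleDatum {
    String status;
    byte[] signature;
    String reprUrl;
  }

  /** Copies signature and reprUrl from dbDatum, if there is one, into fetchDatum. */
  static SimpleDatum mergeForIndexing(SimpleDatum fetchDatum, SimpleDatum dbDatum) {
    if (dbDatum == null) {
      return fetchDatum; // no CrawlDb entry: use the fetchDatum as it is
    }
    // Assumption: values missing in the dbDatum do not overwrite the fetchDatum.
    if (dbDatum.signature != null) {
      fetchDatum.signature = dbDatum.signature.clone();
    }
    if (dbDatum.reprUrl != null) {
      fetchDatum.reprUrl = dbDatum.reprUrl;
    }
    return fetchDatum;
  }
}
```

Everything else on the fetchDatum, such as its status, is left untouched.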

These are two more features which do not work without a (properly updated) CrawlDb, same as dedup
and orphans. It's worth a note or warning...

{quote}I have a hard time reading GitHub's output here, my problem.{quote}
Better, just look at the diff: https://github.com/apache/nutch/pull/240/files

> Allow to index pages/URLs not contained in CrawlDb
> --------------------------------------------------
>
>                 Key: NUTCH-2456
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, during indexing for it we have a special case: dbDatum is null. This means that in [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259] the document is not indexed, as it is assumed it only has inlinks (actually it has everything but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition should use AND instead of OR anyway, but I may not understand the original intent. It is clear that it is too strict as is.
> However, the code following that line assumes all 4 objects are not null, so a patch would need to change more than just the condition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
