nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
Date Mon, 14 Sep 2015 15:59:46 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743717#comment-14743717 ]

Sebastian Nagel commented on NUTCH-2093:
----------------------------------------

+1
A fetch datum of an injected URL can also lack a signature. Only re-fetched fetch datums
may carry a signature, and then it is the signature from the previous fetch, which may differ
from the current one.
In general, it would be more transparent to also pass the db datum to the indexing filters,
but that would change the IndexingFilter interface.



> Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2093
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2093
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-2093.patch
>
>
> In IndexerMapReduce, a fetchDatum is passed to the indexing filters. However, when this
> fetchDatum was created via FreeGenerator, it has no signature attached, so the indexing
> filters never see one.
> This patch copies the signature from the dbDatum just before the fetchDatum is passed to
> the indexing filters.
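
The copy step described above can be illustrated with a minimal sketch. It is not the
attached patch: Datum is a hypothetical stand-in for CrawlDatum that models only the
signature field, and SignatureCopy.copySignature is an illustrative helper name. Whether
the actual patch overwrites a signature already present on the fetch datum is not stated
here; this sketch assumes it does, so that a stale signature from a previous fetch is
replaced by the one stored in the db datum.

```java
// Hypothetical stand-in for CrawlDatum; only the signature field is modeled.
final class Datum {
    private byte[] signature; // null means "no signature attached"

    byte[] getSignature() {
        return signature;
    }

    void setSignature(byte[] signature) {
        this.signature = signature;
    }
}

// Illustrative helper, not part of Nutch.
final class SignatureCopy {
    private SignatureCopy() {
    }

    // Copies the db datum's signature onto the fetch datum, intended to run
    // just before the fetch datum is handed to the indexing filters.
    // Assumption: an existing fetch signature is overwritten, since it may
    // stem from a previous fetch. Returns true if a signature was copied.
    static boolean copySignature(Datum dbDatum, Datum fetchDatum) {
        if (dbDatum == null || fetchDatum == null) {
            return false;
        }
        byte[] sig = dbDatum.getSignature();
        if (sig == null) {
            return false; // nothing to copy; leave the fetch datum untouched
        }
        fetchDatum.setSignature(sig.clone()); // defensive copy of the bytes
        return true;
    }
}
```

Called as SignatureCopy.copySignature(dbDatum, fetchDatum) right before the filters run,
a fetch datum created via FreeGenerator would then expose the same signature bytes as the
db datum, while a db datum without a signature leaves the fetch datum unchanged.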



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
