nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
Date Tue, 05 Jan 2016 14:10:40 GMT


Markus Jelsma commented on NUTCH-2184:

Hello Lewis!

* It should be no problem. But since IndexerMapReduce is complicated, I would love to have
a simple unit test for it so we can guard against breaking things.
* We should make sure the fetchDatum also carries the desired parseData fields that we usually
store in the CrawlDatum. This is not always true, see the fix I did for NUTCH-2093. I think
it should be possible to index as many fields as with the CrawlDatum. If you implement it
as such, then custom indexing filters that use CrawlDatum still work :)
* Well yes, I think I already answered that question myself indeed, silly.
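
The fallback discussed in the second point can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchDatum`, `IndexDatumFallback` and `selectDatum` are hypothetical stand-ins, not Nutch classes or methods, and the real IndexerMapReduce and CrawlDatum are not used here. The idea shown is: prefer the CrawlDb datum when it exists, otherwise fall back to the fetch datum and copy over any parse fields it lacks, so that indexing filters reading the datum still find them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a datum; NOT org.apache.nutch.crawl.CrawlDatum.
final class SketchDatum {
    final String source; // "crawldb" or "fetch", for illustration only
    final Map<String, String> metadata = new HashMap<>();

    SketchDatum(String source) {
        this.source = source;
    }
}

final class IndexDatumFallback {

    // Picks the datum to pass to indexing filters (hypothetical helper).
    // Returns null when there is neither a CrawlDb datum nor a fetch datum.
    static SketchDatum selectDatum(SketchDatum dbDatum,
                                   SketchDatum fetchDatum,
                                   Map<String, String> parseMeta) {
        if (dbDatum != null) {
            return dbDatum; // normal case: a CrawlDb entry is available
        }
        if (fetchDatum == null) {
            return null; // nothing to index for this record
        }
        // No CrawlDb: copy parse fields the fetch datum does not carry yet,
        // without overwriting values it already has.
        for (Map.Entry<String, String> e : parseMeta.entrySet()) {
            fetchDatum.metadata.putIfAbsent(e.getKey(), e.getValue());
        }
        return fetchDatum;
    }

    public static void main(String[] args) {
        SketchDatum fetch = new SketchDatum("fetch");
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("example.key", "example-value"); // made-up field name

        SketchDatum chosen = selectDatum(null, fetch, parseMeta);
        System.out.println(chosen.source + " " + chosen.metadata);
    }
}
```

A unit test along the lines of the first point could call such a helper with and without a CrawlDb datum and check which datum comes back and which fields it carries.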

This feature would be very handy for small segments but large CrawlDBs! 

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>                 Key: NUTCH-2184
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>         Attachments: NUTCH-2184.patch
> Sometimes when working with distributed team(s), we have found that we can 'lose' data
> structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying
> crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, currently in IndexerMapReduce
> crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case where you
> ONLY have segments and want to force an index for every record present.
