nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
Date Thu, 29 Jun 2017 16:48:00 GMT


Lewis John McGibbney commented on NUTCH-2184:

Hi [~markus17], I need to finish the bloody MR tests over at
I am very tight on cycles right now. If you can pick up the patch and want to work with it
then by all means please go ahead ;)

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>                 Key: NUTCH-2184
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
> Sometimes when working with distributed team(s), we have found that we can 'lose' data
structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying
crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in IndexerMapReduce
crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case where you
ONLY have segments and want to force an index for every record present.

This message was sent by Atlassian JIRA
