nutch-dev mailing list archives

From "atuldj.jadhav" <>
Subject Updating Nutch Crawled index.
Date Thu, 04 Oct 2012 18:26:20 GMT

I have a Solr (products) index built using Nutch, which crawls a product
HTML page (a list) and indexes the data. The unique ID for the product
index is productID (other schema fields are title, quarter, disclaimer,
etc.).

I now have a requirement to update the existing index for all products with
an additional field (product manager). I have a set of product XML files
that contain the product ID and the manager name.

I can easily use the Data Import Handler to iterate through all the XML
files and read each productID and its corresponding manager name.

However, when I run a full-import on the existing index, it replaces the
entry for each productID: the old values are deleted and only productID and
manager are written. The existing values for title, quarter, disclaimer,
etc. are lost, so I end up with only productID and manager in the index.
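
As a minimal sketch of the behaviour described above, assuming a plain
Python dict stands in for the index: this is not Solr or Data Import
Handler code, the function names are hypothetical, and the field values are
invented for illustration. It contrasts replacing a whole document by its
unique key with merging new fields into the existing document.

```python
# Toy model: a dict keyed by productID stands in for the index.
# Nothing here calls Solr; all names and values are illustrative assumptions.

def add_document(index, doc):
    """Mimic add-by-unique-key: the new document replaces the old one whole."""
    index[doc["productID"]] = dict(doc)

def merge_document(index, doc):
    """Desired outcome: keep existing fields, add or overwrite the new ones."""
    merged = dict(index.get(doc["productID"], {}))
    merged.update(doc)
    index[doc["productID"]] = merged

# Document as produced by the crawl (made-up values).
crawled = {"productID": "P1", "title": "Widget", "quarter": "Q3",
           "disclaimer": "Sample text"}
# Document as built from the product XML file (made-up values).
from_xml = {"productID": "P1", "manager": "Jane Doe"}

replaced = {}
add_document(replaced, crawled)
add_document(replaced, from_xml)    # second add wipes title, quarter, disclaimer

merged = {}
add_document(merged, crawled)
merge_document(merged, from_xml)    # crawled fields survive, manager is added

print(sorted(replaced["P1"]))  # ['manager', 'productID']
print(sorted(merged["P1"]))    # ['disclaimer', 'manager', 'productID', 'quarter', 'title']
```

The first result matches what the full-import leaves behind; the second is
the state the index should end up in.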

How should I manage this without losing data?
