nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Trivial Update of "bin/nutch solrdedup" by LewisJohnMcgibbney
Date Sun, 03 Jul 2011 03:41:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch solrdedup" page has been changed by LewisJohnMcgibbney:

  Query the Solr server for the number of documents (say, N), then partition N among M map tasks.
For example, with two map tasks, the first handles Solr documents from
0 to (N / 2 - 1) and the second handles documents from (N / 2) to (N - 1). This can
be thought of as a linearly executing divide-and-conquer algorithm.
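
The partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the actual Nutch code: the class name `SolrPartitionSketch` and the method `partition` are hypothetical, and giving the remainder to the last task is an assumption, since the text only covers the evenly divisible case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: splits the document index range
// 0..(numDocs - 1) into numSplits contiguous ranges, one per map task.
final class SolrPartitionSketch {

    /**
     * Returns one {start, end} pair per map task, both bounds inclusive.
     * A pair with end < start denotes an empty range (more tasks than documents).
     */
    public static List<long[]> partition(long numDocs, int numSplits) {
        if (numDocs < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and numSplits > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        long perSplit = numDocs / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // Assumption: the last task absorbs any remainder of N / M.
            long end = (i == numSplits - 1) ? numDocs - 1 : start + perSplit - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }
}
```

With N = 10 and M = 2, `partition(10, 2)` yields the ranges 0 to 4 and 5 to 9, matching the 0 to (N / 2 - 1) and (N / 2) to (N - 1) split in the description.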
- '''MapReduce''':
+ '''Map Reduce''':
   * Map: Identity map where keys are digests and values are {@link SolrRecord} instances (which
contain id, boost and timestamp).
   * Reduce: After the map phase, {@link SolrRecord}s with the same digest are grouped together.
Of the documents that share a digest, all are deleted except the one with the
highest score (boost field). If two or more documents have the same score, the document
with the latest timestamp is kept. Every other document is deleted from the Solr index.
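
The selection rule in the reduce step can be sketched as below. This is only an illustration under stated assumptions: `Record` is a hypothetical stand-in for `SolrRecord` with just the three fields named above, `idsToDelete` is an invented helper, and keeping the first record seen when both boost and timestamp are equal is an assumption the description does not cover. The actual deletion request to Solr is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the reduce-side choice; not the Nutch implementation.
final class DedupReduceSketch {

    /** Stand-in for SolrRecord: only id, boost and timestamp. */
    public static final class Record {
        final String id;
        final float boost;
        final long tstamp;

        public Record(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /**
     * Given all records sharing one digest, returns the ids to delete:
     * everything except the record with the highest boost, with ties on
     * boost broken by the latest timestamp.
     */
    public static List<String> idsToDelete(List<Record> sameDigest) {
        Record keep = null;
        for (Record r : sameDigest) {
            boolean better = keep == null
                    || r.boost > keep.boost
                    || (r.boost == keep.boost && r.tstamp > keep.tstamp);
            if (better) {
                keep = r;
            }
        }
        List<String> deletes = new ArrayList<>();
        for (Record r : sameDigest) {
            if (r != keep) {
                deletes.add(r.id);
            }
        }
        return deletes;
    }
}
```

For three records with the same digest, `a` (boost 1.0, timestamp 100), `b` (boost 2.0, timestamp 50) and `c` (boost 2.0, timestamp 70), the record `c` is kept and the ids `a` and `b` are returned for deletion.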
