nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "NutchTutorial" by SebastianNagel
Date Fri, 10 Aug 2018 11:14:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by SebastianNagel:

Updates for release of Nutch 1.15, fix Deduplication section

       Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
-      Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
+      Example: bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
  === Step-by-Step: Deleting Duplicates ===
- Once indexed the entire contents, it must be disposed of duplicate urls in this way ensures that the urls are unique.
+ Duplicates (identical content but different URL) are optionally marked in the CrawlDb and are deleted later in the Solr index.
- MapReduce:
+ MapReduce "dedup" job:
-  * Map: Identity map where keys are digests and values are [[|SolrRecord]] instances (which contain id, boost and timestamp)
-  * Reduce: After map, [[|SolrRecord]]s with the same digest will be grouped together. Now, of these documents with the same digests, delete all of them except the one with the highest score (boost field). If two (or more) documents have the same score, then the document with the latest timestamp is kept. Again, every other is deleted from solr index.
+  * Map: Identity map where keys are digests and values are CrawlDatum records
+  * Reduce: CrawlDatums with the same digest are marked (except one of them) as duplicates. There are multiple heuristics available to choose the item which is not marked as duplicate - the one with the shortest URL, fetched most recently, or with the highest score.
+      Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder
-      Usage: bin/nutch dedup <solr url>
-      Example: /bin/nutch dedup http://localhost:8983/solr
+ Deletion in the index is performed by the cleaning job (see below) or if the index job is called with the command-line flag {{-deleteGone}}.
  For more information see [[|dedup documentation]].
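
The map/reduce description in the new "dedup" text can be illustrated with a minimal sketch in plain Python. This is not Nutch code: the `Record` class, its field names, the `mark_duplicates` function and its `order` parameter are all hypothetical and only mirror the three heuristics named above (highest score, fetched most recently, shortest URL). The order in which Nutch itself applies them is not shown in this message, so the default here is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a CrawlDatum; field names are illustrative."""
    url: str
    digest: str
    score: float
    fetch_time: int
    duplicate: bool = False


def mark_duplicates(records, order=("score", "fetch_time", "url_length")):
    """Mark all but one record per digest as duplicate.

    `order` lists the heuristics by priority; it is an assumption of this
    sketch, not a statement about Nutch's defaults.
    """
    # "Map" step: key every record by its content digest.
    groups = defaultdict(list)
    for rec in records:
        groups[rec.digest].append(rec)

    def preference(rec):
        # A larger tuple means "more preferred": higher score, later fetch,
        # shorter URL (hence the negated length).
        keys = {
            "score": rec.score,
            "fetch_time": rec.fetch_time,
            "url_length": -len(rec.url),
        }
        return tuple(keys[name] for name in order)

    # "Reduce" step: within each digest group keep one record, mark the rest.
    for group in groups.values():
        keeper = max(group, key=preference)
        for rec in group:
            rec.duplicate = rec is not keeper
    return records
```

Calling `mark_duplicates(records, order=("url_length",))` would keep the record with the shortest URL in each group. Nothing is deleted in this sketch; as the text above says, deletion from the index is a separate step.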
@@ -310, +311 @@

  Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.
  || Nutch || Solr   ||
+ || 1.15  || 7.3.1  ||
  || 1.14  || 6.6.0  ||
  || 1.13  || 5.5.0  ||
  || 1.12  || 5.4.1  ||
+ To install Solr:
   * download binary file from [[|here]]
   * unzip to `$HOME/apache-solr`, we will now refer to this as `${APACHE_SOLR_HOME}`
   * create resources for a new nutch solr core `cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs
@@ -321, +324 @@

   * make sure that there is no `managed-schema` "in the way": `rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema`
   * start the solr server `${APACHE_SOLR_HOME}/bin/solr start`
   * create the nutch core `${APACHE_SOLR_HOME}/bin/solr create -c nutch -d server/solr/configsets/nutch/conf/`
+ After that you need to point Nutch to the Solr instance:
+  * (Nutch 1.15 and later) edit the file {{conf/index-writers.xml}}, see IndexWriters
-  * add the core name to the Solr server URL: `-Dsolr.server.url=http://localhost:8983/solr/nutch`
+  * (until Nutch 1.14) add the core name to the Solr server URL: `-Dsolr.server.url=http://localhost:8983/solr/nutch`
  = Verify Solr installation =
  After you started Solr admin console, you should be able to access the following links:
