nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "NutchTutorial" by LewisJohnMcgibbney
Date Sat, 13 Jun 2015 17:37:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=79&rev2=80

  ## page was renamed from RunningNutchAndSolr
  ## Lang: En
  == Introduction ==
- Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web
page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking
broken links, and create a copy of all the visited pages for searching over. That’s where
Apache Solr comes in. Solr is an open source full text search framework, with Solr we can
search the visited pages from Nutch. Luckily, integration between Nutch and Solr is pretty
straightforward as explained below.
- 
+ Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration,
relying on Apache Hadoop data structures, which are great for batch processing.
+ Being pluggable and modular has its benefits: Nutch provides extensible interfaces
such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing.
Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, SolrCloud, etc.
+ With Nutch we can find Web page hyperlinks in an automated manner, reduce maintenance work
such as checking for broken links, and create a copy of all the visited pages for searching
over.
+ This tutorial explains how to use Nutch with Apache Solr. Solr is an open source full text
search framework; with Solr we can search the pages visited by Nutch. Luckily, integration
between Nutch and Solr is pretty straightforward.
  Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also
removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application
and upon Apache Lucene for indexing. Just download a binary release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
+ 
+ == Learning Outcomes ==
+ By the end of this tutorial you will have
+  * Set up and configured a local Nutch crawler to crawl on one machine
+  * Learned how to understand and configure the Nutch runtime configuration, including seed URL
lists, URLFilters, etc.
+  * Executed a Nutch crawl cycle and viewed the results of the Crawl Database
+  * Indexed Nutch crawl records into Apache Solr for full text search
+ 
+ Any issues with this tutorial should be reported to the [[http://nutch.apache.org/mailing_lists.html|Nutch
user@]] list.
  
  == Table of Contents ==
  <<TableOfContents(3)>>
