nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "FAQ" by GodmarBack
Date Wed, 06 Jan 2010 23:30:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by GodmarBack.
The comment on this change is: Corrected formatting - the {{{ must be in the first column,
apparently..
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=111&rev2=112

--------------------------------------------------

  There are at least two choices to do that:
  
    First you need to copy the .WAR file to the servlet container webapps folder.
+ {{{
-      {{{% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
+    % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
  }}}
  
    1) After building your first index, start Tomcat from the index folder.
      Assuming your index is located at /index :
+ {{{
-     {{{% cd /index/
+ % cd /index/
- % $CATATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
      '''Now you can search.'''
  
    2) After building your first index, start and stop Tomcat, which makes Tomcat extract
the Nutch webapp. Then edit nutch-site.xml and put in it the location of the
index folder.
+ {{{
-     {{{% $CATATALINA_HOME/bin/startup.sh
+ % $CATALINA_HOME/bin/startup.sh
- % $CATATALINA_HOME/bin/shutdown.sh}}}
+ % $CATALINA_HOME/bin/shutdown.sh
+ }}}
  
+ {{{
-     {{{% vi $CATATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml
+ % vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
  
@@ -85, +91 @@

  
  </nutch-conf>
  
- % $CATATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
  
  === Injecting ===
  
@@ -110, +117 @@

  
      You'll need to create a file fetcher.done in the segment directory and then run: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]],
[[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]].
        Assuming your index is at /index
+ {{{ 
-       {{{ % touch /index/segments/2005somesegment/fetcher.done
+ % touch /index/segments/2005somesegment/fetcher.done
  
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
  
  % bin/nutch generate /index/db/ /index/segments/2005somesegment/
  
- % bin/nutch fetch /index/segments/2005somesegment}}}
+ % bin/nutch fetch /index/segments/2005somesegment
+ }}}
  
        All the pages that were not crawled will be re-generated for fetch. If you fetched
lots of pages, and don't want to have to re-fetch them again, this is the best way.
  
@@ -146, +155 @@

  
  If you have a fast internet connection (> 10Mb/sec) your bottleneck will definitely be
in the machine itself (in fact you will need multiple machines to saturate the data pipe).
 Empirically I have found that the machine works well up to about 1000-1500 threads.  
  
- To get this to work on my Linux box I needed to set the ulimit to 65535 (ulimit -n 65535),
and I had to make sure that the DNS server could handle the load (we had to speak with our
colo to get them to shut off an artifical cap on the DNS servers).  Also, in order to get
the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise
we get a quick start followed by a very long slow tail of fetching).
+ To get this to work on my Linux box I needed to set the ulimit to 65535 (ulimit -n 65535),
and I had to make sure that the DNS server could handle the load (we had to speak with our
colo to get them to shut off an artificial cap on the DNS servers).  Also, in order to get
the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise
we get a quick start followed by a very long slow tail of fetching).
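  The tuning above can be sketched as follows. The property names fetcher.threads.fetch
and fetcher.threads.per.host are assumptions drawn from nutch-default.xml of that era;
verify them against your own configuration before relying on them.
{{{
% ulimit -n 65535
% vi conf/nutch-site.xml
  <!-- assumed property names; check nutch-default.xml for your release -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>100</value>
  </property>
}}}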
  
  To other users: please add to this with your own experiences, my own experience may be atypical.
  
@@ -208, +217 @@

      +.*
  
    3) By default the [[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|"file
plugin"]] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry
like this:
- 
+ {{{
      <property>
        <name>plugin.includes</name>
        <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>
+ }}}
  
  Now you can invoke the crawler and index all or part of your disk. The one remaining gotcha
is that Mozilla will '''not''' load file: URLs from a web page fetched over http, so if you
test with the Nutch web container running in Tomcat, clicking on results will do nothing,
because Mozilla by default refuses to load file: URLs. This is mentioned
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior
may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]]
(see security.checkloaduri). IE5 does not have this problem.
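  As a sketch, that preference can be flipped in a Mozilla user.js file (a local profile
tweak, and a security trade-off, since it lets http-served pages link to local files):
{{{
// user.js -- pref name as given in the Mozilla docs linked above
user_pref("security.checkloaduri", false);
}}}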
  
  ==== Nutch crawling parent directories for file protocol -> misconfigured URLFilters ====
  [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex you should put
the following in regex-urlfilter.txt :
  {{{
- 
  +^file:///c:/top/directory/
  -.
  }}}
@@ -243, +252 @@

  '''What is happening?'''
  
    By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes).
To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry
like this:
+ {{{
      <property>
        <name>http.content.limit</name>
      <value>150000</value>
      </property>
- 
+ }}}
    If you do not want to limit the size of downloaded documents, set http.content.limit to
a negative value:
+ {{{
      <property>
        <name>http.content.limit</name>
      <value>-1</value>
      </property>
+ }}}
  
  === Segment Handling ===
  
@@ -282, +294 @@

      <description>The host and port that the MapReduce job tracker runs at. If "local",
then jobs are run in-process as a single map and reduce task.</description>
    </property>
  
- 
    edit conf/mapred-default.xml
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
      <description>define mapred.map.tasks to be multiple of number of slave hosts
- </description>
+     </description>
    </property>
  
    <property>
@@ -298, +309 @@

    </property>
  
    create a file with slave host names
- 
-   {{{
+ {{{
    % echo localhost >> ~/.slaves
  % echo somemachine >> ~/.slaves
}}}
  
    start all ndfs & mapred daemons
-   {{{
+ {{{
    % bin/start-all.sh
    }}}
  
    create a directory with seed list file
-   {{{
+ {{{
    % mkdir seeds
    % echo http://www.cnn.com/ > seeds/urls
    }}}
  
-   copt the seed directory to ndfs
+   copy the seed directory to ndfs
-   {{{
+ {{{
    % bin/nutch ndfs -put seeds seeds
    }}}
  
    crawl a bit
-   {{{
+ {{{
    % bin/nutch crawl seeds -depth 3
    }}}
  
@@ -336, +346 @@

  ==== How to send commands to NDFS? ====
  
    list files in the root of NDFS
-   {{{
+ {{{
    [root@xxxxxx mapred]# bin/nutch ndfs -ls /
    050927 160948 parsing file:/mapred/conf/nutch-default.xml
    050927 160948 parsing file:/mapred/conf/nutch-site.xml
