nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "GettingNutchRunningWithWindows" by FrankMcCown
Date Wed, 11 Feb 2009 18:25:40 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Added some clarifications

------------------------------------------------------------------------------
  
  === Download ===
  
- [http://lucene.apache.org/nutch/release/ Download] the release and extract anywhere on your
hard disk e.g. `c:\nutch-0.9`
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract on your hard
disk in a directory that ''does not'' contain a space in it (e.g., `c:\nutch-0.9`).  If the
directory does contain a space (e.g., `c:\my programs\nutch-0.9`), the Nutch scripts will
not work properly.
  
- Create an empty text file in your nutch directory e.g. `urls` and add the URLs of the sites
you want to crawl.
+ Create an empty text file (use any name you wish) in your nutch directory (e.g., `urls`)
and add the URLs of the sites you want to crawl.
  
- Add your URLs to the `crawl-urlfilter.txt` (e.g. `C:\nutch-0.9\conf\crawl-urlfilter.txt`).
An entry could look like this:
+ Add your URLs to the `crawl-urlfilter.txt` (e.g., `C:\nutch-0.9\conf\crawl-urlfilter.txt`).
An entry could look like this:
  {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}
  
- Load up cygwin and naviagte to your nutch directory.  When cygwin launches you'll usually
find yourself in your user folder (e.g. `C:\Documents and Settings\username`).
+ Load up cygwin and navigate to your `nutch` directory.  When cygwin launches, you'll usually
find yourself in your user folder (e.g. `C:\Documents and Settings\username`).
  
- If your workstation needs to go through a windows authentication proxy to get to the internet
then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/ NTLM
Authorization Proxy Server] to get through it.  You'll then need to edit the `nutch-site.xml`
file to point to the port opened by the app.
+ If your workstation needs to go through a Windows Authentication Proxy to get to the Internet
(this is not common), then you can use an application such as the [http://sourceforge.net/projects/ntlmaps/
NTLM Authorization Proxy Server] to get through it.  You'll then need to edit the `nutch-site.xml`
file to point to the port opened by the app.
  
  == Intranet Crawling ==
  
@@ -48, +48 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with the crawl.log
file.  Use this log file to debug any errors you might have.
+ then a folder called `crawl` is created in your `nutch` directory, along with the crawl.log
file.  Use this log file to debug any errors you might have.
  
  You'll need to delete or move the crawl directory before starting the crawl off again unless
you specify another path on the command above.
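
As a minimal sketch of the "move the crawl directory" step above, the helper below renames an existing crawl directory to a timestamped backup so the same `-dir` value can be reused. The function name `rotate_crawl_dir` and the backup naming scheme are assumptions for illustration; they are not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): move an existing crawl directory
# aside under a timestamped name so the next crawl can reuse the same -dir.
rotate_crawl_dir() {
  dir="${1:-crawl}"                       # directory to rotate; defaults to "crawl"
  if [ -d "$dir" ]; then
    backup="$dir.$(date +%Y%m%d%H%M%S)"   # e.g. crawl.20090211182540
    mv "$dir" "$backup"
    echo "$backup"                        # print where the old crawl went
  fi
}

# Usage, from the nutch directory in cygwin:
#   rotate_crawl_dir crawl
#   bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
```

If no crawl directory exists yet, the helper does nothing and prints nothing, so it should be safe to call before a first crawl as well.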
  
