nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "HttpAuthenticationSchemes" by susam
Date Wed, 17 Jun 2009 17:22:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
Added TableOfContents and minor edits in Prerequisites and Optional sect

------------------------------------------------------------------------------
+ [[TableOfContents]]
+ 
  == Introduction ==
  This is a feature in Nutch that allows the crawler to authenticate itself to websites requiring
NTLM, Basic or Digest authentication. This feature cannot do POST-based authentication that
depends on cookies. More information on this can be found at: HttpPostAuthentication
  
@@ -18, +20 @@

  Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' are
very brief, this section explains them in a little more detail. In all the examples
below, the root element <auth-configuration> has been omitted for the sake of clarity.
  
  === Prerequisites ===
- In order use HTTP Authentication your Nutch install must be configured to use 'protocol-httpclient'
instead of the default 'protocol-http'. To make this change copy the 'plugin.includes' property
from 'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml'. Within that property
replace 'protocol-http' with 'protocol-httpclient'. If you have made no other changes it will
look as follows:
+ In order to use HTTP Authentication, the Nutch crawler must be configured to use 'protocol-httpclient'
instead of the default 'protocol-http'. To do this, copy the 'plugin.includes' property from 'conf/nutch-default.xml'
into 'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in the value
of the property. If you have made no other changes, it should look as follows:
  {{{
  <property>
    <name>plugin.includes</name>
@@ -35, +37 @@

  }}}
  
  === Optional ===
- By default Nutch use credential from 'httpclient-auth.xml'. If you wish to use a different
file you will need to copy the 'http.auth.file' property from 'conf/nutch-default.xml' and
paste it into 'conf/nutch-site.xml' and then modify the '<value>' element. The default
property appears as follows:
+ By default, Nutch uses credentials from 'conf/httpclient-auth.xml'. If you wish to use a
different file, place it in the 'conf' directory, copy the 'http.auth.file' property
from 'conf/nutch-default.xml' into 'conf/nutch-site.xml', and then edit the file
name in the '<value>' element accordingly. The default property appears
as follows:
  {{{
  <property>
    <name>http.auth.file</name>
@@ -43, +45 @@

    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
  }}}
- 
  
  === Crawling an Intranet with Default Authentication Scope ===
  Let's say all pages of an intranet are protected by Basic, Digest or NTLM authentication
and there is only one set of credentials to be used for all web pages in the intranet. In that
case, a configuration as described below is enough. This is also the simplest
configuration possible for authentication schemes.
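
  As a minimal sketch of such a configuration (with the root element omitted, as in the
other examples), a single set of credentials can be paired with a default scope. The element
names are assumed to match those documented in the comments in 'conf/httpclient-auth.xml';
the username and password are hypothetical placeholders.
  {{{
  <!-- Hypothetical credentials; replace with the real intranet account -->
  <credentials username="crawler" password="secret">
    <!-- Default scope: applies to any host, port, realm and scheme not matched elsewhere -->
    <default/>
  </credentials>
  }}}
  With only a default scope defined, the same credentials would be offered to every page that
asks for authentication, so this form is best suited to a crawl restricted to the intranet.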
