nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "HttpAuthenticationSchemes" by susam
Date Mon, 15 Mar 2010 21:37:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "HttpAuthenticationSchemes" page has been changed by susam.
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=18&rev2=19

--------------------------------------------------

  === Important Points ===
   1. For the <authscope> tag, the 'host' and 'port' attributes should always be specified. The
'realm' and 'scheme' attributes may or may not be specified, depending on your needs. If you are
tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for
any host and any port for that realm/scheme, please use the 'default' tag instead. That is what
the 'default' tag is meant for.
   1. The same authentication scope should not be defined twice, as separate <authscope>
tags under different <credentials> tags. If this is done by mistake, the credentials
for the last defined <authscope> tag are used. This is because the XML parsing
code reads the file from top to bottom and sets the credentials for each authentication scope.
If the same authentication scope is encountered again, its credentials are overwritten with the
new ones. Do not rely on this behavior, as it might change with further
development.
-  1. Do not define multiple authscope tags with the same host, port but different realms
if the server requires NTLM authentication. This means there should not be multiple tags with
same host, port, scheme="NTLM" but different realms. If you are omitting the scheme attribute
and the server requires NTLM authentication, then there should not be multiple tags with same
host, port but different realms. This is discussed more in the next section.
+  1. Do not define multiple authscope tags with the same host, port but different realms
if the server requires NTLM authentication. This means there should not be multiple authscope
tags with same host, port, scheme="NTLM" but different realms. If you are omitting the scheme
attribute and the server requires NTLM authentication, then there should not be multiple tags
with same host, port but different realms. This is discussed more in the next section.
   1. If you are using the NTLM scheme, you should also set the 'http.agent.host' property in
conf/nutch-site.xml.
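
As an illustration of the first point above, a minimal sketch of an authentication configuration might look like the following. It assumes the <auth-configuration>, <credentials>, <default> and <authscope> element layout of conf/httpclient-auth.xml; the hosts, ports, realm, usernames and passwords are hypothetical placeholders and do not come from this page.

```xml
<?xml version="1.0"?>
<!-- Sketch only: all hosts, ports, realms and credentials are hypothetical. -->
<auth-configuration>

  <!-- Fallback credentials for any host and port: use 'default' here,
       not an authscope that omits 'host' and 'port'. -->
  <credentials username="crawler" password="changeme">
    <default/>
  </credentials>

  <!-- Scoped credentials: 'host' and 'port' are always given;
       'realm' and 'scheme' are optional. -->
  <credentials username="admin" password="changeme-too">
    <authscope host="intranet.example.com" port="80"/>
    <authscope host="blog.example.com" port="8080" realm="blog" scheme="basic"/>
  </credentials>

</auth-configuration>
```

Each authentication scope appears only once in this sketch, so the result does not depend on the order in which the file is parsed.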
  
  === A note on NTLM domains ===
  NTLM does not use the concept of realms. Therefore, multiple realms for a web server cannot
be defined as different authentication scopes for the same web server requiring NTLM authentication.
There should be exactly one authscope tag with the NTLM scheme for a particular
web server. The authentication domain should be specified as the value of the 'realm' attribute.
NTLM authentication also requires the name or IP address of the host on which the crawler
is running. Thus, 'http.agent.host' should be set properly.
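
The two NTLM settings described above could be sketched as follows. This is an illustration under assumptions rather than a tested configuration: the host names, domain name, username and password are hypothetical, and the second fragment assumes the same <credentials>/<authscope> layout used for the other schemes.

```xml
<!-- Fragment for conf/nutch-site.xml, inside the <configuration> element:
     the name or IP address of the host running the crawler (hypothetical value). -->
<property>
  <name>http.agent.host</name>
  <value>crawler.example.com</value>
</property>

<!-- Fragment for the authentication configuration file: a single authscope
     for the NTLM server, with the NTLM domain given as the 'realm' value
     (all values hypothetical). -->
<credentials username="jdoe" password="changeme">
  <authscope host="intranet.example.com" port="80" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
</credentials>
```

Only one authscope is defined for the NTLM server, so no second tag with the same host and port but a different 'realm' can conflict with it.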
  
  == Underlying HttpClient Library ==
- 'protocol-httpclient' is based on [[http://jakarta.apache.org/httpcomponents/httpclient-3.x/|Jakarta
Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Given
that only one scheme may be used at a time for authenticating, it must choose which scheme
to use. To accompish this, it uses an order of preference to select the correct authentication
scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior
during authentication, you might want to read the [[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html|HttpClient
Authentication Guide]].
+ 'protocol-httpclient' is based on [[http://hc.apache.org/httpclient-3.x/|Jakarta Commons
HttpClient]]. Some servers support multiple schemes for authenticating users. Given that only
one scheme may be used at a time for authenticating, it must choose which scheme to use. To
accomplish this, it uses an order of preference to select the correct authentication scheme.
By default this order is: NTLM, Digest, Basic. For more information on the behavior during
authentication, you might want to read the [[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient
Authentication Guide]].
  
  == Need Help? ==
  If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user
mailing list]]. The author of this work, Susam Pal, usually responds to mail about authentication
problems. The DEBUG logs may be required to troubleshoot the problem, so you must enable the
debug log for 'protocol-httpclient' before running the crawler. To enable the debug log for 'protocol-httpclient',
open 'conf/log4j.properties' and add the following line:
