nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-268) Generator and lib-http use different definitions of "unique host"
Date Fri, 12 May 2006 23:29:10 GMT

Andrzej Bialecki commented on NUTCH-268:

I forgot to add: if we change Generator to use IP addresses, then we should warn users that
running a local caching DNS server becomes practically mandatory - otherwise Generator would
be very slow, not to mention that it would generate a lot of DNS traffic to external servers.
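
To illustrate why caching matters here, the following is a minimal sketch (not Nutch code) of an in-process lookup cache: each distinct host name reaches the underlying resolver at most once. The class name CachingResolver and its methods are hypothetical, and an in-process cache like this is only an illustration of the idea, not a substitute for the local caching DNS server recommended above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: remembers host -> IP answers so that
// each distinct host name is passed to the underlying resolver at most once.
class CachingResolver {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;

    // The lookup function is injectable so the cache can be exercised without network access.
    CachingResolver(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    // Resolver backed by the JDK; returns null when the host cannot be resolved.
    static String systemLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // Returns the cached address, calling the underlying lookup only on a cache miss.
    // A null result (failed lookup) is not cached, so it will be retried next time.
    String resolve(String host) {
        return cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), lookup);
    }
}
```

With real DNS one would construct it as new CachingResolver(CachingResolver::systemLookup); without any cache, every URL in a fetchlist would trigger its own lookup.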

> Generator and lib-http use different definitions of "unique host"
> -----------------------------------------------------------------
>          Key: NUTCH-268
>          URL:
>      Project: Nutch
>         Type: Bug

>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 
>      Fix For: 0.8-dev

> Generator uses a host name, as extracted from the URL, to determine the maximum number of
> URLs from a unique host (when the corresponding property is set > 0). This is supposed to
> prevent fetchlists from becoming dominated by URLs coming from the same hosts,
> which in turn would clash with "politeness" rules.
> However, the http plugins (lib-http HttpBase.blockAddr) don't use the host name; instead they
> use its IP address (explicitly doing a DNS lookup on the host name extracted from the URL). This
> leads to the following undesirable behavior:
> * if a DNS name resolves to different IPs (round-robin balancing), then technically we
>   are in violation of the "politeness" rules, because lib-http doesn't see this as a conflict
>   and permits concurrent accesses to the same host name.
> * if different DNS names resolve to the same IP address (very common: CNAMEs, subdomains,
>   web hosting, etc.), then the purpose of the per-host limit is defeated, because lib-http
>   will block more frequently than intended, leading to excessive numbers of
>   "Exceeded http.max.delays" errors.
> Proposed solution: synchronize Generator and lib-http in their interpretation of "unique
> host". Introduce a boolean property that instructs both Generator and lib-http to use
> either IP addresses or host names as "unique hosts".
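
The proposed solution could be sketched roughly as follows: a single key function, controlled by one boolean, that both the generator-side limit and the fetcher-side blocking would call. This is an illustration under stated assumptions, not the actual Nutch change: the class UniqueHostKey, the byIp flag (standing in for the proposed boolean property, whose name the issue does not give), and the injected resolver are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Nutch code: one shared definition of "unique host".
class UniqueHostKey {
    private final boolean byIp;                       // hypothetical stand-in for the proposed boolean property
    private final Function<String, String> resolver;  // host name -> IP address, or null if unresolvable

    UniqueHostKey(boolean byIp, Function<String, String> resolver) {
        this.byIp = byIp;
        this.resolver = resolver;
    }

    // Returns the key under which a URL would be counted (generator side)
    // or blocked (fetcher side).
    String keyFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!byIp) {
            return host;
        }
        String ip = resolver.apply(host);
        // Assumption: fall back to the host name when resolution fails.
        return ip != null ? ip : host;
    }

    // Keeps at most maxPerKey URLs per key, preserving input order.
    List<String> limit(List<String> urls, int maxPerKey) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            int seen = counts.merge(keyFor(url), 1, Integer::sum);
            if (seen <= maxPerKey) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

If two host names resolve to the same address, a limit of 1 keeps one URL per name when byIp is false but only one URL in total when byIp is true. The point of the proposal is that both components see the same grouping, whichever mode is chosen.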

This message is automatically generated by JIRA.