nutch-dev mailing list archives

From Ken Krugler
Subject Re: problems http-client
Date Fri, 06 Jan 2006 21:35:57 GMT
>I have started to see this problem recently. topN=200000 per crawl, but
>fetched pages = 150000 - 170000, while error pages = 2000 - 5000. More than
>25000 pages are missing. This is reproducible with nutch0.7.1; both
>protocol-http and protocol-httpclient are included.

Depending on how you have Nutch configured, redirects can result in 
pages getting skipped, if the redirect count exceeds the 
(configurable) limit.
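
As a rough illustration of that effect, here is a minimal Java sketch. It is not Nutch's fetcher code: the class RedirectLimitSketch, the resolve method, the maxRedirects parameter and the example URLs are all hypothetical. It only shows how a URL whose redirect chain exceeds the limit could drop out of the fetched count without being counted as an error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a redirect limit; not Nutch's actual fetcher code. */
public class RedirectLimitSketch {

    /**
     * Follows redirects starting at url. Returns the final URL, or an empty
     * Optional if more than maxRedirects hops would be needed.
     */
    public static Optional<String> resolve(Map<String, String> redirects,
                                           String url, int maxRedirects) {
        String current = url;
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                return Optional.empty(); // limit exceeded: the page is skipped
            }
            current = redirects.get(current);
            hops++;
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        // Assumed redirect table: a -> b -> c -> d (three hops from a).
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/", "http://b.example/");
        redirects.put("http://b.example/", "http://c.example/");
        redirects.put("http://c.example/", "http://d.example/");

        String[] fetchList = {
            "http://a.example/", "http://c.example/", "http://d.example/"
        };
        int fetched = 0;
        int skipped = 0;
        for (String url : fetchList) {
            if (resolve(redirects, url, 2).isPresent()) {
                fetched++;
            } else {
                skipped++; // neither fetched nor reported as an error
            }
        }
        System.out.println("fetched=" + fetched + " skipped=" + skipped);
    }
}
```

With a limit of 2, the first URL needs three hops and is skipped, so the run prints fetched=2 skipped=1 for a fetch list of three entries.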

I don't know whether a "not found" HTTP status results in the page 
being skipped without being reported as an error.

>I also see lots of "Response content length is not known" in the log, but
>I can't find where it comes from.  Which class logs this message?

This is coming from the Jakarta commons httpclient code:


-- Ken

>On 12/19/05, Stefan Groschupf wrote:
>>  Hi there,
>>  is there someone out there that can confirm a problem we discovered?
>>  We were wondering why not all pages of a generated segment were
>>  fetched. The strangest thing was that the sum of errors and
>>  successful pages was never the same as what we defined in topN when
>>  generating a segment.
>>  First we discovered a problem with the segment size, but I can not
>>  reproduce the problem anymore with the latest trunk code. :-/
>>  Very strange, since I don't think anything changed, but I
>>  was able to reproduce that the size of the segment is around 50%
>>  of the defined size (topN) on 2 different map reduce installations.
>>  Anyway, today we noted that when fetching with http-client the sum of
>>  errors and fetched pages is much less than the size defined when
>>  generating the segment.
>>  Changing to protocol-http solves the problem.
>>  Has anyone else noticed this behavior?
>>  Thanks for comments.
>>  Stefan

Ken Krugler
Krugle, Inc.
+1 530-470-9200
