manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Mon, 12 Aug 2013 12:21:56 GMT
Hi Erlend,

I suggest you start with the seed document.  Did that get fetched?  Then,
chase the path to the missing document.  Did those get fetched?  Stop with
the FIRST document that did not get fetched, and see if you can figure out
why.

Thanks,
Karl
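
The procedure Karl describes (walk from the seed toward the missing document and stop at the first one that was not fetched) can be sketched roughly as below. This is only an illustration, not part of ManifoldCF: the function name `first_unfetched`, the example URLs, and the idea of copying the fetched URLs out of the Simple History report by hand are all assumptions.

```python
from typing import Iterable, Optional, Sequence


def first_unfetched(path: Sequence[str], fetched: Iterable[str]) -> Optional[str]:
    """Return the first URL on the link path that was never fetched.

    path    -- URLs in link order, from the seed to the missing document
               (hypothetical input, assembled by hand)
    fetched -- URLs that show a successful fetch in the history
               (hypothetical input, e.g. copied from the Simple History report)
    """
    fetched_set = set(fetched)
    for url in path:
        if url not in fetched_set:
            # Stop at the FIRST gap; everything after it is unreachable anyway.
            return url
    return None  # every document on the path was fetched


# Hypothetical example data, not taken from the thread.
path = [
    "http://example.org/",            # seed
    "http://example.org/a.html",      # linked from the seed
    "http://example.org/a/b.html",    # the missing document
]
fetched = ["http://example.org/"]

culprit = first_unfetched(path, fetched)
print(culprit)
```

Whatever URL this returns is the one whose history entries deserve a closer look, since every later document on the path depends on it having been fetched.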



On Mon, Aug 12, 2013 at 8:16 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:

> On 8/12/13 1:31 PM, Karl Wright wrote:
>
>> Based on your report that the test environment works OK, and the
>> production environment does not, I expect there is something like this
>> going on.  I know you attempted to fetch the intervening document from
>> your test environment, but it is conceivable that the production
>> environment is unable to get it.  You should see evidence of that in the
>> simple history, if so.
>>
>
> I have looked through the complete history regarding this host, and none
> of the other documents have ever been fetched. The only thing I can see is
> an illegal robots.txt file:
> robots parse    www.ibsen.uio.no:80
>         HTML    0       1       Robots file contained HTML, skipped
>
> I don't think this robots file has stopped MCF from crawling the other
> documents, since I can see this entry in our test environment as well. I
> even tried to disable robots.txt checks, but the problems persist.
>
> I forgot to mention that the hopcount mode is "Keep unreachable documents,
> forever".
>
> So, if I understand you correctly, there is no point in hacking the
> database since MCF will try to refetch unreachable documents anyway. I can
> of course enable HttpClient logging and check whether MCF tries to fetch
> these resources at all.
>
> Erlend
>
>
