manifoldcf-user mailing list archives

From Karl Wright
Subject Re: Hop count problem
Date Mon, 12 Aug 2013 14:29:35 GMT
Hi Erlend,

The Document Status report shows these documents because they are still in
the queue.  There could be several reasons for this.  Documents that exceed
the hopcount by 1 level are allowed to remain in the queue for bookkeeping
purposes.  The "scheduled date" as given is only meaningful if the document is
in an active state; my guess is that these documents are not in fact in that
state, but rather in the state HOPCOUNT_EXCEEDED.  Can you include one
complete row from the Document Status report for one of the missing
documents?
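
As a side note on the scheduled date quoted below (01-01-1970 01:00:00.000): that string is what a timestamp of zero milliseconds since the Unix epoch looks like when rendered one hour ahead of UTC. The sketch below only illustrates that arithmetic. It assumes the report formats dates as day-month-year in the server's local time (UTC+1 here) and that a zero value means "not set"; neither assumption is confirmed in this thread, and `format_scheduled` is a hypothetical helper, not part of ManifoldCF.

```python
from datetime import datetime, timedelta, timezone


def format_scheduled(epoch_ms, utc_offset_hours=1):
    """Render epoch milliseconds as dd-MM-yyyy HH:mm:ss.SSS at a fixed UTC offset.

    Hypothetical helper for illustration; the offset of +1 hour is an assumption.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=tz)
    millis = epoch_ms % 1000
    return dt.strftime("%d-%m-%Y %H:%M:%S.") + f"{millis:03d}"


# A zero timestamp rendered at UTC+1.
print(format_scheduled(0))
```

If that reading is right, the odd date would be a symptom of the document not being in an active state rather than a cause of the problem.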

When you added documents to the seed list, what did the Simple History say
when they were fetched?  If they don't appear in the Simple History, they
should nevertheless have appeared in the log, with an explanation of why
they were excluded, provided you have connector debugging enabled.
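
The check suggested earlier in this thread (start at the seed, follow the link path toward the missing document, and stop at the first document that was not fetched) can be sketched as follows. This is only an illustration: it assumes you have copied the fetched URLs out of the Simple History by hand into a set, and `first_unfetched` and the example.org URLs are hypothetical, not ManifoldCF functions or real data.

```python
def first_unfetched(link_path, fetched_urls):
    """Return (hop, url) for the first URL on the path that was never fetched.

    link_path    -- URLs in order, from the seed (hop 0) to the missing document
    fetched_urls -- set of URLs that appear as fetched in the Simple History
    Returns None if every URL on the path was fetched.
    """
    for hop, url in enumerate(link_path):
        if url not in fetched_urls:
            return hop, url
    return None


# Hypothetical example data.
path = [
    "http://example.org/",                 # seed
    "http://example.org/section/",         # linked from the seed
    "http://example.org/section/doc.html", # the missing document
]
fetched = {"http://example.org/"}

print(first_unfetched(path, fetched))
```

The hop number of the first gap is the useful part: it can be compared against the job's hopcount limit, and the URL at that hop is the one to look for in the log.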


On Mon, Aug 12, 2013 at 10:19 AM, Erlend Garåsen wrote:

> I finally found the missing documents in simple history by going further
> back in time. They were deleted from Solr in May, which seems to indicate
> that they shouldn't be included for some reason I haven't found.
> The scheduled date from "document status" seems odd as well:
> 01-01-1970 01:00:00.000
> This date shows up for all the missing documents. Can this be the source
> of the problem?
> I changed the log level for HttpClient to DEBUG just in case.  No network
> or other problems. The missing documents are not being fetched:
> If MCF should try to refetch unavailable documents, we should expect to
> see entries about these hosts in the manifoldcf.log. The only entries are
> the two documents as I previously mentioned:
> Thus there is no need to enter one document after another in the seed
> list. Well, I did, but it did not help. The first links that appear on the
> main page and that I tried to include are:
>**xhtml <>
>**xhtml <>
> Erlend
> On 8/12/13 2:21 PM, Karl Wright wrote:
>> Hi Erlend,
>> I suggest you start with the seed document.  Did that get fetched?
>> Then, chase the path to the missing document.  Did those get fetched?
>> Stop with the FIRST document that did not get fetched, and see if you
>> can figure out why.
>> Thanks,
>> Karl
>> On Mon, Aug 12, 2013 at 8:16 AM, Erlend Garåsen wrote:
>>     On 8/12/13 1:31 PM, Karl Wright wrote:
>>         Based on your report that the test environment works OK, and the
>>         production environment does not, I expect there is something like
>>         this going on.  I know you attempted to fetch the intervening
>>         document from your test environment, but it is conceivable that the
>>         production environment is unable to get it.  You should see evidence
>>         of that in the simple history, if so.
>>     I have looked through the complete history regarding this host, and
>>     none of the other documents have ever been fetched. The only thing I
>>     can see is an illegal robots.txt file:
>>     robots parse <>
>>              HTML    0       1       Robots file contained HTML, skipped
>>     I don't think this robots file has stopped MCF from crawling the
>>     other documents since I can see this entry in our test
>>     environment as well. I even tried to disable robots.txt checks, but
>>     the problems persist.
>>     I forgot to mention that the hopcount mode is "Keep unreachable
>>     documents, forever".
>>     So, if I understand you correctly, there is no point in hacking the
>>     database since MCF will try to refetch unreachable documents anyway.
>>     I can of course enable HttpClient logging and check whether MCF
>>     tries to fetch these resources at all.
>>     Erlend
