I've opened a ticket - CONNECTORS-764.

On Tue, Aug 13, 2013 at 6:32 PM, Karl Wright <daddywri@gmail.com> wrote:
I may have a scenario that could trigger the problem.

(1) Set the max hops for a job relatively low
(2) Crawl
(3) Increase the max hops
(4) Crawl again

I think under these conditions, it may be that we're not properly removing the "hop count exceeded" states for documents that were encountered in the first crawl that were too far from the seeds.

If this is the problem, it should be easy to confirm.  I'm not quite sure how to fix it yet though - need to do some research.
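
The suspected sequence can be illustrated with a toy model. This is only a sketch and not ManifoldCF's actual hopcount code: the `crawl` function, the `blocked` set (standing in for the "hop count exceeded" state) and the `clear_stale` flag are all hypothetical names introduced for illustration.

```python
from collections import deque

def crawl(links, seeds, max_hops, blocked=None, clear_stale=False):
    """Toy breadth-first crawl. Returns (fetched, blocked).

    `blocked` models "hop count exceeded" markers carried over from an
    earlier crawl. `clear_stale` models discarding those markers before
    crawling again. Both are assumptions, not ManifoldCF internals.
    """
    blocked = set() if (blocked is None or clear_stale) else set(blocked)
    fetched = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        doc, hops = queue.popleft()
        if doc in fetched:
            continue
        if doc in blocked:
            # Stale marker: the document is never re-evaluated.
            continue
        if hops > max_hops:
            blocked.add(doc)
            continue
        fetched.add(doc)
        for target in links.get(doc, ()):
            queue.append((target, hops + 1))
    return fetched, blocked

links = {"seed": ["a"], "a": ["b"], "b": ["c"]}

# (1)+(2): low max hops, crawl
first_fetched, first_blocked = crawl(links, ["seed"], max_hops=1)
# (3)+(4): raise max hops, crawl again with the old markers still in place
second_fetched, _ = crawl(links, ["seed"], max_hops=3, blocked=first_blocked)
# Same second crawl, but with the old markers cleared first
fixed_fetched, _ = crawl(links, ["seed"], max_hops=3,
                         blocked=first_blocked, clear_stale=True)
```

In this model the second crawl still skips "b" and everything behind it, even though the new limit would allow them, while clearing the markers first lets all four documents through.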


On Tue, Aug 13, 2013 at 10:16 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi Erlend,

I see what must be happening.  The intrinsiclink table already has the link to the skuespill document in it, and because of that, nothing in the hopcount world is even getting looked at.  So in a nutshell, the problem is that somehow the hopcount table's data was messed up, but now there's no good way to recover.

I would really like to know how it got messed up in the first place, but since there's been a lot of activity on that machine it would be a challenge to come up with the exact sequence of events.  If you think you remember it, please write it down and maybe try it on your test instance.  But for now, the simplest way to get the production instance back up and running is to do the following:

- Note all the job settings and configuration
- Delete the job
- Recreate the job
- Run the job

Since there are very few documents in the job, it does not sound like much of a problem to do that.  Would this work for you?

On Tue, Aug 13, 2013 at 9:55 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
On 8/13/13 3:34 PM, Karl Wright wrote:

Can you enable hopcount debugging and rerun?
Set "org.apache.manifoldcf.hopcount" to the value "DEBUG" in properties.xml.

For some odd reason, MCF does not log anything more with this configuration entry enabled:
  <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>

I have double-checked everything - the configuration file is successfully read after I restart the Agent process and there are no old processes running (checked with the ps command). MCF has responded to every change I have made to properties.xml so far, but not this one.
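
One way to double-check what a properties.xml file actually contains is to parse it and list its property entries. The following is a minimal sketch using only the Python standard library. The `read_properties` helper is a hypothetical name, and the root element in the sample text is an assumption; the code only relies on `<property name="..." value="..."/>` elements like the one quoted above.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every <property> element in the text."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in root.iter("property")}

# Sample text; the <configuration> root element is an assumption.
sample = """<configuration>
  <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
</configuration>"""

props = read_properties(sample)
hopcount_level = props.get("org.apache.manifoldcf.hopcount")
```

This only confirms that the entry is present and well-formed in the file. It says nothing about whether the running process picked it up.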

Here's the log output:

 WARN 2013-08-13 15:50:57,350 (Worker thread '24') - Web: Unknown robots.txt line from 'www.ibsen.uio.no:80': '<?xml version="1.0" encoding="UTF-8"?>'
 WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown robots.txt line from 'www.ibsen.uio.no:80': '<!DOCTYPE html'
 WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown robots.txt line from 'www.ibsen.uio.no:80': '  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
 WARN 2013-08-13 15:50:57,352 (Worker thread '24') - Web: Unknown robots.txt line from 'www.ibsen.uio.no:80': '<html saxon-error-attribute="http://www.w3.org/1999/xhtml" xml:lang="no"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </meta><title>Henrik Ibsens skrifter: Feilmelding</title><link type="text/css" rel="stylesheet" href="rammeverk.css" media="all"/><link type="text/css" rel="stylesheet" href="vitnemouseover.css"/><link xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:HIS="http://www.example.org/ns/HIS" xmlns:exist="http://exist.sourceforge.net/NS/exist" rel="icon" type="image/png" href="icons/favicon.ico"/><script xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:HIS="http://www.example.org/ns/HIS" xmlns:exist="http://exist.sourceforge.net/NS/exist" src="http://code.jquery.com/jquery-1.6.2.min.js" type="text/javascript">return void;</script><script xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:HIS="http://www.example.org/ns/HIS" xmlns:exist="http://exist.sourceforge.net/NS/exist" src="jquery-ui-1.8.23.custom.min.js" type="text/javascript">return void;</script><script type="text/javascript">'
 INFO 2013-08-13 15:51:02,571 (Worker thread '24') - WEB: FETCH URL|http://www.ibsen.uio.no/|1376401862447+121|302|0|
 INFO 2013-08-13 15:51:05,383 (Worker thread '13') - WEB: FETCH URL|http://www.ibsen.uio.no/forside.xhtml|1376401865366+16|200|11897|
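
The FETCH URL lines above can be split into fields for easier comparison between runs. This is a sketch under stated assumptions: the field meanings (start time in milliseconds, elapsed milliseconds, HTTP status, byte count) are inferred from the two sample lines and are not confirmed by the thread, and `parse_fetch_line` is a hypothetical helper.

```python
def parse_fetch_line(line):
    """Parse a 'WEB: FETCH URL|...' log line into a dict, or return None.

    Assumed layout: URL|<start_ms>+<elapsed_ms>|<status>|<bytes>|
    """
    marker = "WEB: FETCH URL|"
    idx = line.find(marker)
    if idx < 0:
        return None
    rest = line[idx + len(marker):].rstrip("|")
    # Split from the right so a '|' inside the URL would not break parsing.
    url, timing, status, size = rest.rsplit("|", 3)
    start_ms, elapsed_ms = timing.split("+")
    return {
        "url": url,
        "start_ms": int(start_ms),
        "elapsed_ms": int(elapsed_ms),
        "status": int(status),
        "bytes": int(size),
    }

line = ("INFO 2013-08-13 15:51:02,571 (Worker thread '24') - "
        "WEB: FETCH URL|http://www.ibsen.uio.no/|1376401862447+121|302|0|")
entry = parse_fetch_line(line)
```

Read this way, the first fetch is a 302 with no body and the second is a 200 for forside.xhtml.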