manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Tue, 13 Aug 2013 22:32:25 GMT
I may have a scenario that could trigger the problem.

(1) Set the max hops for a job relatively low
(2) Crawl
(3) Increase the max hops
(4) Crawl again

I think under these conditions, it may be that we're not properly removing
the "hop count exceeded" states for documents that were encountered in the
first crawl that were too far from the seeds.
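
To make the suspected failure mode concrete, here is a toy model in Python. It is not ManifoldCF code: the `exceeded` set, the `clear_stale` flag, and the function names are all hypothetical stand-ins for whatever persistent "hop count exceeded" state the crawler keeps. It only shows how a mark left by the first crawl could keep a document out of the second crawl even after the limit is raised.

```python
from collections import deque


def hop_counts(links, seeds):
    """Breadth-first distance, in hops, of each reachable document from the seeds."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in dist:
                dist[target] = dist[doc] + 1
                queue.append(target)
    return dist


def crawl(links, seeds, max_hops, exceeded, clear_stale):
    """Return the documents indexed by one crawl (toy model only).

    `exceeded` is a hypothetical set that persists between crawls and holds
    documents marked "hop count exceeded". `clear_stale` controls whether
    marks that no longer apply under the current limit are removed first.
    """
    dist = hop_counts(links, seeds)
    if clear_stale:
        # Remove marks for documents that are now within the limit.
        now_ok = {d for d in exceeded if d in dist and dist[d] <= max_hops}
        exceeded.difference_update(now_ok)
    indexed = set()
    for doc, hops in dist.items():
        if doc in exceeded:
            continue  # an existing mark short-circuits any re-evaluation
        if hops > max_hops:
            exceeded.add(doc)
        else:
            indexed.add(doc)
    return indexed


links = {"seed": ["a"], "a": ["b"], "b": ["c"]}
exceeded = set()
first = crawl(links, ["seed"], 1, exceeded, clear_stale=False)   # low max hops
second = crawl(links, ["seed"], 3, exceeded, clear_stale=False)  # raised, marks kept
third = crawl(links, ["seed"], 3, exceeded, clear_stale=True)    # raised, marks cleared
print(sorted(first), sorted(second), sorted(third))
```

In this model the second crawl indexes the same two documents as the first, because "b" and "c" still carry their old marks; only the run that clears stale marks picks them up.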

If this is the problem, it should be easy to confirm.  I'm not quite sure
how to fix it yet though - need to do some research.

Karl


On Tue, Aug 13, 2013 at 10:16 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Erlend,
>
> I see what must be happening.  The intrinsiclink table already has the
> link to the skuespill document in it, and because of that, nothing in the
> hopcount world is even getting looked at.  So in a nutshell, the problem is
> that somehow the hopcount table's data was messed up, but now there's no
> good way to recover.
>
> I would really like to know how it got messed up in the first place, but
> since there's been a lot of activity on that machine it would be a
> challenge to come up with the exact sequence of events.  If you think you
> remember it, please write it down and maybe try it on your test instance.
> But for now, the simplest way to get the production instance back up and
> running is to do the following:
>
> - Note all the job settings and configuration
> - Delete the job
> - Recreate the job
> - Run the job
>
> Since there are very few documents in the job, it does not sound like much
> of a problem to do that.  Would this work for you?
> Karl
>
>
>
> On Tue, Aug 13, 2013 at 9:55 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
>> On 8/13/13 3:34 PM, Karl Wright wrote:
>>
>>> Can you enable hopcount debugging, and rerun?  Set
>>> "org.apache.manifoldcf.hopcount" to the value "DEBUG" in
>>> properties.xml.
>>>
>>
>> For some odd reason, MCF does not log anything more with this
>> configuration entry enabled:
>>   <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
>>
>> I have double-checked everything - the configuration file is successfully
>> read after I restart the Agent process and there are no old processes
>> running (checked with the ps command). MCF has responded to every change I
>> have made to properties.xml so far, but not this one.
>>
>> Here's the log output:
>>
>>  WARN 2013-08-13 15:50:57,350 (Worker thread '24') - Web: Unknown
>> robots.txt line from 'www.ibsen.uio.no:80': '<?xml version="1.0"
>> encoding="UTF-8"?>'
>>  WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>> robots.txt line from 'www.ibsen.uio.no:80': '<!DOCTYPE html'
>>  WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>> robots.txt line from 'www.ibsen.uio.no:80': '  PUBLIC "-//W3C//DTD XHTML
>> 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
>>  WARN 2013-08-13 15:50:57,352 (Worker thread '24') - Web: Unknown
>> robots.txt line from 'www.ibsen.uio.no:80': '<html
>> saxon-error-attribute="http://www.w3.org/1999/xhtml"
>> xml:lang="no"><head><meta http-equiv="Content-Type" content="text/html;
>> charset=utf-8"> </meta><title>Henrik Ibsens skrifter:
>> Feilmelding</title><link type="text/css" rel="stylesheet"
>> href="rammeverk.css" media="all"/><link type="text/css" rel="stylesheet"
>> href="vitnemouseover.css"/><link xmlns:tei="http://www.tei-c.org/ns/1.0"
>> xmlns:HIS="http://www.example.org/ns/HIS"
>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>> rel="icon" type="image/png" href="icons/favicon.ico"/><script
>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>> xmlns:HIS="http://www.example.org/ns/HIS"
>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>> src="http://code.jquery.com/jquery-1.6.2.min.js"
>> type="text/javascript">return void;</script><script
>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>> xmlns:HIS="http://www.example.org/ns/HIS"
>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>> src="jquery-ui-1.8.23.custom.min.js" type="text/javascript">return
>> void;</script><script type="text/javascript">'
>>  INFO 2013-08-13 15:51:02,571 (Worker thread '24') - WEB: FETCH URL|
>> http://www.ibsen.uio.no/|1376401862447+121|302|0|
>>  INFO 2013-08-13 15:51:05,383 (Worker thread '13') - WEB: FETCH URL|
>> http://www.ibsen.uio.no/forside.xhtml|1376401865366+16|200|11897|
>>
>>
>>
>>
>
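
As a quick sanity check for the properties.xml question in the quoted message, a small script can list the name/value pairs the file actually contains, which would catch a mistyped property name. This is only a sketch: the `read_properties` helper is hypothetical, and the `<configuration>` root element in the sample is an assumption about the file layout. The property line itself is the one quoted above.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the given XML text."""
    root = ET.fromstring(xml_text)
    return {
        prop.get("name"): prop.get("value")
        for prop in root.iter("property")
        if prop.get("name") is not None
    }


# Sample content; the <configuration> wrapper is assumed for illustration.
SAMPLE = """<configuration>
  <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
</configuration>"""

props = read_properties(SAMPLE)
print(props.get("org.apache.manifoldcf.hopcount"))
```

To check a real file, read its contents and pass them to `read_properties`. A missing key here would point to a typo in the name rather than a logging problem.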
