manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Tue, 13 Aug 2013 22:42:56 GMT
I've opened a ticket - CONNECTORS-764.
Karl
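
The suspected scenario quoted below (crawl with a low max hop count, raise it, crawl again) can be illustrated with a toy model. This is only a sketch of the hypothesis, not ManifoldCF's actual hopcount code: the `crawl` function, the `blocked` set standing in for the "hop count exceeded" states, the `clear_stale` flag, and the link graph are all hypothetical names made up for the illustration.

```python
from collections import deque


def crawl(links, seeds, max_hops, blocked=None, clear_stale=False):
    """Breadth-first crawl over a link graph (toy model, not ManifoldCF code).

    links       -- dict mapping a document to the documents it links to
    blocked     -- documents marked "hop count exceeded" by an earlier crawl
    clear_stale -- if True, forget the old marks before crawling
    Returns (fetched, blocked) as two sets.
    """
    blocked = set() if blocked is None else set(blocked)
    if clear_stale:
        blocked.clear()
    fetched = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        doc, hops = queue.popleft()
        if doc in fetched or doc in blocked:
            continue  # a stale mark hides the document and everything behind it
        if hops > max_hops:
            blocked.add(doc)  # too far from the seeds for this job setting
            continue
        fetched.add(doc)
        for target in links.get(doc, []):
            queue.append((target, hops + 1))
    return fetched, blocked


if __name__ == "__main__":
    links = {"seed": ["a"], "a": ["b"], "b": ["c"]}  # hypothetical site
    first, marks = crawl(links, ["seed"], max_hops=1)
    second, _ = crawl(links, ["seed"], max_hops=3, blocked=marks)
    fixed, _ = crawl(links, ["seed"], max_hops=3, blocked=marks, clear_stale=True)
    print(sorted(first), sorted(second), sorted(fixed))
```

In this model the second crawl still fetches only `seed` and `a`, even though the raised limit would allow `b` and `c`, because the mark left on `b` by the first crawl is never re-evaluated. Clearing the marks before the second crawl lets all four documents through, which is roughly what deleting and recreating the job achieves.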



On Tue, Aug 13, 2013 at 6:32 PM, Karl Wright <daddywri@gmail.com> wrote:

> I may have a scenario that could trigger the problem.
>
> (1) Set the max hops for a job relatively low
> (2) Crawl
> (3) Increase the max hops
> (4) Crawl again
>
> I think under these conditions, it may be that we're not properly removing
> the "hop count exceeded" states for documents that were encountered in the
> first crawl that were too far from the seeds.
>
> If this is the problem, it should be easy to confirm.  I'm not quite sure
> how to fix it yet though - need to do some research.
>
> Karl
>
>
> On Tue, Aug 13, 2013 at 10:16 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Erlend,
>>
>> I see what must be happening.  The intrinsiclink table already has the
>> link to the skuespill document in it, and because of that, nothing in the
>> hopcount world is even getting looked at.  So in a nutshell, the problem is
>> that somehow the hopcount table's data was messed up, but now there's no
>> good way to recover.
>>
>> I would really like to know how it got messed up in the first place, but
>> since there's been a lot of activity on that machine it would be a
>> challenge to come up with the exact sequence of events.  If you think you
>> remember it, please write it down and maybe try it on your test instance.
>> But for now, the simplest way to get the production instance back up and
>> running is to do the following:
>>
>> - Note all the job settings and configuration
>> - Delete the job
>> - Recreate the job
>> - Run the job
>>
>> Since there are very few documents in the job, it does not sound like
>> much of a problem to do that.  Would this work for you?
>> Karl
>>
>>
>>
>> On Tue, Aug 13, 2013 at 9:55 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>
>>> On 8/13/13 3:34 PM, Karl Wright wrote:
>>>
>>>> Can you enable hopcount debugging, and rerun?
>>>> "org.apache.manifoldcf.hopcount" set to the value "DEBUG" in
>>>> properties.xml.
>>>>
>>>
>>> For some odd reason, MCF does not log anything more with this
>>> configuration entry enabled:
>>>   <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
>>>
>>> I have double-checked everything - the configuration file is successfully
>>> read after I restart the Agent process and there are no old processes
>>> running (checked with the ps command). MCF has responded to every change I
>>> have made to properties.xml so far, but not this one.
>>>
>>> Here's the log output:
>>>
>>>  WARN 2013-08-13 15:50:57,350 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<?xml version="1.0"
>>> encoding="UTF-8"?>'
>>>  WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<!DOCTYPE html'
>>>  WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '  PUBLIC "-//W3C//DTD
>>> XHTML 1.0 Transitional//EN"
>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
>>>  WARN 2013-08-13 15:50:57,352 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<html
>>> saxon-error-attribute="http://www.w3.org/1999/xhtml"
>>> xml:lang="no"><head><meta http-equiv="Content-Type" content="text/html;
>>> charset=utf-8"> </meta><title>Henrik Ibsens skrifter:
>>> Feilmelding</title><link type="text/css" rel="stylesheet"
>>> href="rammeverk.css" media="all"/><link type="text/css" rel="stylesheet"
>>> href="vitnemouseover.css"/><link xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> rel="icon" type="image/png" href="icons/favicon.ico"/><script
>>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> src="http://code.jquery.com/jquery-1.6.2.min.js"
>>> type="text/javascript">return void;</script><script
>>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> src="jquery-ui-1.8.23.custom.min.js" type="text/javascript">return
>>> void;</script><script type="text/javascript">'
>>>  INFO 2013-08-13 15:51:02,571 (Worker thread '24') - WEB: FETCH URL|
>>> http://www.ibsen.uio.no/|1376401862447+121|302|0|
>>>  INFO 2013-08-13 15:51:05,383 (Worker thread '13') - WEB: FETCH URL|
>>> http://www.ibsen.uio.no/forside.xhtml|1376401865366+16|200|11897|
>>>
>>>
>>>
>>>
>>
>
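
When a logging property in properties.xml appears to have no effect, one cheap thing to rule out is a mismatch in the property name itself. The sketch below parses a properties.xml-style document and looks up a property by its exact name. The `property_value` helper and the sample document are hypothetical, and the `<configuration>` root element is an assumption about the file layout; only the `<property name="..." value="..."/>` form comes from the thread.

```python
import xml.etree.ElementTree as ET


def property_value(xml_text, name):
    """Return the value of the <property> whose name matches exactly, else None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.get("name") == name:
            return prop.get("value")
    return None


# Hypothetical sample; a real file would be read from disk instead.
SAMPLE = """<configuration>
  <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
</configuration>"""

if __name__ == "__main__":
    print(property_value(SAMPLE, "org.apache.manifoldcf.hopcount"))
```

A result of `None` for the expected name would point at a typo or stray characters in the file. A result of `DEBUG` means the entry is present as written, so the cause lies elsewhere.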
