manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Tue, 13 Aug 2013 10:04:45 GMT
If this is still 1.2, then these were the unlogged reasons why a document
could be skipped:

(1) Length too long
(2) Output connector rejects mime type
(3) Output connector rejects url
(4) Document is not considered indexable according to the job constraints
(the "indexable" regular expressions)
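
As a rough illustration only, the four checks above could be sketched like this in Python. This is not ManifoldCF code: `SkipRules`, `skip_reason`, and every field name are hypothetical, and the order of the checks is an assumption. The point is simply that each skip path returns a reason that can be logged, which is what 1.2 did not do for these cases.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Pattern, Set


@dataclass
class SkipRules:
    """Hypothetical container for the limits a job and its output connector impose."""
    max_length: int
    accepted_mime_types: Set[str]
    rejected_url_patterns: List[Pattern[str]]
    indexable_patterns: List[Pattern[str]]


def skip_reason(url: str, mime_type: str, length: int, rules: SkipRules) -> Optional[str]:
    """Return a loggable reason if the document would be skipped, else None."""
    if length > rules.max_length:
        return "length too long"
    if mime_type not in rules.accepted_mime_types:
        return "output connector rejects mime type"
    if any(p.search(url) for p in rules.rejected_url_patterns):
        return "output connector rejects url"
    if not any(p.search(url) for p in rules.indexable_patterns):
        return "not indexable according to the job constraints"
    return None


# Example values are made up for illustration.
rules = SkipRules(
    max_length=1_000_000,
    accepted_mime_types={"text/html", "text/plain"},
    rejected_url_patterns=[re.compile(r"\.pdf$")],
    indexable_patterns=[re.compile(r"\.xhtml$")],
)
print(skip_reason("http://www.ibsen.uio.no/skuespill.xhtml", "text/html", 11897, rules))
```

A document that passes all four checks yields `None`; anything else yields a short string that could be written to the log next to the URL.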

Karl



On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <daddywri@gmail.com> wrote:

> What version of ManifoldCF is this?
>
> I ask because I updated the logging output in 1.3 to capture a number of
> cases that previously did not log a reason why they were skipped.
>
> Karl
>
>
>
> On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
>>
>> OK, I have now changed the log level from INFO to DEBUG for connectors as
>> well. Here's the log:
>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>
>> The following entry indicates that one of the missing URLs is
>> found/extracted from a link:
>> DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document
>> 'http://www.ibsen.uio.no/forside.xhtml', found link to
>> 'http://www.ibsen.uio.no/skuespill.xhtml'
>>
>> Then the job just ends, and none of the extracted links are ever fetched.
>>
>> Erlend
>>
>>
>> On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>>
>>>
>>> Thanks, I will do that tomorrow and report back. I hope we will find a
>>> simple explanation. :)
>>>
>>> E
>>>
>>> On 8/12/13 5:07 PM, Karl Wright wrote:
>>>
>>>> Hi Erlend,
>>>>
>>>> You have wire logging (httpclient) enabled, which is useful for
>>>> debugging fetch issues, but you do not have connector debugging on.  To
>>>> turn it on, add this to properties.xml:
>>>>
>>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>>
>>>> thanks,
>>>> Karl
>>>>
>>>>
>>>> On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen
>>>> <e.f.garasen@usit.uio.no> wrote:
>>>>
>>>>     On 8/12/13 4:29 PM, Karl Wright wrote:
>>>>
>>>>         Hi Erlend,
>>>>
>>>>         The Document Status report shows these documents because they are
>>>>         still in the queue.  There could be several reasons for this.
>>>>         Documents that exceed the hopcount by 1 level are allowed to remain
>>>>         in the queue for bookkeeping purposes.  The "scheduled date" as
>>>>         given is only meaningful if the document is in an active state; my
>>>>         guess is that these documents are not in fact in that state, but
>>>>         rather in the state HOPCOUNT_EXCEEDED.  Can you include one complete
>>>>         row from the Document Status report for one of the missing
>>>>         documents?
>>>>
>>>>
>>>>     For "http://www.ibsen.uio.no/sakprosa.xhtml":
>>>>     Job: Ibsen
>>>>
>>>>     State: Out of scope
>>>>     Status: Hopcount exceeded
>>>>     Scheduled: 01-01-1970 01:00:00.000
>>>>     Scheduled action: Process
>>>>     Retry count: N/A
>>>>     Retry limit: N/A
>>>>
>>>>
>>>>         When you added documents to the seed list, what did the Simple
>>>>         History say when they were fetched?  If they don't appear in the
>>>>         simple history, they SHOULD have nevertheless appeared in the log,
>>>>         with an explanation of why they were excluded, provided you have
>>>>         connector debugging enabled.
>>>>
>>>>
>>>>     OK, here is the seed list:
>>>>     http://www.ibsen.uio.no/
>>>>     http://www.ibsen.uio.no/skuespill.xhtml
>>>>     http://www.ibsen.uio.no/dikt.xhtml
>>>>     http://www.ibsen.uio.no/brev.xhtml
>>>>     http://www.ibsen.uio.no/sakprosa.xhtml
>>>>     http://www.ibsen.uio.no/varia.xhtml
>>>>     http://www.ibsen.uio.no/undervisningsressurser.xhtml
>>>>
>>>>     Here are the results from the Simple History:
>>>>     08-12-2013 16:46:26.536  job end                 1368534065016(Ibsen)                         0      1
>>>>     08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK    11897  178
>>>>     08-12-2013 16:46:09.751  fetch                   http://www.ibsen.uio.no/forside.xhtml  200   11897  17
>>>>     08-12-2013 16:44:48.829  fetch                   http://www.ibsen.uio.no/               302   0      79484
>>>>     08-12-2013 16:44:48.727  robots parse            www.ibsen.uio.no:80                    HTML  0      2      Robots file contained HTML, skipped
>>>>     08-12-2013 16:44:46.574  job start               1368534065016(Ibsen)                         0      1
>>>>              1
>>>>
>>>>     HttpClient log:
>>>>     http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>
>>>>     Erlend
>>>>
>>>>
>>>>
>>>
>>
>
