manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Tue, 13 Aug 2013 12:47:44 GMT
Looks like you need to re-enable connector debugging before we can see
anything.

Also, does the missing document (skuespill) appear in the Document Status
report after the crawl?  Can you include that here if it does?  (I am
betting it does not...)

Karl
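
A quick way to confirm that connector debugging is actually switched on is to
parse properties.xml and look for the logging property. The following is only
a minimal sketch: the function name and the sample document are made up for
illustration, and it assumes the file keeps its <property> elements under a
single root element.

```python
import xml.etree.ElementTree as ET

# Property name quoted elsewhere in this thread for connector logging.
CONNECTOR_LOG_PROPERTY = "org.apache.manifoldcf.connectors"


def connector_log_level(properties_xml: str):
    """Return the configured connector log level, or None if it is not set.

    Hypothetical helper: it only reads the XML text it is given.
    """
    root = ET.fromstring(properties_xml)
    for prop in root.iter("property"):
        if prop.get("name") == CONNECTOR_LOG_PROPERTY:
            return prop.get("value")
    return None


# Illustrative content; a real properties.xml carries many more settings.
sample = """<configuration>
  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
</configuration>"""

print(connector_log_level(sample))
```

If the function returns None or anything other than DEBUG, the WEB: lines
showing link extraction would not be expected in manifoldcf.log.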



On Tue, Aug 13, 2013 at 8:43 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:

>
> I couldn't find more information in the log after the upgrade. Yes, I'm
> running version 1.3 now since I had to log in after the upgrade:
> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>
> I tried to fetch one of the missing documents by using Curl from our prod
> server. Looks like an OK response to me even though this is Curl and not
> HttpClient:
>
> -bash-3.2$ curl -vvv \
>     -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; sok-core@usit.uio.no)" \
>     "http://www.ibsen.uio.no/sakprosa.xhtml"
> * About to connect() to www.ibsen.uio.no port 80
> *   Trying 129.240.7.27... connected
> * Connected to www.ibsen.uio.no (129.240.7.27) port 80
> > GET /sakprosa.xhtml HTTP/1.1
> > Host: www.ibsen.uio.no
> > Accept: */*
> > User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; sok-core@usit.uio.no)
> >
> < HTTP/1.1 200 OK
> < Date: Tue, 13 Aug 2013 12:40:02 GMT
> < Server: Apache-Coyote/1.1
> < X-Cocoon-Version: 2.1.12-dev
> < Last-Modified: Fri, 09 Aug 2013 09:57:43 GMT
> < Content-Type: text/html
> < Content-Length: 11209
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html
> [...]
>
> E
>
>
> On 8/13/13 12:04 PM, Karl Wright wrote:
>
>> If this is still 1.2, then these were the unlogged reasons why a
>> document could be skipped:
>>
>> (1) Length too long
>> (2) Output connector rejects mime type
>> (3) Output connector rejects url
>> (4) Document is not considered indexable according to the job
>> constraints (the "indexable" regular expressions)
>>
>> Karl
>>
>>
>>
>> On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>     What version of ManifoldCF is this?
>>
>>     I ask because I updated the logging output in 1.3 to capture a
>>     number of cases that previously did not log a reason why they were
>>     skipped.
>>
>>     Karl
>>
>>
>>
>>     On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen
>>     <e.f.garasen@usit.uio.no> wrote:
>>
>>
>>         OK, I have now changed the log level from INFO to DEBUG for
>>         connectors as well. Here's the log:
>>         http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>
>>         The following entry indicates that one of the missing URLs is
>>         found/extracted from a link:
>>         DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html
>>         document 'http://www.ibsen.uio.no/forside.xhtml', found link to
>>         'http://www.ibsen.uio.no/skuespill.xhtml'
>>
>>         Then the job just ends, and none of the extracted links are ever
>>         fetched.
>>
>>         Erlend
>>
>>
>>         On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>>
>>
>>             Thanks, I will do that tomorrow and report back. I hope we
>>             will find a simple explanation. :)
>>
>>             E
>>
>>             On 8/12/13 5:07 PM, Karl Wright wrote:
>>
>>                 Hi Erlend,
>>
>>                 You have wire logging (httpclient) enabled, which is useful for
>>                 debugging fetch issues, but you do not have connector debugging
>>                 on.  To turn it on, add this to properties.xml:
>>
>>                 <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>
>>                 thanks,
>>                 Karl
>>
>>
>>                 On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen
>>                 <e.f.garasen@usit.uio.no> wrote:
>>
>>                      On 8/12/13 4:29 PM, Karl Wright wrote:
>>
>>                          Hi Erlend,
>>
>>                          The Document Status report shows these documents
>>                          because they are still in the queue.  The reasons
>>                          for this could be several.  Documents that exceed
>>                          the hopcount by 1 level are allowed to remain in the
>>                          queue for bookkeeping purposes.  "Scheduled date" as
>>                          given is only meaningful if the document is in an
>>                          active state; my guess is that these documents are
>>                          not in fact in that state, but rather in the state
>>                          HOPCOUNT_EXCEEDED.  Can you include one complete row
>>                          from the Document Status report for one of the
>>                          missing documents?
>>
>>
>>                      For "http://www.ibsen.uio.no/sakprosa.xhtml":
>>                      Job: Ibsen
>>
>>                      State: Out of scope
>>                      Status: Hopcount exceeded
>>                      Scheduled: 01-01-1970 01:00:00.000
>>                      Scheduled action: Process
>>                      Retry count: N/A
>>                      Retry limit: N/A
>>
>>
>>                          When you added documents to the seed list, what did
>>                          the Simple History say when they were fetched?  If
>>                          they don't appear in the Simple History, they SHOULD
>>                          nevertheless have appeared in the log, with an
>>                          explanation of why they were excluded, provided you
>>                          have connector debugging enabled.
>>
>>
>>                      OK, here is the seed list:
>>                      http://www.ibsen.uio.no/
>>                      http://www.ibsen.uio.no/skuespill.xhtml
>>                      http://www.ibsen.uio.no/dikt.xhtml
>>                      http://www.ibsen.uio.no/brev.xhtml
>>                      http://www.ibsen.uio.no/sakprosa.xhtml
>>                      http://www.ibsen.uio.no/varia.xhtml
>>                      http://www.ibsen.uio.no/undervisningsressurser.xhtml
>>
>>                      Here are the results from Simple History:
>>                      08-12-2013 16:46:26.536  job end                 1368534065016(Ibsen)                         0      1
>>                      08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK    11897  178
>>                      08-12-2013 16:46:09.751  fetch                   http://www.ibsen.uio.no/forside.xhtml  200   11897  17
>>                      08-12-2013 16:44:48.829  fetch                   http://www.ibsen.uio.no/               302   0      79484
>>                      08-12-2013 16:44:48.727  robots parse            www.ibsen.uio.no:80                    HTML  0      2      Robots file contained HTML, skipped
>>                      08-12-2013 16:44:46.574  job start               1368534065016(Ibsen)                         0      1
>>                      1
>>
>>                      HttpClient log:
>>                      http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>
>>                      Erlend
>>
>>
>>
>>
>>
>>
>>
>
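
The four unlogged skip reasons listed earlier in the thread can be pictured as
a chain of checks that each candidate document has to pass. The sketch below
is an illustration only, not ManifoldCF code: the Candidate class, the
skip_reason function, the order of the checks, and the way the "indexable"
expressions are modelled (match at least one pattern) are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Pattern


@dataclass
class Candidate:
    """Hypothetical stand-in for a fetched document."""
    url: str
    mime_type: str
    length: int


def skip_reason(
    doc: Candidate,
    max_length: int,
    accepts_mime: Callable[[str], bool],
    accepts_url: Callable[[str], bool],
    indexable_patterns: List[Pattern[str]],
) -> Optional[str]:
    """Return why the document would be skipped, or None if it passes."""
    if doc.length > max_length:
        return "length too long"
    if not accepts_mime(doc.mime_type):
        return "output connector rejects mime type"
    if not accepts_url(doc.url):
        return "output connector rejects url"
    if not any(p.search(doc.url) for p in indexable_patterns):
        return "not indexable according to the job constraints"
    return None


doc = Candidate("http://www.ibsen.uio.no/sakprosa.xhtml", "text/html", 11209)
reason = skip_reason(
    doc,
    max_length=1_000_000,
    accepts_mime=lambda mime: mime.startswith("text/"),
    accepts_url=lambda url: True,
    indexable_patterns=[re.compile(r"\.xhtml$")],
)
print(reason)
```

Logging the returned reason for every skipped document is the kind of output
that makes a silently missing page traceable.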

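The hop-count behaviour described in the thread (documents one level past the
limit stay in the queue for bookkeeping but are never processed) can be
sketched as a breadth-first walk. This is a simplified model and not
ManifoldCF's implementation; the crawl function, the link graph, and the
PROCESSED label are invented for illustration, and only HOPCOUNT_EXCEEDED
comes from the discussion above.

```python
from collections import deque


def crawl(seeds, links, max_hops):
    """Breadth-first walk that records a (hops, state) pair per URL.

    `links` maps each URL to the URLs it links to. URLs found beyond
    `max_hops` are recorded once but never expanded.
    """
    status = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, hops = queue.popleft()
        if url in status:
            continue  # already reached by a path that is no longer
        if hops > max_hops:
            # Kept for bookkeeping only; its links are never followed.
            status[url] = (hops, "HOPCOUNT_EXCEEDED")
            continue
        status[url] = (hops, "PROCESSED")
        for target in links.get(url, []):
            if target not in status:
                queue.append((target, hops + 1))
    return status


# Hypothetical link graph: a -> b -> c -> d
graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
for url, (hops, state) in crawl(["a"], graph, max_hops=1).items():
    print(url, hops, state)
```

In this model a page that is also on the seed list starts at hop 0, so it
could only end up as HOPCOUNT_EXCEEDED if the seed entry were not taken into
account when its hop count was computed.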