manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Hop count problem
Date Tue, 13 Aug 2013 12:43:39 GMT

I couldn't find any more information in the log after the upgrade. Yes, I'm
running version 1.3 now, since I had to log in after the upgrade:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

I tried to fetch one of the missing documents with curl from our prod
server. The response looks OK to me, even though this is curl and not
HttpClient:

-bash-3.2$ curl -vvv -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; sok-core@usit.uio.no)" "http://www.ibsen.uio.no/sakprosa.xhtml"
* About to connect() to www.ibsen.uio.no port 80
*   Trying 129.240.7.27... connected
* Connected to www.ibsen.uio.no (129.240.7.27) port 80
 > GET /sakprosa.xhtml HTTP/1.1
 > Host: www.ibsen.uio.no
 > Accept: */*
 > User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; sok-core@usit.uio.no)
 >
< HTTP/1.1 200 OK
< Date: Tue, 13 Aug 2013 12:40:02 GMT
< Server: Apache-Coyote/1.1
< X-Cocoon-Version: 2.1.12-dev
< Last-Modified: Fri, 09 Aug 2013 09:57:43 GMT
< Content-Type: text/html
< Content-Length: 11209
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
[...]

E
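
As a rough way to rule out the four unlogged skip reasons Karl lists in the quoted message below, the values from the curl response can be run through a small check. This is only a sketch in Python, not ManifoldCF code: the function name `skip_reason`, the size limit, the accepted MIME types, and both sets of patterns are hypothetical placeholders that would have to be replaced with the real job and output connector settings.

```python
import re
from typing import Optional

# Hypothetical settings: placeholders, not ManifoldCF's or Solr's real values.
MAX_LENGTH = 16 * 1024 * 1024
ACCEPTED_MIME_TYPES = {"text/html", "text/plain", "application/xhtml+xml"}
REJECTED_URL_PATTERNS = [re.compile(r"\.(?:css|js)$")]
INDEXABLE_PATTERNS = [re.compile(r"^http://www\.ibsen\.uio\.no/.*\.xhtml$")]


def skip_reason(url: str, mime_type: str, length: int) -> Optional[str]:
    """Return the first skip reason that applies, or None if the document passes."""
    # (1) Length too long
    if length > MAX_LENGTH:
        return "length too long"
    # (2) Output connector rejects mime type (ignore any "; charset=..." suffix)
    if mime_type.split(";")[0].strip().lower() not in ACCEPTED_MIME_TYPES:
        return "output connector rejects mime type"
    # (3) Output connector rejects url
    if any(p.search(url) for p in REJECTED_URL_PATTERNS):
        return "output connector rejects url"
    # (4) Not indexable according to the job's "indexable" regular expressions
    if not any(p.search(url) for p in INDEXABLE_PATTERNS):
        return "not indexable according to job constraints"
    return None


# URL, Content-Type and Content-Length taken from the curl response above.
print(skip_reason("http://www.ibsen.uio.no/sakprosa.xhtml", "text/html", 11209))  # prints None
```

With these placeholder settings the document passes all four checks, which says nothing about the real configuration; it only shows which values from the response each check depends on.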

On 8/13/13 12:04 PM, Karl Wright wrote:
> If this is still 1.2, then these were the unlogged reasons why a
> document could be skipped:
>
> (1) Length too long
> (2) Output connector rejects mime type
> (3) Output connector rejects url
> (4) Document is not considered indexable according to the job
> constraints (the "indexable" regular expressions)
>
> Karl
>
>
>
> On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>     What version of ManifoldCF is this?
>
>     I ask because I updated the logging output in 1.3 to capture a
>     number of cases that previously did not log a reason why they were
>     skipped.
>
>     Karl
>
>
>
>     On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen
>     <e.f.garasen@usit.uio.no> wrote:
>
>
>         OK, I have now changed the log level from INFO to DEBUG for
>         connectors as well. Here's the log:
>         http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>
>         The following entry indicates that one of the missing URLs is
>         found/extracted from a link:
>         DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html
>         document 'http://www.ibsen.uio.no/forside.xhtml', found link to
>         'http://www.ibsen.uio.no/skuespill.xhtml'
>
>         Then the job just ends, and none of the extracted links were
>         ever fetched.
>
>         Erlend
>
>
>         On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>
>
>             Thanks, I will do that tomorrow and report back. I hope we
>             will find a simple explanation. :)
>
>             E
>
>             On 8/12/13 5:07 PM, Karl Wright wrote:
>
>                 Hi Erlend,
>
>                 You have wire logging (httpclient) enabled, which is
>                 useful for debugging fetch issues, but you do not have
>                 connector debugging on.  To turn it on, add this to
>                 properties.xml:
>
>                 <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
>                 thanks,
>                 Karl
>
>
>                 On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen
>                 <e.f.garasen@usit.uio.no> wrote:
>
>                      On 8/12/13 4:29 PM, Karl Wright wrote:
>
>                          Hi Erlend,
>
>                          The Document Status report shows these documents
>                          because they are still in the queue.  The reasons
>                          for this could be several.  Documents that exceed
>                          the hopcount by 1 level are allowed to remain in
>                          the queue for bookkeeping purposes.  The
>                          "scheduled date" as given is only meaningful if
>                          the document is in an active state; my guess is
>                          that these documents are not in fact in that
>                          state, but rather in the state HOPCOUNT_EXCEEDED.
>                          Can you include one complete row from the
>                          Document Status report for one of the missing
>                          documents?
>
>
>                      For "http://www.ibsen.uio.no/sakprosa.xhtml":
>                      Job: Ibsen
>
>                      State: Out of scope
>                      Status: Hopcount exceeded
>                      Scheduled: 01-01-1970 01:00:00.000
>                      Scheduled action: Process
>                      Retry count: N/A
>                      Retry limit: N/A
>
>
>                          When you added documents to the seed list, what
>                          did the Simple History say when they were
>                          fetched?  If they don't appear in the simple
>                          history, they SHOULD have nevertheless appeared
>                          in the log, with an explanation of why they were
>                          excluded, provided you have connector debugging
>                          enabled.
>
>
>                      OK, here is the seed list:
>                      http://www.ibsen.uio.no/
>                      http://www.ibsen.uio.no/skuespill.xhtml
>                      http://www.ibsen.uio.no/dikt.xhtml
>                      http://www.ibsen.uio.no/brev.xhtml
>                      http://www.ibsen.uio.no/sakprosa.xhtml
>                      http://www.ibsen.uio.no/varia.xhtml
>                      http://www.ibsen.uio.no/undervisningsressurser.xhtml
>
>                      Here are the results from simple history:
>                      08-12-2013 16:46:26.536  job end                 1368534065016(Ibsen)                         0      1
>                      08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK    11897  178
>                      08-12-2013 16:46:09.751  fetch                   http://www.ibsen.uio.no/forside.xhtml  200   11897  17
>                      08-12-2013 16:44:48.829  fetch                   http://www.ibsen.uio.no/               302   0      79484
>                      08-12-2013 16:44:48.727  robots parse            www.ibsen.uio.no:80                    HTML  0      2      Robots file contained HTML, skipped
>                      08-12-2013 16:44:46.574  job start               1368534065016(Ibsen)                         0      1
>                      1
>
>                      HttpClient log:
>                      http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>
>                      Erlend
>
>
>
>
>
>
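
Karl's remark in the thread above, that documents exceeding the hop count by one level stay in the queue for bookkeeping, can be pictured with a small breadth-first walk. This is only an illustrative sketch in Python, not ManifoldCF's actual implementation: the function name `crawl_states`, the state label "PROCESSED", and the example.org URLs are made up for the example, while HOPCOUNT_EXCEEDED is the state name Karl mentions.

```python
from collections import deque
from typing import Dict, List


def crawl_states(links: Dict[str, List[str]], seeds: List[str], max_hops: int) -> Dict[str, str]:
    """Label every discovered URL, walking the link graph breadth-first.

    URLs within max_hops of a seed are "PROCESSED". URLs exactly one hop
    beyond the limit are recorded as "HOPCOUNT_EXCEEDED" but never expanded,
    so anything they link to is not discovered at all.
    """
    states: Dict[str, str] = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, hops = queue.popleft()
        if url in states:
            continue  # already reached by a path that is at least as short
        if hops > max_hops:
            states[url] = "HOPCOUNT_EXCEEDED"  # kept for bookkeeping only
            continue
        states[url] = "PROCESSED"
        for target in links.get(url, []):
            queue.append((target, hops + 1))
    return states


# Hypothetical link graph: seed -> a -> b -> c
toy_links = {
    "http://example.org/": ["http://example.org/a"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/c"],
}
for url, state in crawl_states(toy_links, ["http://example.org/"], max_hops=1).items():
    print(state, url)
```

With `max_hops=1` the seed and `/a` are processed, `/b` is recorded as HOPCOUNT_EXCEEDED, and `/c` never appears because `/b` is not expanded.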

