manifoldcf-user mailing list archives

From Shigeki Kobayashi (Information Systems Division / Service Planning Department) <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Date Fri, 11 May 2012 02:09:19 GMT
Karl,

> There should be a "Scheduled" value also listed which is *when* the URL will be retried

So I see values in "Scheduled" and "Retry Limit". The next re-crawl is
two hours later, and the final attempt is six hours later. That sounds
like too long a wait. Are you planning to add a feature that lets you
change these waiting periods, or does such a thing already exist?

Thanks for sharing your knowledge.

Best regards,

Shigeki
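
The "Scheduled" / "Retry Limit" pair discussed above can be modeled as a simple timed-retry policy: requeue the URL at a fixed interval until a hard cutoff passes, then give up. The sketch below is illustrative only, with hypothetical class and constant names; it is not ManifoldCF's actual scheduling code, and the interval/limit values are just the ones mentioned in this thread.

```java
// Minimal sketch of a retry schedule like the one described in the thread:
// retry at a fixed interval, give up after a hard limit. Hypothetical names;
// not ManifoldCF's real implementation.
public class RetrySchedule {
    // Interval and limit are assumptions taken from the discussion above.
    static final long RETRY_INTERVAL_MS = 5L * 60 * 1000;       // retry every 5 minutes
    static final long RETRY_LIMIT_MS   = 6L * 60 * 60 * 1000;   // give up after 6 hours

    /**
     * Returns the next "Scheduled" retry time in epoch millis,
     * or -1 when the retry limit has passed (document is dropped).
     */
    public static long nextRetry(long firstFailureMs, long nowMs) {
        if (nowMs - firstFailureMs >= RETRY_LIMIT_MS)
            return -1L;                       // past the "Retry Limit": stop retrying
        return nowMs + RETRY_INTERVAL_MS;     // requeue with a new "Scheduled" time
    }

    public static void main(String[] args) {
        long t0 = 0L;
        System.out.println(nextRetry(t0, 0L));              // first retry, 5 minutes later
        System.out.println(nextRetry(t0, RETRY_LIMIT_MS));  // limit reached: -1
    }
}
```

A configurable feature would presumably expose the two constants above as connection or job parameters; as of the 0.5 timeframe discussed here, the thread suggests they were fixed.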

2012/5/10 Karl Wright <daddywri@gmail.com>

> "Waiting for Processing" means that the URL will be retried.  There
> should be a "Scheduled" value also listed which is *when* the URL will
> be retried, and a "Scheduled action" column that says "Process".  If
> you see these things you only need to wait until the time specified
> and the document will be recrawled.
>
> Karl
>
> On Wed, May 9, 2012 at 9:54 PM, Shigeki Kobayashi (Information Systems Division / Service Planning Department)
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> > Karl,
> >
> > Thanks for the reply.
> >
> >
> >> For web crawling, no single URL failure will cause the job to
> > abort;
> >
> > OK, so I understand if I want it stopped, I need to manually abort the
> job.
> >
> >
> >> You can check on the status of an individual URL by using the Document
> > Status report.
> >
> > The Document Status report says the seed URL is "Waiting for Processing",
> > which makes sense because the connection is refused. The report does not
> > show a retry count.
> >
> > The MCF log outputs an exception. Is this also expected behavior?
> > -----
> >
> > DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception
> > for 'http://xxx.xxx.xxx/index.html'
> > java.net.ConnectException: Connection refused
> >     at java.net.PlainSocketImpl.socketConnect(Native Method)
> >     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> >     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> >     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> >     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> >     at java.net.Socket.connect(Socket.java:529)
> >     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
> >     at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
> >     at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
> >     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
> >     at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
> >     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
> >     at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
> >
> >  WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service
> > interruption reported for job 1335340623530 connection 'WEB': Timed out
> > waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection
> > refused
> >
> >
> > -----
> >
> > Regards,
> >
> > Shigeki
> >
> >
> > 2012/5/9 Karl Wright <daddywri@gmail.com>
> >>
> >> Hi,
> >>
> >> ManifoldCF's web connector is, in general, very cautious about not
> >> offending the owners of sites.  If it concludes that the site has
> >> blocked access to a URL, it may remove the URL from its queue for
> >> politeness, which would prevent further crawling of that URL for the
> >> duration of the current job.  Under most cases, however, if a URL is
> >> temporarily unavailable, it will be requeued for crawling at a later
> >> time.  The typical pattern is to attempt to recrawl the URL
> >> periodically (e.g. every 5 minutes) for many hours before giving up on
> >> it.  For web crawling, no single URL failure will cause the job to
> >> abort; it will continue running until all the other URLs have been
> >> processed or forever (if the job is continuous).
> >>
> >> You can check on the status of an individual URL by using the Document
> >> Status report.  This report should tell you what ManifoldCF intends to
> >> do with a specific document.  If you locate one such URL and try out
> >> this report, what does it say?
> >>
> >> Karl
> >>
> >>
> >> On Tue, May 8, 2012 at 10:04 PM, Shigeki Kobayashi (Information Systems Division / Service Planning Department)
> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >> >
> >> > Hi guys.
> >> >
> >> >
> >> >
> >> > I need some advice on stopping the MCF web crawler from its running
> >> > state when a network connection is refused.
> >> >
> >> >
> >> >
> >> > I use MCF 0.5 with Solr 3.5. I was testing what would happen to the
> >> > web crawler when I shut down the web site being crawled. I checked
> >> > the simple history and saw "Connection refused" with a status code of
> >> > "-1", which looked fine. But as I waited, the job status never changed
> >> > and remained running. The crawler never crawls in this situation, but
> >> > even after I brought the web site back up, it never started crawling
> >> > again either.
> >> >
> >> > At the very least, I want the crawler to stop running when a network
> >> > connection is refused, but I don't know how. Does anyone have any
> >> > ideas?
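
The WARN line in the log excerpt above ("Pre-ingest service interruption reported ...") matches the behavior Karl describes: a refused connection is classified as a temporary interruption with a later retry time, not a fatal job error. The sketch below illustrates that classification pattern only; the exception class and method names are hypothetical stand-ins, not ManifoldCF's actual API.

```java
// Illustrative sketch: turn transient network failures into a "retry later"
// signal instead of aborting the job. Names are hypothetical, not the real
// ManifoldCF classes.
public class FetchClassifier {
    /** Hypothetical stand-in for a service-interruption signal with a retry time. */
    static class ServiceInterruption extends Exception {
        final long retryAtMs;
        ServiceInterruption(String msg, long retryAtMs) {
            super(msg);
            this.retryAtMs = retryAtMs;
        }
    }

    /** Classify a fetch failure: transient errors become a retry-later signal. */
    static void classify(java.io.IOException e, String url) throws ServiceInterruption {
        if (e instanceof java.net.ConnectException
                || e instanceof java.net.SocketTimeoutException) {
            // Transient: schedule a retry (5 minutes here, an assumed value)
            // rather than failing the whole job.
            throw new ServiceInterruption(
                "Fetch failed for '" + url + "': " + e.getMessage(),
                System.currentTimeMillis() + 5L * 60 * 1000);
        }
        // Other IOExceptions would be handled as permanent errors here.
    }

    public static void main(String[] args) {
        try {
            classify(new java.net.ConnectException("Connection refused"),
                     "http://xxx.xxx.xxx/index.html");
            System.out.println("no interruption");
        } catch (ServiceInterruption si) {
            System.out.println("retry-later: " + si.getMessage());
        }
    }
}
```

Under this model the job status stays "running" on purpose while the document waits for its scheduled retry, which is consistent with what Shigeki observed.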
