manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Date: Fri, 11 May 2012 09:07:26 GMT
Shigeki,

There are dozens of individual kinds of error that the Web Connector
detects and retries for; it would of course be possible to allow users
to set parameters to control all of them but it seems to me like it
would be almost too much freedom.  And, like I said initially, one
prime reason for the retry strategies of each error type is to avoid
having ManifoldCF behave badly and get blocked by the webmaster of the
site being crawled.

Having said that, if you have a case for changing the strategy for any
particular kind of error, we can certainly look into that.

In the case of connect exceptions, because there is a fairly long
socket timeout when trying to connect (it's measured in minutes), and
because attempting to connect ties up a worker thread for that whole
time, you really don't want to retry too frequently.  You could make
the case for retrying over a longer period of time (say, 12 or 24
hours), or for retrying slightly more frequently (every 1 hour instead
of every 2 hours).  If you have a case for doing that, please go ahead
and create a ticket.
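
To make the numbers concrete, here is a minimal sketch (not the actual
Web Connector code) of the scheduling logic this implies, using the
two-hour interval and six-hour limit from the Document Status report
values you mention below:

    // Minimal sketch of a fixed-interval retry schedule with a hard
    // give-up window.  Illustrative only; the real Web Connector picks
    // intervals and windows per error type.
    public class ConnectRetrySchedule {
      static final long RETRY_INTERVAL_MS = 2L * 3600L * 1000L; // 2 hours
      static final long RETRY_WINDOW_MS   = 6L * 3600L * 1000L; // 6 hours

      /**
       * Returns the next time to retry the URL, or -1L if the retry
       * window has closed and the document should be given up on.
       */
      public static long nextRetryTime(long firstFailureTime, long now) {
        long next = now + RETRY_INTERVAL_MS;
        if (next > firstFailureTime + RETRY_WINDOW_MS)
          return -1L; // past the "Retry Limit"; stop retrying this URL
        return next;
      }
    }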

Thanks,
Karl



On Thu, May 10, 2012 at 10:09 PM, Shigeki Kobayashi (Information
Systems Division / Service Planning Department)
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
> Karl,
>
>> There should be a "Scheduled" value also listed which is *when* the URL
>> will be retried
>
> So, I see values in "Scheduled" and "Retry Limit". The next re-crawl is
> two hours later and the final crawl is six hours later. That seems like
> too much waiting. Are you guys planning to add a feature that lets us
> change these waiting periods, or does such a thing already exist?
>
> Thanks for sharing your knowledge.
>
> Best regards,
>
> Shigeki
>
> 2012/5/10 Karl Wright <daddywri@gmail.com>
>>
>> "Waiting for Processing" means that the URL will be retried.  There
>> should be a "Scheduled" value also listed which is *when* the URL will
>> be retried, and a "Scheduled action" column that says "Process".  If
>> you see these things you only need to wait until the time specified
>> and the document will be recrawled.
>>
>> Karl
>>
>> On Wed, May 9, 2012 at 9:54 PM, Shigeki Kobayashi (Information
>> Systems Division / Service Planning Department)
>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> > Karl,
>> >
>> > Thanks for the reply.
>> >
>> >
>> >> For web crawling, no single URL failure will cause the job to
>> >> abort;
>> >
>> > OK, so I understand if I want it stopped, I need to manually abort the
>> > job.
>> >
>> >
>> >> You can check on the status of an individual URL by using the Document
>> >> Status report.
>> >
>> > The Document Status report says the seed URL is "Waiting for
>> > Processing", which makes sense because the connection is refused. The
>> > report does not show a retry count.
>> >
>> > The MCF log outputs an exception. Is this also expected behavior?:
>> > -----
>> > DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception for 'http://xxx.xxx.xxx/index.html'
>> > java.net.ConnectException: Connection refused
>> >     at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>> >     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>> >     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>> >     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>> >     at java.net.Socket.connect(Socket.java:529)
>> >     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >     at java.lang.reflect.Method.invoke(Method.java:597)
>> >     at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
>> >     at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
>> >     at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
>> >     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
>> >     at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
>> >     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
>> >     at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
>> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
>> >  WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service interruption reported for job 1335340623530 connection 'WEB': Timed out waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection refused
>> > -----
>> >
>> > Regards,
>> >
>> > Shigeki
>> >
>> >
>> > 2012/5/9 Karl Wright <daddywri@gmail.com>
>> >>
>> >> Hi,
>> >>
>> >> ManifoldCF's web connector is, in general, very cautious about not
>> >> offending the owners of sites.  If it concludes that the site has
>> >> blocked access to a URL, it may remove the URL from its queue for
>> >> politeness, which would prevent further crawling of that URL for the
>> >> duration of the current job.  In most cases, however, if a URL is
>> >> temporarily unavailable, it will be requeued for crawling at a later
>> >> time.  The typical pattern is to attempt to recrawl the URL
>> >> periodically (e.g. every 5 minutes) for many hours before giving up on
>> >> it.  For web crawling, no single URL failure will cause the job to
>> >> abort; it will continue running until all the other URLs have been
>> >> processed or forever (if the job is continuous).
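>> >>
>> >> For background, the mechanism behind this requeueing is the
>> >> ServiceInterruption exception: a connector throws it with both a
>> >> next-retry time and a give-up time, and the framework reschedules
>> >> the document (this is what "service interruption reported" in the
>> >> log refers to). Here is a minimal sketch of the pattern; the exact
>> >> constructor arguments are an assumption from typical connector
>> >> code, not necessarily what the Web Connector passes:
>> >>
>> >>     // Sketch only: how a connector signals "retry this document
>> >>     // later".  The argument list is an assumption, not verified
>> >>     // against the 0.5 Web Connector source.
>> >>     long currentTime = System.currentTimeMillis();
>> >>     throw new ServiceInterruption(
>> >>         "Connection refused: " + e.getMessage(),
>> >>         e,                            // underlying cause
>> >>         currentTime + 5L * 60000L,    // retry in 5 minutes
>> >>         currentTime + 6L * 3600000L,  // give up after 6 hours
>> >>         -1,                           // no cap on the retry count
>> >>         false);                       // don't abort the job on give-up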
>> >>
>> >> You can check on the status of an individual URL by using the Document
>> >> Status report.  This report should tell you what ManifoldCF intends to
>> >> do with a specific document.  If you locate one such URL and try out
>> >> this report, what does it say?
>> >>
>> >> Karl
>> >>
>> >>
>> >> On Tue, May 8, 2012 at 10:04 PM, Shigeki Kobayashi (Information
>> >> Systems Division / Service Planning Department)
>> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >> >
>> >> > Hi guys.
>> >> >
>> >> > I need some advice on stopping the MCF web crawler from a running
>> >> > state when a network connection is refused.
>> >> >
>> >> > I use MCF 0.5 with Solr 3.5. I was testing what would happen to the
>> >> > web crawler when shutting down the web site that is to be crawled. I
>> >> > checked the simple history and saw "Connection refused" with a status
>> >> > code of "-1", which looked fine. But as I kept waiting, the job status
>> >> > never changed and remained running. The crawler never crawls in this
>> >> > situation, but when I brought the web site back up, the crawler never
>> >> > started crawling again either.
>> >> >
>> >> > At least, somehow, I want the crawler to stop running when a network
>> >> > connection is refused, but I don't know how. Does anyone have any
>> >> > ideas?
