manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Date Mon, 19 Mar 2012 04:46:45 GMT
Abe-san,

Thank you for the info.

That's a good idea. I hope I can avoid the job interruption this way.


Regards,

Shigeki

2012/3/19 Shinichiro Abe <shinichiro.abe.1@gmail.com>

> Hi,
>
> Currently MCF can't ignore a 500 server error returned by Solr.
> If you can upgrade to Solr 3.2, you can specify ignoreTikaException.
> https://issues.apache.org/jira/browse/SOLR-2480
> Hope that helps.
>
> Regards,
> Shinichiro Abe
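
A minimal sketch of how the ignoreTikaException parameter from SOLR-2480 could be passed as a request parameter to the extracting request handler. The base URL http://localhost:8983/solr, the /update/extract handler path, and the class and method names are assumptions for illustration, not taken from this thread; in a ManifoldCF setup the parameter would more likely be set as a handler default in solrconfig.xml or as an argument on the Solr output connection.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of Solr or ManifoldCF. */
final class ExtractUrlSketch {

    /**
     * Builds an extract-handler URL for one document.
     * solrBase and the "/update/extract" path are assumed values.
     */
    static String extractUrl(String solrBase, String docId, boolean ignoreTikaException) {
        return solrBase + "/update/extract"
            // literal.id supplies the document's unique key
            + "?literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
            // with true, Solr 3.2+ is expected to skip the body on a Tika failure
            + "&ignoreTikaException=" + ignoreTikaException;
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("http://localhost:8983/solr", "http://xx.xx/xx/xx.pdf", true));
    }
}
```

With the parameter enabled, a document that Tika cannot parse should no longer produce a 500 response, so MCF would have nothing to retry for that document.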
>
> On 2012/03/19, at 12:55, Shigeki Kobayashi wrote:
>
> > Karl,
> >
> >
> > Thanks for your reply.
> >
> > It seems that Tika failed to extract text from PDF files while
> > crawling web links. I confirmed that there was a Tika exception
> > following the Solr exception.
> >
> > So Solr, on detecting the Tika exception, returns status code 500,
> > and MCF then retries the ingestion a certain number of times:
> >
> > "500 from ingestion request; ingestion will be retried again later"
> >
> > In the end, MCF shuts down the entire job.
> >
> > I know I should upgrade Solr (including Tika) to improve document
> > extraction. But even the current version of Tika still sometimes
> > fails to extract documents, so I feel it would make more sense for
> > MCF to ignore ingestion errors caused by Tika and proceed.
> >
> > Have users requested that MCF ignore document ingestion failures
> > caused by Tika and proceed, perhaps in the next release?
> >
> > Is there any option that lets users have MCF ignore such ingestion
> > errors and proceed?
> >
> >
> > regards,
> >
> > Shigeki
> >
> > 2012/3/16 Karl Wright <daddywri@gmail.com>
> > Hi Shigeki,
> >
> > A "service interruption" means that a connector (either a repository
> > connector like the web connector or an output connector like the Solr
> > connector) could not communicate with the configured service.
> >
> > "Repeated service interruptions" means that certain URLs failed to
> > fetch properly even after a pattern of retries which lasted many
> > hours.  ManifoldCF connectors deal with such errors in one of several
> > ways, depending on the exact details of the error:
> >
> > - ignore it and proceed
> > - retry periodically for some time interval, and then give up and proceed
> > - retry periodically for some time interval, and then shut down the job
> >
> > It sounds like your job has encountered one of the latter errors.  The
> > "Error: Repeated service interruptions - failure processing document:
> > Ingestion HTTP error code 500" indicates that the problem is due to
> > communication with Solr.  Apparently certain documents you are
> > indexing are causing Solr to return an error code 500, which is an
> > "internal server error", and is usually associated with a Solr
> > exception.  You will need to diagnose why this is, and take corrective
> > steps, in order for your ManifoldCF job to complete successfully.
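
The three error-handling behaviors Karl lists can be sketched roughly as follows. This is an illustration only: the class, enum, and method names are hypothetical and are not ManifoldCF APIs, and a real connector would wait between retries rather than loop immediately.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch; none of these names come from ManifoldCF. */
final class IngestionRetrySketch {

    /** The three policies described in the message above. */
    enum Policy { IGNORE_AND_PROCEED, RETRY_THEN_SKIP, RETRY_THEN_ABORT }

    enum Outcome { INGESTED, SKIPPED, JOB_ABORTED }

    /**
     * Calls attempt (which returns an HTTP status code) until it succeeds
     * or the policy's attempt budget is used up.
     */
    static Outcome ingest(IntSupplier attempt, Policy policy, int maxAttempts) {
        // "Ignore and proceed" makes a single attempt; the others retry.
        int tries = (policy == Policy.IGNORE_AND_PROCEED) ? 1 : maxAttempts;
        for (int i = 0; i < tries; i++) {
            int status = attempt.getAsInt();
            if (status >= 200 && status < 300) {
                return Outcome.INGESTED;
            }
            // A real connector would sleep here before the next retry.
        }
        // Only the third policy stops the whole job; the others move on.
        return policy == Policy.RETRY_THEN_ABORT ? Outcome.JOB_ABORTED : Outcome.SKIPPED;
    }

    public static void main(String[] args) {
        IntSupplier alwaysFails = () -> 500; // simulates Solr returning 500 every time
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + ingest(alwaysFails, p, 3));
        }
    }
}
```

In these terms, the job described in this thread behaves like the third policy: the 500 response persists through every retry, so the job is shut down.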
> >
> > "Job no longer active" is harmless - it's a side effect of the job
> > shutting down.  When a job is shutting down, active document
> > processing cannot always be interrupted within a connector, but the
> > framework helps it to stop quickly by throwing this exception.
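
The "Job no longer active" mechanism can be pictured as a cooperative stop check: long-running work polls a job-active flag and throws as soon as the flag is cleared. The sketch below only illustrates that pattern under assumed names (JobStopSketch, JobInactiveException, fetch); it is not the actual ManifoldCF implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a cooperative stop check; not ManifoldCF code. */
final class JobStopSketch {

    /** Thrown when the job has been stopped while work is in progress. */
    static final class JobInactiveException extends RuntimeException {
        JobInactiveException(String message) {
            super(message);
        }
    }

    /**
     * Simulates reading a document in chunks, checking before each chunk
     * whether the owning job is still active. Returns the chunks read.
     */
    static int fetch(int totalChunks, AtomicBoolean jobActive) {
        int read = 0;
        for (int i = 0; i < totalChunks; i++) {
            if (!jobActive.get()) {
                // Abandon the fetch quickly instead of finishing the document.
                throw new JobInactiveException("Interrupted: Job no longer active");
            }
            read++;
        }
        return read;
    }

    public static void main(String[] args) {
        AtomicBoolean jobActive = new AtomicBoolean(true);
        System.out.println("chunks read: " + fetch(5, jobActive));
        jobActive.set(false); // the job is shutting down
        try {
            fetch(5, jobActive);
        } catch (JobInactiveException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Seen this way, the -104 entries in Simple History are a consequence of the shutdown rather than its cause.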
> >
> > Thanks,
> > Karl
> >
> >
> > 2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Department) <shigeki.kobayashi3@g.softbank.co.jp>:
> > >
> > > I had been crawling web sites with links to HTML and PDF files
> > > using the provided multiprocess-example agent for a few hours when
> > > Simple History started showing a -104 result code with the message
> > > "Interrupted: Job no longer active".
> > >
> > > After the same error had occurred around 40 times, the job status
> > > became "Aborting" and then ended up as "Error: Repeated service
> > > interruptions - failure processing document: Ingestion HTTP error
> > > code 500".
> > >
> > > The job was interrupted and stopped.
> > >
> > > Does anyone know what situation causes "Repeated service
> > > interruptions" and stops jobs?
> > > Also, under what circumstances does the error status code -104
> > > occur? What does the code -104 mean?
> > >
> > > If you have any ideas, please advise me on how to avoid this error.
> > >
> > >
> > > I am using the following:
> > >
> > > Solr 1.4 (Extracting Request Handler is set)
> > > ManifoldCF 0.4 (multiprocess-example)
> > > - Repository connector: WEB
> > > - Output connector: Solr
> > > Tomcat 6.0.29
> > > PostgreSQL 9.1.3
> > >
> > >
> > > Here is MCF’s debug log right before the job was interrupted:
> > >
> > > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
> > > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
> > > DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
> > > DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
> > > DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > >  INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Job no longer active
> > > DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for 'http://xx.xx/xx/xx.pdf'
> > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job no longer active
> > >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
> > >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
> > >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> > > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job no longer active
> > >         at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
> > >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
> > >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
> > >         ... 1 more
> > >  WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service interruption reported for job 1331716457096 connection 'web': Job no longer active
> > > DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
> > > DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job 1331716457096 in need of notification
> >
> >
> >
> > --
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> >  SoftBank Mobile Corp.
> >  Information Systems Division
> >  System Services Business Unit
> >  Service Planning Department
> >
> >  Shigeki Kobayashi
> >  shigeki.kobayashi3@g.softbank.co.jp
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >
> >
>
>


-- 
~~~~~~~~~~~~~~~~~~~~~~~~
 SoftBank Mobile Corp.
 Information Systems Division
 System Services Business Unit
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
~~~~~~~~~~~~~~~~~~~~~~~~
