manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF documentum indexing issue
Date Wed, 14 Jun 2017 13:35:44 GMT
I posted the pertinent question to the solr dev list.  Let's see what they
say.

Thanks,
Karl


On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi,
>
> The exception in the solr.log should be reported as a Solr bug.  It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
> I wish there was an easy fix for this.  The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
> (500)
> <<<<<<
>
> Karl
>
>
>
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> After configuring Solr to ignore Tika errors by adding the Tika transformer
>> to the job, the following behavior is observed.
>>
>>
>>
>> 1)      ManifoldCF fetches a document from Documentum whose content is
>> null and tries to push it to the output connector (Solr).
>>
>> 2)      Solr cannot accept null as a value and throws a “Missing
>> content stream” error.
>>
>> 3)      Each agent thread in ManifoldCF is internally held up on a
>> different r_object_id that has no body content and keeps trying to push
>> the content to Solr after each failure, but Solr cannot accept the
>> content and throws the same error.
>>
>> 4)      Over time, the ManifoldCF job stops with the error thrown by
>> Solr.
>>
>>
>>
>> Please let us know if there is any configuration change that can help us
>> resolve this issue.
>>
>>
>>
>> Please find attached the ManifoldCF error log, Solr error log, and agent
>> log.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran.
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Hi Tamizh,
>>
>>
>>
>> The reported error is 'Error from server at
>> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of
>> range: -188'.  The message seemingly indicates that the error was
>> *received* from the Solr
>> server for one specific document.  ManifoldCF does not recognize the error
>> as being innocuous and therefore it will retry for a while until it
>> eventually gives up and halts the job.  However, I cannot find that exact
>> text anywhere in the Solr output connector code, so I wonder if you
>> transcribed it correctly?
>>
>> There should also be the following:
>>
>> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
>> trace attached to each one;
>>
>> (2) Simple history records for that document that are of the type
>> INGESTDOCUMENT.
>>
>> (3) Solr log entries that have a Solr stack trace.
>>
>>
>>
>> The last one is the one that would be the most helpful.  It is possible
>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>> itself in this way.  You can (and should) configure your Solr to ignore
>> Tika errors.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>> tthamizharasan@worldbankgroup.org> wrote:
>>
>> Hi,
>>
>>
>>
>> ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper (ZK) model and is
>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>> Documentum contents and push them to the Solr 5.3.2 output. The crawler-ui
>> app is installed on Tomcat, and the startup script points to the ManifoldCF
>> properties.xml during server startup. ManifoldCF, the bundled ZK, and
>> Tomcat are running on the same host, whose OS is Red Hat Enterprise Linux
>> Server release 6.9 (Santiago). The DB is running on a Windows box.
>>
>> ZK is integrated with the DB through properties.xml and
>> properties-global.xml.
>>
>> ZK and the Documentum-related processes (registry and server) are up, and
>> the two agents (start-agents.sh and start-agents-2.sh) are started; they
>> spawn multiple threads to index the Documentum contents into Solr through
>> ManifoldCF.
>>
>>
>>
>> The current connection limits configured in ManifoldCF are as follows.
>>
>> Solr output max connections: 25
>>
>> Document repository max connections: 25
>>
>> Properties.xml:
>>
>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>
>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>
>> Total Documentum document count: 0.5 million
>>
>>
>>
>> After the job is started, it indexes some 20000+ documents and then
>> terminates with the following error on the ManifoldCF job.
>>
>> Error: Repeated service interruptions - failure processing document:
>> Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
>> String index out of range: -188
>>
>>
>>
>> Please find the attached manifoldCF error log and agent log.
>>
>>
>>
>> Please let me know your observations on the cause of the issue and on the
>> thread configuration used for crawling. Please share your thoughts.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran
>>
>>
>>
>>
>>
>
>

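For readers following the thread, the retry-then-halt behavior described above (an error response that is not recognized as innocuous is retried for a while, then the job is stopped) can be pictured with a minimal Python sketch. This is not ManifoldCF's connector code. `RetryPolicy`, `push_document`, `JobAborted`, the attempt limit of 3, and the document id `doc-1` are all hypothetical names and values chosen for illustration; the thread does not state the real limits.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class JobAborted(Exception):
    """Raised when one document keeps failing and the whole job is halted."""


@dataclass
class RetryPolicy:
    # Hypothetical limit; the thread does not give ManifoldCF's real value.
    max_attempts: int = 3
    # Status codes treated as harmless (record and skip the document).
    # Empty by default, mirroring the thread: the 500 is not seen as innocuous.
    innocuous_statuses: frozenset = frozenset()


def push_document(
    doc_id: str,
    send: Callable[[str], int],
    policy: RetryPolicy,
    history: List[Tuple[str, int, int]],
) -> str:
    """Try to index one document, retrying on unrecognized errors."""
    status = 0
    for attempt in range(1, policy.max_attempts + 1):
        status = send(doc_id)  # returns an HTTP-style status code
        history.append((doc_id, attempt, status))  # one record per attempt
        if 200 <= status < 300:
            return "indexed"
        if status in policy.innocuous_statuses:
            return "skipped"
        # Any other status is treated as a transient interruption: retry.
    raise JobAborted(
        f"giving up on {doc_id} after {policy.max_attempts} attempts "
        f"(last status {status})"
    )


if __name__ == "__main__":
    attempts: List[Tuple[str, int, int]] = []
    try:
        # A server that always answers 500 for this document.
        push_document("doc-1", lambda _doc: 500, RetryPolicy(), attempts)
    except JobAborted as exc:
        print(exc)
    print(len(attempts), "attempts recorded")
```

The `innocuous_statuses` field only illustrates the distinction between a recoverable and an unrecoverable response; it is not meant to suggest that such a setting exists in ManifoldCF.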