I posted the pertinent question to the Solr dev list. Let's see what they say.

Thanks,
Karl


On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddywri@gmail.com> wrote:
Hi,

The exception in the solr.log should be reported as a Solr bug.  It is not emanating from the Tika extractor (Solr Cell), but is in Solr itself.

I wish there was an easy fix for this.  The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
<<<<<<

Karl




On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <tthamizharasan@worldbankgroup.org> wrote:

Hi Karl,

 

After configuring Solr to ignore Tika errors by adding the Tika transformer to the job, we observed the following behavior.

 

1)      ManifoldCF fetches content from Documentum that is null and tries to push it to the output connector (Solr).

2)      Solr cannot accept null as a value and throws a “Missing content stream” error.

3)      Each agent thread in ManifoldCF is held up on a different r_object_id that has no body content. The thread keeps retrying the push to Solr after each failure, but Solr rejects the content and throws the same error.

4)      Over time, the ManifoldCF job stops with the error thrown by Solr.

 

Please let us know if there is any configuration change that can help us resolve this issue.

 

Please find attached the ManifoldCF error log, Solr error log, and agent log.

 

Regards,

Tamizh Kumaran.

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, June 13, 2017 2:23 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

 

Hi Tamizh,

 

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'.  The message seemingly indicates that the error was *received* from the solr server for one specific document.  ManifoldCF does not recognize the error as being innocuous and therefore it will retry for a while until it eventually gives up and halts the job.  However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:

(1) A record of the attempts in the manifoldcf.log file, with an MCF stack trace attached to each one;

(2) Simple history records for that document that are of the type INGESTDOCUMENT;

(3) Solr log entries that have a Solr stack trace.

 

The last one is the one that would be the most helpful.  It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way.  You can (and should) configure your Solr to ignore Tika errors.
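
As an illustration of the Solr-side setting, here is a minimal sketch of a solrconfig.xml fragment. It assumes the extracting handler (Solr Cell) is registered under the usual /update/extract name; the handler name and the surrounding attributes are assumptions about your setup, not something taken from your logs. ignoreTikaException is the Solr Cell parameter that asks the handler to skip exceptions raised during extraction and index whatever metadata is available.

```xml
<!-- Sketch only: assumes the extracting handler is registered at /update/extract;
     adjust to match the handler your Solr output connection actually posts to. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Skip Tika extraction failures instead of failing the whole request -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

This should only affect failures inside Tika extraction; an exception raised elsewhere in Solr would still come back to ManifoldCF as an error.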

 

Thanks,

Karl

 

 

 

 

On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <tthamizharasan@worldbankgroup.org> wrote:

Hi,

 

ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper (ZK) model and is integrated with PostgreSQL 9.3. The intended setup is to crawl the Documentum contents and push them to the Solr 5.3.2 output. The crawler-ui app is installed on Tomcat, and the startup script points to the ManifoldCF properties.xml during server startup. ManifoldCF, the bundled ZK, and Tomcat are running on the same host, whose OS is Red Hat Enterprise Linux Server release 6.9 (Santiago). The DB is running on a Windows box.

ZK is integrated with the DB through properties.xml and properties-global.xml.

ZK and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) have been started; they spawn multiple threads to index the Documentum contents into Solr through ManifoldCF.

 

The current connection limits configured in ManifoldCF are as follows.

Solr output max connections: 25

Document repository max connections: 25

properties.xml:

<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>

<property name="org.apache.manifoldcf.crawler.threads" value="25"/>

Total Documentum document count: 0.5 million

 

After the job was started, it indexed some 20,000+ documents and was then terminated with the following error on the ManifoldCF job:

Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

 

Please find attached the ManifoldCF error log and agent log.

 

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

 

Regards,

Tamizh Kumaran