manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Documentum job stops on error
Date Fri, 14 Jul 2017 11:02:08 GMT
I have created a ticket (CONNECTORS-1444) to track this issue, and attached
a fix.  I've also committed the fix to trunk.

The fix is not the code change you made. Instead, it introduces a new
kind of DocumentumException: CORRUPTEDDOCUMENT.  This will be thrown
whenever permanent document corruption is detected, and will cause the
document to be skipped and not indexed.
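
For readers following the thread, the idea can be sketched roughly as below.
This is not the committed patch: the class ErrorClassifier, the enum Kind, and
the choice to match on the bracketed error code in the message text are all
assumptions made for illustration. The two error codes are the ones reported
earlier in this thread.

```java
// Illustrative sketch only; not the code committed for CONNECTORS-1444.
// ErrorClassifier and Kind are hypothetical names.
public class ErrorClassifier {

  /** How a caller should react to a repository error. */
  public enum Kind { CORRUPTED_DOCUMENT, RETRY_LATER }

  // Error codes reported in this thread as permanent for the affected document.
  private static final String[] PERMANENT_CODES = {
    "[DM_PLATFORM_E_INTEGER_CONVERSION_ERROR]",
    "[DM_OBJECT_E_LOAD_INVALID_STRING_LEN]"
  };

  /** Classifies an error by looking for a known code in its message text. */
  public static Kind classify(String errorMessage) {
    if (errorMessage != null) {
      for (String code : PERMANENT_CODES) {
        if (errorMessage.contains(code)) {
          return Kind.CORRUPTED_DOCUMENT;
        }
      }
    }
    // Anything unrecognized is treated as transient (an assumption of this sketch).
    return Kind.RETRY_LATER;
  }

  public static void main(String[] args) {
    String msg = "[DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  \"Error loading object\"";
    // A caller would skip the document for CORRUPTED_DOCUMENT and requeue it otherwise.
    System.out.println(classify(msg));
  }
}
```

Matching on message text is fragile, but it mirrors what the existing getFile
handler quoted below already does for DM_CONTENT_E_CANT_START_PULL.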

The "DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED" error should cause the
connector to retry the document at a later time, so if this is indeed not a
permanent error, no special fix should be required.
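
As a rough sketch of what "retry at a later time" means in practice, assume a
fixed retry interval and an overall deadline measured from the first failure.
The class RetrySchedule and both of its parameters are hypothetical; this does
not reproduce ManifoldCF's actual retry machinery.

```java
// Illustrative sketch only; RetrySchedule is a hypothetical name and does not
// reproduce ManifoldCF's actual retry machinery.
public class RetrySchedule {
  private final long retryIntervalMillis;
  private final long giveUpAfterMillis;

  public RetrySchedule(long retryIntervalMillis, long giveUpAfterMillis) {
    this.retryIntervalMillis = retryIntervalMillis;
    this.giveUpAfterMillis = giveUpAfterMillis;
  }

  /**
   * Returns the time of the next attempt, or -1 if the next attempt would
   * fall after the deadline measured from the first failure.
   */
  public long nextAttempt(long firstFailureMillis, long nowMillis) {
    long next = nowMillis + retryIntervalMillis;
    long deadline = firstFailureMillis + giveUpAfterMillis;
    return next <= deadline ? next : -1L;
  }

  public static void main(String[] args) {
    // Retry every 5 minutes, give up one hour after the first failure.
    RetrySchedule schedule = new RetrySchedule(5 * 60_000L, 60 * 60_000L);
    System.out.println(schedule.nextAttempt(0L, 0L));
  }
}
```

If parked content routinely takes longer to reach permanent storage than the
give-up window allows, retries run out, which would be consistent with the
"Repeated service interruptions" failure reported below.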

Please let me know if the fix I have committed works for you.

Karl



On Fri, Jul 14, 2017 at 5:41 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> Sorry for not explaining the issue in detail.
>
> (1)   Is it likely to go away or not on a retry;
>
> The DM_PLATFORM_E_INTEGER_CONVERSION_ERROR and DM_OBJECT_E_LOAD_INVALID_STRING_LEN
> errors are not likely to go away on an immediate retry.
>
> (2)   Does it substantially impact the ability of ManifoldCF to properly
> process the document;
>
> The impact is that someone needs to monitor the indexing and, if it stops
> on these errors, use restart-minimal to start the indexing
> again.
>
> (3) Is it generally acceptable to skip ALL documents where the error
> occurs.
>
> Yes. Those errors occur for a large number of documents, and it is
> tedious for the user to restart the indexing each time. Total document
> count: 700000+
>
> DM_OBJECT_E_LOAD_INVALID_STRING_LEN - 11147
>
> DM_PLATFORM_E_INTEGER_CONVERSION_ERROR - 21708
>
> I'm not sure whether these errors are common in Documentum or are due to
> improper Documentum configuration or maintenance. We have
> encountered them on a couple of Documentum instances in lower
> environments (not validated on production).
>
>
>
> The Documentum repository errors DM_PLATFORM_E_INTEGER_CONVERSION_ERROR
> and DM_OBJECT_E_LOAD_INVALID_STRING_LEN are of type DfException, thrown
> from the getObjectByQualification method in
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
>
>
>
> We made a fix that prints the error to the log (of the Documentum server
> process) and returns null.
>
>     catch (DfException e)
>     {
>       e.printStackTrace();
>       return null;
>       //throw new DocumentumException("Documentum error: "+e.getMessage());
>     }
>
>
>
>
>
> In the run() method of the ProcessDocumentThread inner class in the
> org.apache.manifoldcf.crawler.connectors.DCTM.DCTM file, I added a null
> check before continuing with the document processing.
>
>       try
>       {
>         IDocumentumObject object = session.getObjectByQualification("dm_document where i_chronicle_id='" + documentIdentifier +
>           "' and any r_version_label='CURRENT'");
>         if (object != null) {
>           …
>         }
>       }
>       catch (Throwable e)
>       {
>         this.exception = e;
>       }
>
>
>
> The DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED error occurs very rarely.
> It happens when an uploaded document is parked in an interim BOCS and is
> moved to the repository a short time later.
>
> If indexing happens during that gap, the properties are accessible but the
> document content is not yet available, which causes the error. The fix for
> this is not yet complete.
>
> The code that raises this error is shared below.
>
> The run() method of the ProcessDocumentThread inner class in
> org.apache.manifoldcf.crawler.connectors.DCTM.DCTM:
>
>           try
>           {
>             strFilePath = object.getFile(objFileTemp.getCanonicalPath());
>           }
>           catch (DocumentumException dfe)
>           {
>             // Fetch failed, so log it
>             activityStatus = "NOCONTENT";
>             activityMessage = dfe.getMessage();
>             if (dfe.getType() != DocumentumException.TYPE_NOTALLOWED)
>               throw dfe;
>             return;
>           }
>
>
>
> The getFile method in
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumObjectImpl:
>
>
>
>     catch (DfException dfe)
>     {
>       // Can't decide what to do without looking at the exception text.
>       // This is crappy but it's the best we can manage, apparently.
>       String errorMessage = dfe.getMessage();
>       if (errorMessage.indexOf("[DM_CONTENT_E_CANT_START_PULL]") == -1)
>         // Treat it as transient, and retry
>         throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_SERVICEINTERRUPTION);
>       // It's probably not a transient error.  Report it as an access violation, even though it
>       // may well not be.  We don't have much info as to what's happening.
>       throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_NOTALLOWED);
>     }
>
>
>
> Discarding uncrawlable documents and continuing with the indexing
> process seems more sensible than stalling it. If you feel it is
> worth including, kindly add the required coding exception.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Friday, July 14, 2017 12:36 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: Documentum job stops on error
>
>
>
> Hi Tamizh,
>
>
>
> For any repository errors, ManifoldCF needs to know the following:
>
> (1) Is it likely to go away or not on a retry;
>
> (2) Does it substantially impact the ability of ManifoldCF to properly
> process the document;
>
> (3) Is it generally acceptable to skip ALL documents where the error
> occurs.
>
>
>
> In this case your underlying error seems quite worrying:
>
>
>
> [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
> temporarily parked on a BOCS server host. It will be available when it is
> moved to a permanent storage area."
>
> I could imagine that many or most documents are in fact in that state, in
> which case nothing can really be crawled?
>
>
>
> I'm happy to make coding exceptions in the Documentum connector for
> discarding uncrawlable documents, but only if it makes sense to do that.
> Here it is not clear at all that we'd want to change MCF to throw away all
> documents with this problem.  It sounds instead like there's some
> significant Documentum configuration issue to me.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Fri, Jul 14, 2017 at 2:39 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi Team,
>
>
>
> The behavior below is observed when using the ManifoldCF Documentum connector.
>
>
>
> On any Documentum-specific error, the application throws the
> error and the job stops abruptly. Is there any specific reason for this
> approach?
>
> Can we handle these errors by logging them, ignoring the document,
> and continuing the indexing?
>
>
>
> Please find below the sample errors causing the job to fail.
>
>
>
> Documentum error: [DM_PLATFORM_E_INTEGER_CONVERSION_ERROR]error:  "The
> server was unable to convert the following string (String Unavailable) to
> an integer or long."
>
>
>
> Caused by: org.apache.manifoldcf.crawler.common.DCTM.DocumentumException:
> Documentum error: [DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error:  "Error
> loading object: invalid string length 0 found in input stream"
>
>
>
> Error: Repeated service interruptions - failure processing document:
> [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
> temporarily parked on a BOCS server host. It will be available when it is
> moved to a permanent storage area."
>
>
>
> Kindly provide your suggestion on this.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
>
>
