manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Wed, 30 Jan 2013 13:33:44 GMT
Hi Andrew,

On Wed, Jan 30, 2013 at 8:21 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> Hi Karl,
>
> I finally had a chance to go back to this and here's what I found.
>
> Documentum was returning "pdf" and "pdftext" for the content type, not
> a full mime type, so as an experiment I added these to the list of
> allowed mime types in the ElasticSearch configuration for the job.
>
> This time, it got slightly further -- the corresponding documents now
> show as "Success" instead of "REJECTED" in the job history.
>

So you saw events of type "Indexation" in the history for these
documents, with a result of "success"?  If that is the case, then the
ElasticSearch connector thinks it handed the documents to the
ElasticSearch server successfully.

> However, they don't show up in ElasticSearch, and there's nothing in
> the ES logs or console to indicate that ManifoldCF ever even tried to
> connect. It's like it's just dropping them and declaring the job a
> success.

This is unlikely; it indexes other documents fine, right?  The simple
history entry means that the connector tried and thinks it succeeded
in sending the documents to ES.

Here's the code:

    HttpClient client = getSession();
    ElasticSearchConfig config = getConfigParameters(null);
    InputStream inputStream = document.getBinaryStream();
    long startTime = System.currentTimeMillis();
    // Hand the document content to the ElasticSearch server
    ElasticSearchIndex oi = new ElasticSearchIndex(client, documentURI,
        document, inputStream, config);
    // Record the outcome as an "Indexation" event in the simple history
    activities.recordActivity(startTime, ELASTICSEARCH_INDEXATION_ACTIVITY,
      document.getBinaryLength(), documentURI, oi.getResult().name(),
      oi.getResultDescription());
    if (oi.getResult() != Result.OK)
      return DOCUMENTSTATUS_REJECTED;
    return DOCUMENTSTATUS_ACCEPTED;

If you see events corresponding to this, it means that the indexing
took place as far as the connector knows.  Can you post the exact
simple history row(s) you are seeing?
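
In the meantime, one way to confirm what's actually in the index is to
query ES directly for a document count before and after the run.  Here's
a minimal sketch using the same HttpClient library; the host, port, and
index name ("localhost:9200" and "index") are placeholders you would
replace with your own ES settings:

    import java.io.IOException;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class ESCountCheck
    {
      public static void main(String[] args) throws IOException
      {
        // Placeholder URL -- substitute your ES host, port, and index name
        String url = "http://localhost:9200/index/_count";
        HttpClient client = new DefaultHttpClient();
        HttpGet get = new HttpGet(url);
        HttpResponse response = client.execute(get);
        // The response body is JSON containing a "count" field
        System.out.println(EntityUtils.toString(response.getEntity()));
      }
    }

If the count doesn't move while the job history reports success, that
would point at something between the connector and the index (a wrong
index name, for instance) rather than at document filtering.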

>
> On the other hand, there are lots of messages like this in the MCF log:
>
> WARN 2013-01-30 13:08:16,431 (Worker thread '12') - Pre-ingest service
> interruption reported for job 1358442009776 connection 'Documentum
> RoW': Job no longer active
>
> Any idea if this could be related?
>

Those messages simply mean that worker threads got interrupted when
you paused or aborted a job, and are harmless.

Karl

>
> On 21 January 2013 12:29, Karl Wright <daddywri@gmail.com> wrote:
>> Logging output is a function of each connector, and unfortunately the
>> documentum connector has pretty limited logging.
>>
>> The extension exclusions are unlikely to be in play because the
>> Documentum connector does not use them.  So it would be only mime type
>> and length.  You should be able to check both of these properties of
>> specific documents you are missing in the Document Webtop UI.
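>>
>> If you have DFC access, you can also check the format-to-mime-type
>> mapping directly with a DQL query against dm_format.  A rough sketch,
>> untested, assuming an existing IDfSession named "session" (session
>> setup and exception handling omitted):
>>
>>     import com.documentum.fc.client.DfQuery;
>>     import com.documentum.fc.client.IDfCollection;
>>     import com.documentum.fc.client.IDfQuery;
>>
>>     // Look up the mime types registered for the 'pdf' and 'pdftext' formats
>>     IDfQuery query = new DfQuery();
>>     query.setDQL("SELECT name, mime_type FROM dm_format" +
>>         " WHERE name IN ('pdf','pdftext')");
>>     IDfCollection rows = query.execute(session, IDfQuery.DF_READ_QUERY);
>>     while (rows.next())
>>       System.out.println(rows.getString("name") + " -> "
>>           + rows.getString("mime_type"));
>>     rows.close();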
>>
>> Karl
>>
>>   @Override
>>   public boolean checkLengthIndexable(String outputDescription, long length)
>>       throws ManifoldCFException, ServiceInterruption
>>   {
>>     // Reject anything over the job's configured maximum file size
>>     ElasticSearchSpecs specs = getSpecsCache(outputDescription);
>>     long maxFileSize = specs.getMaxFileSize();
>>     if (length > maxFileSize)
>>       return false;
>>     return super.checkLengthIndexable(outputDescription, length);
>>   }
>>
>>   @Override
>>   public boolean checkDocumentIndexable(String outputDescription,
>>       File localFile)
>>       throws ManifoldCFException, ServiceInterruption
>>   {
>>     // Reject anything whose file extension is not on the allowed list
>>     ElasticSearchSpecs specs = getSpecsCache(outputDescription);
>>     return specs
>>         .checkExtension(FilenameUtils.getExtension(localFile.getName()));
>>   }
>>
>>   @Override
>>   public boolean checkMimeTypeIndexable(String outputDescription,
>>       String mimeType) throws ManifoldCFException, ServiceInterruption
>>   {
>>     // Reject anything whose mime type is not on the allowed list
>>     ElasticSearchSpecs specs = getSpecsCache(outputDescription);
>>     return specs.checkMimeType(mimeType);
>>   }
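>>
>> Note that checkMimeType compares the content type string it is handed
>> against the job's allowed list, so if Documentum reports "pdf" while
>> the list only contains "application/pdf", the document is rejected.
>> To illustrate the idea (this set-based matching is my sketch, not the
>> connector's actual implementation):
>>
>>     import java.util.Arrays;
>>     import java.util.HashSet;
>>     import java.util.Set;
>>
>>     public class MimeCheckSketch
>>     {
>>       public static void main(String[] args)
>>       {
>>         // Allowed list as configured on the job's ES tab
>>         Set<String> allowed = new HashSet<String>(
>>             Arrays.asList("application/pdf"));
>>         // Exact string comparison: "pdf" does not match "application/pdf"
>>         System.out.println(allowed.contains("pdf"));             // false
>>         System.out.println(allowed.contains("application/pdf")); // true
>>       }
>>     }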
>>
>>
>> On Mon, Jan 21, 2013 at 6:50 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>> Just to clarify that last post, I haven't disabled any of the allowed
>>> mime types for ES, so as long as they're not something really weird it
>>> should be fine.
>>>
>>> It could also be a file extension problem (ES has an "allowed file
>>> extensions" setting too) -- is there a way to get that level of
>>> information about each document out of MCF?
>>>
>>> Can you enable verbose logging somehow to see what type, size and
>>> extension each processed document was?
>>>
>>> On 21 January 2013 11:47, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> So, the only content types in Documentum are "pdf" and "pdftext".
>>>>
>>>> "application/pdf" is enabled in the ES tab in the job config. (I
>>>> assume they both map to application/pdf -- how would I check for
>>>> sure?)
>>>>
>>>> And my max file size is 16777216000, which is waaaay bigger than
>>>> any of the rejected documents.
>>>>
>>>> Sadly it's still rejecting them all.
>>>>
>>>>
>>>> On 21 January 2013 11:33, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>> Close, it's ElasticSearch. Okay, I'll play around with these, thanks.
>>>>>
>>>>> On 21 January 2013 11:26, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> The reason for rejection has to do with the criteria you provide
>>>>>> for the job.  Specifically:
>>>>>>
>>>>>>                   if (activities.checkLengthIndexable(fileLength) &&
>>>>>>                       activities.checkMimeTypeIndexable(contentType))
>>>>>>                   {
>>>>>> ...
>>>>>>
>>>>>> These are provided by your output connection; in there you may specify
>>>>>> what mime types and what file length cutoff you want.  From the fact
>>>>>> that you get these, I am guessing it's a Solr connection.  These
>>>>>> criteria typically show up on tabs for the job definition.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to set up a fairly simple crawl where I pull documents
>>>>>>> from Documentum and push them into ElasticSearch, using the 1.0.1
>>>>>>> binary release with all appropriate extras for Documentum added.
>>>>>>>
>>>>>>> The repository connection looks fine -- in the job config I can
>>>>>>> see the paths, document types, content types etc. as expected.
>>>>>>>
>>>>>>> Also the ES output connection looks fine; it reports "connection
>>>>>>> working".
>>>>>>>
>>>>>>> However, when I do a crawl, every document it attempts to ingest
>>>>>>> shows this in the job history:
>>>>>>>
>>>>>>> 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431
>>>>>>>
>>>>>>> (date, time, activity, identifier, result code, bytes, time)
>>>>>>>
>>>>>>> How can I go about diagnosing what's causing this?
>>>>>>>
>>>>>>> I can't see anything suspect in the ManifoldCF stdout or log, and
>>>>>>> there's nothing in the Documentum server process or registry
>>>>>>> process output or logs either.
>>>>>>>
>>>>>>> Any ideas how I'd go about diagnosing this?
>>>>>>>
>>>>>>> The Documentum server is on a remote machine administered by a
>>>>>>> different team that I don't have direct access to, so any tips
>>>>>>> for things I could try at my end before escalating it to them
>>>>>>> would be particularly useful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>
>>>
>>>
>>> --
>>>
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>
>
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
