manifoldcf-user mailing list archives

From Andrew Clegg <andrew.cl...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Mon, 21 Jan 2013 11:50:04 GMT
Just to clarify that last post, I haven't disabled any of the allowed
mime types for ES, so as long as they're not something really weird it
should be fine.

Unless it's a file extension problem (ES also has "allowed file
extensions"). Is there a way to get that level of information about
each document out of MCF?

Can you enable verbose logging somehow to see what type, size and
extension each processed document was?
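
To make the question concrete, here is a rough standalone sketch of the kind
of two-part check Karl quoted, written so that it says which condition
failed. It is not ManifoldCF code: the class name RejectionCheck, the method
explain, and the example values are all hypothetical, and the real
checkLengthIndexable / checkMimeTypeIndexable logic lives in the output
connector and may well differ.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the length and MIME type checks quoted in the
// thread. None of these names come from ManifoldCF itself.
public class RejectionCheck {
    private final long maxLength;               // long, since the limit can exceed the int range
    private final Set<String> allowedMimeTypes; // lower-case MIME types enabled for the job

    public RejectionCheck(long maxLength, Set<String> allowedMimeTypes) {
        this.maxLength = maxLength;
        this.allowedMimeTypes = allowedMimeTypes;
    }

    // Returns "OK" or a message naming the first condition that failed.
    public String explain(long fileLength, String contentType) {
        if (fileLength > maxLength) {
            return "REJECTED: length " + fileLength + " exceeds " + maxLength;
        }
        String normalized = contentType == null
                ? ""
                : contentType.trim().toLowerCase(Locale.ROOT);
        if (!allowedMimeTypes.contains(normalized)) {
            return "REJECTED: content type '" + contentType + "' not allowed";
        }
        return "OK";
    }

    public static void main(String[] args) {
        RejectionCheck check = new RejectionCheck(
                16777216000L, Collections.singleton("application/pdf"));
        // A proper MIME type passes both checks.
        System.out.println(check.explain(6264431L, "application/pdf"));
        // A bare format name such as "pdf" fails the MIME type check.
        System.out.println(check.explain(6264431L, "pdf"));
    }
}
```

If the content type reached the check as the raw Documentum format name
rather than a MIME type (purely an assumption on my part, I haven't confirmed
it), the second call shows the sort of rejection that would result. Output
like this per document is what I'd hope verbose logging could give me.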

On 21 January 2013 11:47, Andrew Clegg <andrew.clegg@gmail.com> wrote:
> So, the only content types in Documentum are "pdf" and "pdftext".
>
> "application/pdf" is enabled in the ES tab in the job config. (I
> assume they both map to application/pdf -- how would I check for
> sure?)
>
> And my max file size is 16777216000 which is waaaay bigger than any of
> the rejected documents.
>
> Sadly it's still rejecting them all.
>
>
> On 21 January 2013 11:33, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>> Close, it's ElasticSearch. Okay, I'll play around with these, thanks.
>>
>> On 21 January 2013 11:26, Karl Wright <daddywri@gmail.com> wrote:
>>> Hi Andrew,
>>>
>>> The reason for rejection has to do with the criteria you provide for
>>> the job.  Specifically:
>>>
>>>                   if (activities.checkLengthIndexable(fileLength) &&
>>>                       activities.checkMimeTypeIndexable(contentType))
>>>                   {
>>> ...
>>>
>>> These are provided by your output connection; in there you may specify
>>> what mime types and what file length cutoff you want.  From the fact
>>> that you get these, I am guessing it's a Solr connection.  These
>>> criteria typically show up on tabs for the job definition.
>>>
>>> Karl
>>>
>>> On Mon, Jan 21, 2013 at 4:52 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I'm trying to set up a fairly simple crawl where I pull documents from
>>>> Documentum and push them into ElasticSearch, using the 1.0.1 binary
>>>> release with all appropriate extras for Documentum added.
>>>>
>>>> The repository connection looks fine -- in the job config I can see
>>>> the paths, document types, content types etc. as expected.
>>>>
>>>> Also the ES output connection looks fine, it reports "connection working".
>>>>
>>>> However, when I do a crawl, every document it attempts to ingest shows
>>>> this in the job history:
>>>>
>>>> 01-18-2013 17:36:24.279 fetch 0902620580069898 REJECTED 6264431
>>>>
>>>> (date, time, activity, identifier, result code, bytes, time)
>>>>
>>>> How can I go about diagnosing what's causing this?
>>>>
>>>> I can't see anything suspect in the ManifoldCF stdout or log, and
>>>> there's nothing in the Documentum server process or registry process
>>>> output or logs either.
>>>>
>>>> Any ideas how I'd go about diagnosing this?
>>>>
>>>> The Documentum server is on a remote machine administered by a
>>>> different team, that I don't have direct access to, so any tips for
>>>> things I could try at my end before escalating it to them would be
>>>> particularly useful.
>>>>
>>>> Thanks,
>>>>
>>>> Andrew.
>>
>>
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
