manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour
Date Mon, 28 May 2018 16:47:04 GMT
This sounds potentially like a problem in Tika, but in order to be sure I
would need a complete stack trace, not just a piece of one.
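
One way to collect a complete trace from a large, repetitive log is sketched below. It is a minimal sketch, not ManifoldCF or Tika tooling: the function name extract_traces is hypothetical, and it assumes the log uses the usual Java layout in which an exception header line is followed by indented "at ..." frames (plus optional "Caused by:" and "... N more" lines).

```python
import re

# Assumption: frames follow the standard Java layout, for example
# "    at pkg.Class.method(File.java:12) ~[?:?]".
FRAME = re.compile(r"^\s*(at\s+\S|\.\.\. \d+ more|Caused by:)")


def extract_traces(lines, needle):
    """Return every complete trace (header line plus frames) containing `needle`."""
    traces, current, header = [], [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if FRAME.match(line):
            # The first frame of a trace pulls in the preceding non-frame
            # line, which normally names the exception and its message.
            if not current and header is not None:
                current.append(header)
            current.append(line)
        else:
            # A non-frame line ends the trace that was being collected.
            if current:
                traces.append(current)
                current = []
            header = line
    if current:
        traces.append(current)
    return [t for t in traces if any(needle in l for l in t)]
```

Because the same error repeats many times, the first match is usually enough to post. Assuming the log path is passed as a hypothetical variable log_path, something like `with open(log_path, errors="replace") as f: print("\n".join(extract_traces(f, "PDFStreamEngine.processOperator")[0]))` would print one full trace.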

If it is a Tika issue, it should appear reliably on the same document,
again and again.

Is there any way you can crawl ONLY one of the documents that got blocked?
I suspect that when you paused and restarted, you just postponed the
problem and it will happen again.

Karl


On Mon, May 28, 2018 at 9:50 AM msaunier <msaunier@citya.com> wrote:

> Hello Karl,
>
>
>
> In ManifoldCF 2.9, at the end of every job, several documents (around
> twenty) remain blocked.
>
> A single error appears and spams the logs with several gigabytes in a
> short time, which fills up the servers:
>
>
>
> [?:?]
>                at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]
>                at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]
>                at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:231) ~[?:?]
>
>
>
> If I pause the job and restart it, the documents are sent and it works. But if
> I'm not there, we have problems.
>
>
>
> Do you know this problem, and do you have a solution? Is it a bad
> configuration?
>
>
>
> Thank you.
>
