manifoldcf-user mailing list archives

From msaunier <msaun...@citya.com>
Subject RE: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour
Date Tue, 29 May 2018 16:16:19 GMT
I'm checking whether I can find a non-private file.

 

Thanks,

Maxence,

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, 29 May 2018 18:14
To: user@manifoldcf.apache.org
Subject: Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

 

This is indeed a Tika bug, or a bug in the underlying PDFBox code it uses.

 

In order to make progress, we need a sample document that demonstrates the problem.  Once
we have that, I can open a Tika ticket.

 

Thanks,

Karl

 

 

On Tue, May 29, 2018 at 12:06 PM msaunier <msaunier@citya.com> wrote:

Hello Karl,

 

PS: at the moment, I have 24 blocked documents: 20 with status "Processing" and 4 with status
"About to Process".

 

I tested, and the same documents get blocked every time. I then copied the files and ran
tika-app.jar on them locally, and I get this error for those files:

 

WARN  Invalid XObject Subtype: null
WARN  Invalid XObject Subtype: null
WARN  Invalid XObject Subtype: null
…
WARN  Invalid XObject Subtype: null
WARN  Invalid XObject Subtype: null
WARN  Invalid XObject Subtype: null
WARN  Invalid XObject Subtype: null
Exception in thread "main" java.lang.StackOverflowError
        at java.util.zip.Inflater.<init>(Inflater.java:102)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:99)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
        at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject.getContents(PDFormXObject.java:144)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:493)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
…
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
        at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
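
The repeating group of five frames in this trace (processOperator, processStreamOperators, processTransparencyGroup, showTransparencyGroup, DrawObject.process) is the pattern one would expect if a form XObject ends up drawing itself, directly or through another form. The thread does not confirm that this is the cause, so treat it as an assumption. The sketch below is a toy model in plain Java, not PDFBox code: `FormCycleSketch`, `countDraws`, and the form names `Fm0`, `Fm1`, `Fm2` are all hypothetical. It only illustrates how tracking the forms currently being drawn turns unbounded recursion into an error that can be reported.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Toy model of nested form drawing. Hypothetical; this is not PDFBox code. */
public class FormCycleSketch {

    /** Thrown when a form is reached again while it is still being drawn. */
    static class CyclicFormException extends RuntimeException {
        CyclicFormException(String message) {
            super(message);
        }
    }

    /**
     * Recursively "draws" the form called name and returns the number of
     * form draws performed. The map draws is a stand-in for content streams:
     * it maps a form name to the forms that its content draws.
     * The deque active holds the forms currently being drawn, outermost first.
     */
    static int countDraws(Map<String, List<String>> draws, String name, Deque<String> active) {
        if (active.contains(name)) {
            // Without this check, a cyclic reference would recurse until the stack overflows.
            throw new CyclicFormException(
                    "cycle: " + String.join(" -> ", active) + " -> " + name);
        }
        active.addLast(name);
        int count = 1;
        for (String child : draws.getOrDefault(name, List.of())) {
            count += countDraws(draws, child, active);
        }
        active.removeLast();
        return count;
    }

    public static void main(String[] args) {
        // Acyclic case: Fm0 draws Fm1 and Fm2, and Fm1 draws Fm2.
        Map<String, List<String>> acyclic =
                Map.of("Fm0", List.of("Fm1", "Fm2"), "Fm1", List.of("Fm2"));
        System.out.println(countDraws(acyclic, "Fm0", new ArrayDeque<>()));

        // Cyclic case: Fm0 draws Fm1, which draws Fm0 again.
        Map<String, List<String>> cyclic =
                Map.of("Fm0", List.of("Fm1"), "Fm1", List.of("Fm0"));
        try {
            countDraws(cyclic, "Fm0", new ArrayDeque<>());
        } catch (CyclicFormException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Run as-is, the program prints `4` for the acyclic case and then `cycle: Fm0 -> Fm1 -> Fm0`.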

 

If I open the file with Edge, it displays correctly.

 

Any idea?

 

Thanks,

Maxence,

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Monday, 28 May 2018 18:47
To: user@manifoldcf.apache.org
Subject: Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour

 

This sounds potentially like a problem in Tika, but in order to be sure I would need a complete
stack trace, not just a piece of one.

If it is a Tika issue, it should appear reliably on the same document, again and again.

 

Is there any way you can crawl ONLY one of the documents that got blocked?  I suspect that
when you paused and restarted, you just postponed the problem and it will happen again.
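
One way to check whether the failure is tied to specific documents is to run the extraction over each blocked file separately and record which ones overflow the stack. The sketch below is only an illustration of that idea: `IsolateFailingDocs`, `findOverflowing`, and `fakeParse` are hypothetical names, and `fakeParse` is a stand-in that recurses forever on file names containing "bad". In real use it would be replaced by the actual extraction call, which is not shown here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Runs a parser over documents one at a time and reports the ones that overflow the stack. */
public class IsolateFailingDocs {

    /** Returns the documents for which the given parser threw StackOverflowError. */
    static List<Path> findOverflowing(List<Path> docs, Consumer<Path> parser) {
        List<Path> failing = new ArrayList<>();
        for (Path doc : docs) {
            try {
                parser.accept(doc);
            } catch (StackOverflowError e) {
                // The stack has unwound by the time control reaches this handler,
                // so the loop can record the document and move on to the next one.
                failing.add(doc);
            }
        }
        return failing;
    }

    /** Placeholder parser (assumption): never returns normally for names containing "bad". */
    static void fakeParse(Path doc) {
        if (doc.toString().contains("bad")) {
            fakeParse(doc); // unbounded recursion, ends in StackOverflowError
        }
    }

    public static void main(String[] args) {
        List<Path> docs = new ArrayList<>();
        if (args.length == 0) {
            // Demo input with made-up file names.
            docs.add(Paths.get("ok.pdf"));
            docs.add(Paths.get("bad.pdf"));
        } else {
            for (String arg : args) {
                docs.add(Paths.get(arg));
            }
        }
        System.out.println(findOverflowing(docs, IsolateFailingDocs::fakeParse));
    }
}
```

A document that fails every time it is run this way is a good candidate for the sample needed to confirm that the problem is reproducible.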

 

Karl

 

 

On Mon, May 28, 2018 at 9:50 AM msaunier <msaunier@citya.com> wrote:

Hello Karl,

 

In ManifoldCF 2.9, at the end of every job, around twenty documents remain blocked.

A single error appears and floods the logs with several gigabytes in a short time, which fills
up the servers:

 

[?:?]

               at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:231) ~[?:?]
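
When the same few frames are written to the log over and over, a per-frame count is often easier to share than the raw file. The following is a minimal sketch, assuming the log is plain text with one stack frame per line starting with `at `. The class name `FrameSummary` and the choice to drop a trailing ` ~[...]` marker when it sits on the same line are illustrative, not part of ManifoldCF or Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

/** Counts how often each stack frame occurs in a log, in first-seen order. */
public class FrameSummary {

    static Map<String, Integer> countFrames(Stream<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        logLines.forEach(line -> {
            String trimmed = line.trim();
            if (trimmed.startsWith("at ")) {
                // Drop a trailing location marker such as "~[?:?]" when present.
                int marker = trimmed.indexOf(" ~[");
                String frame = marker >= 0 ? trimmed.substring(0, marker) : trimmed;
                counts.merge(frame, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // Demo input built from frames quoted in this thread.
            Stream<String> sample = Stream.of(
                    "at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]",
                    "at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]",
                    "at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]");
            countFrames(sample).forEach((frame, n) -> System.out.println(n + "  " + frame));
            return;
        }
        // Files.lines streams the file lazily, so a multi-gigabyte log is not loaded into memory.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            countFrames(lines).forEach((frame, n) -> System.out.println(n + "  " + frame));
        }
    }
}
```

Passing a log file path as the first argument prints one line per distinct frame with its count. Lines that do not start with `at ` are ignored.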

 

If I pause the job and restart it, the documents are sent and everything works. But if I'm not
there to do that, we have problems.

 

Do you know this problem, and do you have a solution? Is it a bad configuration?

 

Thank you.

