tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop
Date Wed, 16 May 2018 14:14:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477486#comment-16477486 ]

Tim Allison commented on TIKA-2643:

bq. The tricky part is I cannot attach a debugger against this call within MapReduce job over
the cluster. 

Ugh.  Right. Of course.  Anything more you can do with logging?  I didn't read through your
logs well enough, but can you confirm that the hang is happening during parseToString() and
not immediately after it?
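
One way to check this, as a rough sketch only: log immediately before and after the call and flush each line, so the last line in the task log shows whether control ever came back. CallTracer and traced are hypothetical names, not part of Tika, and the Callable is just a stand-in for the parseToString() call.

```java
import java.util.concurrent.Callable;

final class CallTracer {

    /** Logs entry and exit around the call so a hang inside it can be told apart from one after it. */
    static <T> T traced(String label, Callable<T> call) {
        long start = System.nanoTime();
        log("ENTER " + label);
        try {
            T result = call.call();
            log("EXIT  " + label + " after " + elapsedMillis(start) + " ms");
            return result;
        } catch (Exception e) {
            log("FAIL  " + label + " after " + elapsedMillis(start) + " ms: " + e);
            // Wrapped only to keep the sketch short; real code would rethrow the original type.
            throw new RuntimeException(e);
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    private static void log(String message) {
        // Include the thread name and flush so the line survives if the task is later killed.
        System.err.println(Thread.currentThread().getName() + " " + message);
        System.err.flush();
    }

    public static void main(String[] args) {
        // Stand-in for the real call, e.g. tika.parseToString(stream).
        String text = traced("parseToString", () -> "extracted text");
        System.out.println(text.length());
    }
}
```

If the log ends with an ENTER line and no matching EXIT or FAIL, the hang is inside the call; if EXIT is there, the problem is in whatever runs next.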

Without understanding your full framework, I can't think of what might be causing this with
any accuracy. :)

Some things that have caused permanent hangs for me in the past:
1) not clearing stderr/stdout from a child process
2) infinite loops in parsers 
3) blocking IO that, well, blocks
4) calling take() instead of poll() on an ExecutorCompletionService, so the caller blocks until a task completes
5) well, more generally, calling any of the blocking methods on theoretically concurrent/non-blocking
objects (ArrayBlockingQueue, etc.) instead of calling the non-blocking alternatives
6) Not quite a permanent hang, but crazy churn caused by multithreaded garbage collection
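
To illustrate 4) and 5), here is a minimal sketch of waiting with a timeout instead of blocking indefinitely. BoundedWait and runWithTimeout are hypothetical names, and the String-returning task is an assumption standing in for whatever work might hang; this is not how Tika itself runs parsers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BoundedWait {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Returns the task's result, or null if it did not finish in time.
     */
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            ExecutorCompletionService<String> ecs = new ExecutorCompletionService<>(pool);
            ecs.submit(task);
            // take() would block forever if the task never completes;
            // poll(timeout, unit) returns null once the timeout elapses.
            Future<String> done = ecs.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return done == null ? null : done.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            // Interrupts the worker; a task that ignores interrupts may keep running.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        System.out.println(runWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 200));
    }
}
```

Note that a timeout only frees the caller. A worker stuck in code that ignores interrupts can keep running, so this bounds the wait rather than fixing the underlying hang.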

I don't think this is the fault of the parser (2 above).  We can see from the logs that the
parser is making at least some progress into the file.

Do any of the above look like candidates for you?
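
If 1) looks plausible, the usual remedy is to read the child's stdout and stderr on separate threads so the child never blocks writing to a full pipe. A minimal sketch, assuming the child is started with ProcessBuilder; ChildOutput, drain and runAndDrain are hypothetical names, and nothing in the logs so far confirms that a child process is involved.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

final class ChildOutput {

    /** Reads the stream to EOF on a daemon thread and completes with its contents as UTF-8 text. */
    static CompletableFuture<String> drain(InputStream in) {
        CompletableFuture<String> result = new CompletableFuture<>();
        Thread reader = new Thread(() -> {
            try (InputStream s = in) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = s.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                result.complete(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            } catch (IOException e) {
                result.completeExceptionally(e);
            }
        });
        reader.setDaemon(true);
        reader.start();
        return result;
    }

    /** Starts the command, drains stdout and stderr concurrently, then waits for exit. */
    static int runAndDrain(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).start();
        // Start both readers before waitFor(); otherwise a chatty child can fill a pipe and stall.
        CompletableFuture<String> out = drain(p.getInputStream());
        CompletableFuture<String> err = drain(p.getErrorStream());
        int exit = p.waitFor();
        System.out.print(out.join());
        System.err.print(err.join());
        return exit;
    }
}
```

Buffering everything in memory is fine for short output; for a long-running child it would be safer to forward each chunk to a logger as it arrives.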

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to process a
> pdf file from Cloudera Hadoop 5.8 (observed on 5.4 too). It can process some other pdf files
> on the same cluster. I am attaching the file and the syslog as well as stdout logs. Interesting
> that the same file can be processed fine over a Hortonworks cluster.
> This issue is a blocker for us to make our feature based on Tika available to Cloudera
> cluster, a major flavor of Hadoop, so your timely attention would be very much appreciated.

