tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop
Date Mon, 21 May 2018 15:03:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482586#comment-16482586 ]

Ken Krugler commented on TIKA-2643:
-----------------------------------

When you've got conflicting jars on the classpath, you often run into this kind of
intermittent behavior, because the order in which the jars are loaded isn't guaranteed.
So the same job can run fine one time and fail another time, due to some subtle change
in the environment.
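
One way to check for this kind of conflict is to ask the class loader for every
location that provides a given class; more than one hit means the load order decides
which copy wins. The following is only a minimal sketch: the class name
DuplicateClassFinder and the helper locationsOf are made up for illustration, and
org.apache.pdfbox.pdmodel.PDDocument is just an example default (nothing in this
issue identifies which library actually conflicts).

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika or Hadoop.
public class DuplicateClassFinder {

    /** Returns every classpath location that provides the given class. */
    static List<URL> locationsOf(String className) throws IOException {
        // Classes are stored as resources named like "org/example/Foo.class".
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // getResources returns all matches, not just the first one that would be loaded.
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        // Assumed example class; pass the class you suspect as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        List<URL> hits = locationsOf(name);
        System.out.println(name + " found in " + hits.size() + " location(s):");
        for (URL url : hits) {
            System.out.println("  " + url);
        }
        if (hits.size() > 1) {
            System.out.println("Multiple copies: which one is used depends on classpath order.");
        }
    }
}
```

To be meaningful this would have to run with the same classpath as the failing task
(for example from inside the mapper), since the cluster's own jars are part of what
may conflict.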

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, hs_err_pid32104.log, testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to process a
> PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can process some other PDF files
> on the same cluster. I am attaching the file, the syslog, and the stdout logs. Interestingly,
> the same file is processed without problems on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on Cloudera
> clusters, a major flavor of Hadoop, so your timely attention would be very much appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
