tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1776) tika stop converting at this pdf document
Date Tue, 20 Oct 2015 12:47:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965053#comment-14965053 ]

Tim Allison commented on TIKA-1776:
-----------------------------------

I'm not able to reproduce this on Windows or RHEL with PDFBox's app or with tika-app 1.10
using the same call that you are.

Are you able to reproduce this outside of ruby?

Tika does hang forever sometimes. It happens very rarely, and we need to fix it when it does,
but anyone calling Tika needs to be aware of this and protect against it.

You could also try calling the actual tika-batch code via the app:
{{java -jar tika-app.jar -i <input_dir> -o <output_dir>}}

That should automatically restart the process if it runs into a hang.
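
For callers that keep the one-file-per-process invocation, one way to protect against a hang
is to put a time limit on the child process and kill it when the limit is exceeded. The sketch
below is only an illustration, not part of Tika or of the reporter's class: the helper name
{{run_with_timeout}} and the timeout value are assumptions. It also drains stdout and stderr
in separate threads as a general precaution, so that a full pipe cannot block the child.

```ruby
require 'open3'

# Hypothetical helper (an assumption for illustration, not a Tika API):
# runs cmd_args without a shell, returns the child's stdout as a String,
# or nil if the child had to be killed after timeout_s seconds.
def run_with_timeout(cmd_args, timeout_s)
  Open3.popen3(*cmd_args) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Read both pipes concurrently so neither can fill up and stall the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join(limit) returns nil if the child is still running after the limit.
    timed_out = wait_thr.join(timeout_s).nil?
    if timed_out
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just before the kill; nothing to do.
      end
      wait_thr.join # reap the child
    end

    output = out_reader.value
    err_reader.value
    timed_out ? nil : output
  end
end
```

With the command from the report, a call might look like
{{run_with_timeout(['java', '-Djava.awt.headless=true', '-jar', 'tika-app.jar', '--xml', path], 120)}},
where {{path}} is the PDF to convert and 120 seconds is an assumed limit based on the reported
1-2 minutes for large files. A nil result then marks a file to skip or retry later.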

> tika stop converting at this pdf document
> -----------------------------------------
>
>                 Key: TIKA-1776
>                 URL: https://issues.apache.org/jira/browse/TIKA-1776
>             Project: Tika
>          Issue Type: Bug
>          Components: batch
>    Affects Versions: 1.10
>         Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
>            Reporter: tranquillo
>
> Hi and thank you all for this great project,
> I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs
> and convert them from PDF to XML. That works pretty well and needs at most 1-2 minutes even
> for big files. But for over 15 hours now the process has been hanging with CPU load = 0% at one file:
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
> which is just 5 MB large, but contains text, scans and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
> class TikaApp
>     def initialize(document)
>         filename = File.basename(document)
>         t = Time.now
>         puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>         @document = document
>         java_cmd = 'java'
>         java_args = '-server -Djava.awt.headless=true'
>         tika_path = "tika-app.jar"
>         @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>     end
>     def get_xml
>         run_tika('--xml')
>     end
>     def get_metadata
>         run_tika('--metadata --json')
>     end
>     private
>     def run_tika(option)
>         final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>         pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>         stdout_result = stdout.read.strip
>         stderr_result = stderr.read.strip
>         unless strip_stderr(stderr_result).empty?
>         end
>         stdout_result
>     ensure
>         stdin.close
>         stdout.close
>         stderr.close
>     end
>     def strip_stderr(s)
>         s.gsub(/^(info|warn) - .*$/i, '').strip
>     end
> end
> ----------
> The tika command with this function looks like this: 
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
