tika-dev mailing list archives

From "tranquillo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1776) tika stop converting at this pdf document
Date Tue, 20 Oct 2015 16:47:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965374#comment-14965374 ]

tranquillo commented on TIKA-1776:
----------------------------------

Thank you, Tim, for your quick response.

You are right: Tika does not seem to be the problem, because running it on the console
converts the file. It also prints a lot of errors when everything is output to the console,
but I think that's OK.

This issue can be closed; I will try to call Tika another way.

Sorry for the inconvenience.

> tika stop converting at this pdf document
> -----------------------------------------
>
>                 Key: TIKA-1776
>                 URL: https://issues.apache.org/jira/browse/TIKA-1776
>             Project: Tika
>          Issue Type: Bug
>          Components: batch
>    Affects Versions: 1.10
>         Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
>            Reporter: tranquillo
>
> Hi, and thank you all for this great project.
> I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs
> and convert them from PDF to XML. That works pretty well and takes at most 1-2 minutes even
> for big files. But for over 15 hours the process has been hanging with CPU load = 0% at one file:
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
> which is just 5 MB in size, but contains text, scans and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
> class TikaApp
>     def initialize(document)
>         filename = File.basename(document)
>         t = Time.now
>         puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>         @document = document
>         java_cmd = 'java'
>         java_args = '-server -Djava.awt.headless=true'
>         tika_path = "tika-app.jar"
>         @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>     end
>     def get_xml
>         run_tika('--xml')
>     end
>     def get_metadata
>         run_tika('--metadata --json')
>     end
>     private
>     def run_tika(option)
>         final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>         pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>         stdout_result = stdout.read.strip
>         stderr_result = stderr.read.strip
>         unless strip_stderr(stderr_result).empty?
>         end
>         stdout_result
>     ensure
>         stdin.close
>         stdout.close
>         stderr.close
>     end
>     def strip_stderr(s)
>         s.gsub(/^(info|warn) - .*$/i, '').strip
>     end
> end
> ----------
> The Tika command built by this function looks like this:
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
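
One possible explanation for the hang, not confirmed anywhere in this thread: run_tika reads stdout to the end before it touches stderr. If the child process writes more to stderr than the pipe buffer holds (and Tika reportedly prints a lot of errors for this file), the child blocks on stderr while the parent blocks on stdout, which would match a process sitting at 0% CPU. A minimal sketch of "another way" to call Tika, assuming Ruby's standard Open3 library is acceptable in place of open4; the names run_capturing and tika_xml are hypothetical and not part of the original class:

```ruby
require 'open3'

# Hypothetical helper (not part of the original class): runs a command
# without a shell and returns [stdout, stderr, Process::Status].
# Open3.capture3 reads both pipes concurrently, so heavy stderr output
# cannot fill a pipe while the parent is still waiting on stdout.
def run_capturing(*argv)
  stdout_str, stderr_str, status = Open3.capture3(*argv)
  [stdout_str.strip, stderr_str.strip, status]
end

# Hypothetical stand-in for run_tika('--xml'); the jar path is an assumption.
def tika_xml(document, tika_path = 'tika-app.jar')
  out, err, status = run_capturing(
    'java', '-server', '-Djava.awt.headless=true',
    '-jar', tika_path, '--xml', File.expand_path(document)
  )
  # Only surface stderr when the JVM exits with a non-zero status.
  warn err unless status.success?
  out
end
```

Usage would look like `puts tika_xml('~/data/00149624.pdf')`. Passing the arguments as a list avoids shell quoting altogether, and File.expand_path resolves the leading `~`, which a shell does not expand inside single quotes.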



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
