tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Tika command line performance
Date Fri, 15 Jan 2010 19:37:30 GMT

On Jan 15, 2010, at 11:27am, Doug Carter wrote:

> On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
>> On Jan 15, 2010, at 11:07am, Doug Carter wrote:
>>> Hi all,
>>> This may be off-topic for this list, but I need to start somewhere.
>>> I need a command line utility to do document format conversion, in a
>>> batch mode environment. The batch process is a combination of steps,
>>> one of which is the actual format conversion which is currently being
>>> done by a collection of Linux binary converters like wvWare,
>>> pdftohtml, etc.
>>> I've put a shell script wrapper around the tika jar:
>>> java -jar tika-app.jar [infile] > [outfile]
>>> This works OK, but as you would imagine, it is much slower compared
>>> to a Linux binary.
>>> Does anyone know of a way to improve the performance in a setup like
>>> this? I know it goes against the whole philosophy of Java, but is
>>> there a way to compile the Tika jar byte code into a native Linux
>>> binary? I've taken a look at gcj, but it doesn't look like a simple
>>> re-compile.
>>> Any ideas would be greatly appreciated.
>> If you have a set of documents, easiest would be to pass in a
>> directory to tika-app (extend it a bit) so that one invocation of the
>> JVM processes many documents.
> Hi Ken,
> I've considered something like this (for the exact reason you stated)
> but I don't have that flexibility with my current setup. Each document
> needs to go through a series of processing steps, one of which is the
> format conversion.

In that case, another cheesy solution is to have the Java process  
watch a specific directory. Whenever a new file (with the appropriate  
name format) appears, it gets processed. This Java process then  
continues to run indefinitely as a kind of processing daemon.

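You can avoid hand-off problems (the Java process picking up a file
that is still being written) by using a name pattern: write the file
under a different name, and rename it to match the pattern only when
it's really ready for processing.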
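
As a rough illustration of the watch-directory idea and the rename
hand-off, here is a minimal polling sketch. Everything named in it is
an assumption made for the example, not part of tika-app: the class
WatchDirSketch, the Converter hook, the ".ready" / ".failed" / ".txt"
suffixes and the 500 ms poll interval are all hypothetical. The
Converter is a placeholder so the sketch has no dependencies; in a real
setup it would presumably wrap a Tika call.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class WatchDirSketch {

    /** Hypothetical hook; a real setup would plug the Tika conversion in here. */
    public interface Converter {
        String convert(Path input) throws Exception;
    }

    /**
     * Converts every "*.ready" file currently in inDir and returns how many
     * succeeded. Files with any other name are ignored, so a producer can
     * write under a temporary name and rename to "*.ready" when finished.
     */
    public static int processReady(Path inDir, Path outDir, Converter converter)
            throws IOException {
        // Snapshot the matching names first, so we do not modify the
        // directory while still iterating over it.
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inDir, "*.ready")) {
            for (Path p : stream) {
                ready.add(p);
            }
        }

        int converted = 0;
        for (Path input : ready) {
            String name = input.getFileName().toString();
            String base = name.substring(0, name.length() - ".ready".length());
            try {
                String text = converter.convert(input);
                // Same hand-off trick on the output side: write under a
                // temporary name, then rename to the final name.
                Path tmp = outDir.resolve(base + ".tmp");
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                Files.move(tmp, outDir.resolve(base + ".txt"),
                        StandardCopyOption.REPLACE_EXISTING);
                Files.delete(input);
                converted++;
            } catch (Exception e) {
                // Rename failed inputs so they are not retried on every poll.
                Files.move(input, inDir.resolve(base + ".failed"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return converted;
    }

    /** The long-running "daemon" loop: poll, convert, sleep, repeat. */
    public static void runForever(Path inDir, Path outDir, Converter converter)
            throws Exception {
        while (true) {
            processReady(inDir, outDir, converter);
            Thread.sleep(500); // assumed poll interval
        }
    }
}
```

In this sketch the upstream step would write, say, report.doc.partial
and rename it to report.doc.ready once the write is complete; the JVM
stays up between documents, so startup cost is paid only once. Polling
is used only to keep the example short; java.nio.file.WatchService is
an alternative on Java 7 and later.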

There are lots of cleaner, more sophisticated systems involving  
notification systems, queues, RESTful services, etc. which might be  
more appropriate, depending on your needs.

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
