tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Tika command line performance
Date Fri, 15 Jan 2010 19:19:31 GMT

On Jan 15, 2010, at 11:07am, Doug Carter wrote:

> Hi all,
> This may be off-topic for this list, but I need to start somewhere.
> I need a command line utility to do document format conversion, in a
> batch mode environment. The batch process is a combination of steps,
> one of which is the actual format conversion which is currently being
> done by a collection of Linux binary converters like wvWare,
> pdftohtml, etc.
> I've put a shell script wrapper around the tika jar:
>  java -jar tika-app.jar [infile] > [outfile]
> This works OK, but as you would imagine, it is much slower compared to
> a Linux binary.
> Does anyone know of a way to improve the performance in a setup like
> this? I know it goes against the whole philosophy of Java, but is
> there a way to compile the Tika jar byte code into a native Linux
> binary? I've taken a look at gcj, but it doesn't look like a simple
> re-compile.
> Any ideas would be greatly appreciated.

If you have a set of documents, the easiest fix would be to pass a
directory to tika-app (you'd have to extend it a bit) so that one
invocation of the JVM processes many documents. That way the JVM
startup cost is paid once per batch rather than once per file.
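
As a rough illustration of the one-JVM-per-batch idea, here is a minimal
sketch of a directory-driven wrapper. This is not tika-app code: the
class name BatchConvert, the Converter hook, and the ".txt" output
naming are all made up for the example, and the converter used in main
is a placeholder that just copies the file contents. In a real setup
you would plug the Tika call in there (for example via the Tika facade
class, assuming it is on your classpath).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchConvert {

    /** Hypothetical hook: turns one input file into output text. */
    interface Converter {
        String convert(Path input) throws Exception;
    }

    /**
     * Converts every regular file directly under inDir (no recursion) and
     * writes the result to outDir as "<original name>.txt".
     * Returns the number of files converted successfully.
     */
    static int convertAll(Path inDir, Path outDir, Converter converter) throws IOException {
        Files.createDirectories(outDir);

        // Collect the input list up front so the directory stream is closed
        // before any output is written.
        List<Path> inputs;
        try (Stream<Path> entries = Files.list(inDir)) {
            inputs = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        int converted = 0;
        for (Path input : inputs) {
            Path output = outDir.resolve(input.getFileName().toString() + ".txt");
            try {
                String text = converter.convert(input);
                Files.write(output, text.getBytes(StandardCharsets.UTF_8));
                converted++;
            } catch (Exception e) {
                // One bad document should not abort the whole batch.
                System.err.println("Failed to convert " + input + ": " + e);
            }
        }
        return converted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BatchConvert <indir> <outdir>");
            System.exit(1);
        }
        // Placeholder converter (assumption): reads the file as UTF-8 text.
        // Replace this lambda with the actual Tika parsing call.
        Converter placeholder = p -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8);

        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), placeholder);
        System.out.println("Converted " + n + " file(s)");
    }
}
```

The shell script would then call this once per batch (java BatchConvert
indir outdir) instead of once per document. Failures are logged and
skipped so that a single unparseable file does not stop the run.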

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
