tika-dev mailing list archives

From Doug Carter <dcar...@mercycorps.org>
Subject Re: Tika command line performance
Date Fri, 15 Jan 2010 19:27:34 GMT
On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
> On Jan 15, 2010, at 11:07am, Doug Carter wrote:
> >
> >Hi all,
> >
> >This may be off-topic for this list, but I need to start somewhere.
> >
> >I need a command line utility to do document format conversion, in a
> >batch mode environment. The batch process is a combination of steps,
> >one of which is the actual format conversion which is currently being
> >done by a collection of Linux binary converters like wvWare,
> >pdftohtml, etc.
> >
> >I've put a shell script wrapper around the tika jar:
> >
> > java -jar tika-app.jar [infile] > [outfile]
> >
> >This works OK, but as you would imagine, it is much slower compared to
> >a Linux binary.
> >
> >Does anyone know of a way to improve the performance in a setup like
> >this? I know it goes against the whole philosophy of Java, but is
> >there a way to compile the Tika jar byte code into a native Linux
> >binary? I've taken a look at gcj, but it doesn't look like a simple
> >re-compile.
> >
> >Any ideas would be greatly appreciated.
>
> If you have a set of documents, easiest would be to pass in a
> directory to tika-app (extend it a bit) so that one invocation of the
> JVM processes many documents.

Hi Ken,

I've considered something like this (for the exact reason you stated)
but I don't have that flexibility with my current setup. Each document
needs to go through a series of processing steps, one of which is the
format conversion.

Thanks for the idea, though.
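
For anyone reading the archive who does have the flexibility to batch,
the "one JVM invocation, many documents" idea from the quoted reply
could be sketched roughly as below. This is only an illustration, not
how tika-app itself is written: the class name BatchConvert, the method
convertAll and the ".txt" output naming are all made up for the
example. The extractor is passed in as a function so the sketch runs
with the JDK alone; with Tika on the classpath one would presumably
swap in a call such as the Tika facade's parseToString(File), which is
an assumption to check against the Tika version in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

// Hypothetical batch wrapper: pays the JVM startup cost once per directory
// instead of once per document.
class BatchConvert {

    /**
     * Runs the extractor on every regular file directly inside inDir and
     * writes each result to outDir/<original file name>.txt.
     * Returns the number of files converted. Subdirectories are skipped.
     */
    static int convertAll(Path inDir, Path outDir, Function<Path, String> extractor)
            throws IOException {
        Files.createDirectories(outDir);
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir)) {
            for (Path in : files) {
                if (!Files.isRegularFile(in)) {
                    continue;
                }
                String text = extractor.apply(in);
                Path out = outDir.resolve(in.getFileName() + ".txt");
                Files.write(out, text.getBytes(StandardCharsets.UTF_8));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BatchConvert <indir> <outdir>");
            return;
        }
        // Placeholder extractor: reads the file as UTF-8 text.
        // Assumption: with Tika available this would be replaced by
        // something like new Tika().parseToString(path.toFile()),
        // with its checked exceptions wrapped in the same way.
        Function<Path, String> extractor = path -> {
            try {
                return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), extractor);
        System.err.println("converted " + n + " file(s)");
    }
}
```

Invoked as "java BatchConvert indir outdir", this starts the JVM once
for the whole directory. It does not help when each document has to
pass through the other pipeline steps one at a time, which is the
constraint described above.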

