tika-dev mailing list archives

From Doug Carter <dcar...@mercycorps.org>
Subject Re: Tika command line performance
Date Fri, 15 Jan 2010 20:21:16 GMT
On Fri, Jan 15, 2010 at 11:37:30AM -0800, Ken Krugler wrote:
> 
> On Jan 15, 2010, at 11:27am, Doug Carter wrote:
> 
> >On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
> >>
> >>On Jan 15, 2010, at 11:07am, Doug Carter wrote:
> >>
> >>>
> >>>Hi all,
> >>>
> >>>This may be off-topic for this list, but I need to start somewhere.
> >>>
> >>>I need a command line utility to do document format conversion, in a
> >>>batch mode environment. The batch process is a combination of steps,
> >>>one of which is the actual format conversion which is currently being
> >>>done by a collection of Linux binary converters like wvWare,
> >>>pdftohtml, etc.
> >>>
> >>>I've put a shell script wrapper around the tika jar:
> >>>
> >>>java -jar tika-app.jar [infile] > [outfile]
> >>>
> >>>This works OK, but as you would imagine, it is much slower compared
> >>>to a Linux binary.
> >>>
> >>>Does anyone know of a way to improve the performance in a setup like
> >>>this? I know it goes against the whole philosophy of Java, but is
> >>>there a way to compile the Tika jar byte code into a native Linux
> >>>binary? I've taken a look at gcj, but it doesn't look like a simple
> >>>re-compile.
> >>>
> >>>Any ideas would be greatly appreciated.
> >>
> >>If you have a set of documents, the easiest approach would be to pass
> >>a directory to tika-app (extending it a bit) so that one invocation
> >>of the JVM processes many documents.
> >
> >Hi Ken,
> >
> >I've considered something like this (for the exact reason you stated)
> >but I don't have that flexibility with my current setup. Each document
> >needs to go through a series of processing steps, one of which is the
> >format conversion.
> 
> In that case, another cheesy solution is to have the Java process  
> watch a specific directory. Whenever a new file (with the appropriate  
> name format) appears, it gets processed. This Java process then  
> continues to run indefinitely as a kind of processing daemon.
> 
> You can avoid hand-off problems by using a name pattern, and renaming  
> the file when it's really ready for processing.
> 
> There are lots of cleaner, more sophisticated systems involving  
> notification systems, queues, RESTful services, etc. which might be  
> more appropriate, depending on your needs.

Interesting approach. Thanks for the idea.

Doug
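
For anyone reading this thread later, the directory-watching idea Ken describes might be sketched roughly as below. This is only an illustration under stated assumptions, not code from Tika or from anyone in the thread: the class name ConversionDaemon, the ".tmp"/".ready"/".done"/".out" suffixes, and the one-second polling interval are all made up here, and the converter is a placeholder that just reads the file as text where a real setup would call into Tika's parsing code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical long-running converter: one JVM start, many documents.
class ConversionDaemon {
    // Assumed hand-off convention: producers write "name.tmp" and rename it
    // to "name.ready" only once the file is completely written.
    static final String READY_SUFFIX = ".ready";

    /** One polling pass: converts every *.ready file in dir and returns the outputs written. */
    public static List<Path> processReady(Path dir, Function<Path, String> converter)
            throws IOException {
        // Snapshot the matching files first, so the renames below do not
        // interfere with the directory iteration.
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + READY_SUFFIX)) {
            for (Path p : stream) {
                ready.add(p);
            }
        }
        List<Path> outputs = new ArrayList<>();
        for (Path in : ready) {
            String name = in.getFileName().toString();
            String base = name.substring(0, name.length() - READY_SUFFIX.length());
            Path out = dir.resolve(base + ".out");
            Files.write(out, converter.apply(in).getBytes(StandardCharsets.UTF_8));
            // Rename the input so the next pass does not pick it up again.
            Files.move(in, dir.resolve(base + ".done"), StandardCopyOption.REPLACE_EXISTING);
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder converter (assumption): reads the file as UTF-8 text.
        // A real deployment would call the document parser here instead.
        Function<Path, String> placeholder = p -> {
            try {
                return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        while (true) {
            processReady(dir, placeholder);
            Thread.sleep(1000); // assumed polling interval
        }
    }
}
```

On the producer side, the earlier pipeline step would write to something like report.tmp and then rename it to report.ready, so the daemon never sees a half-written file. Calling processReady once and exiting, instead of looping, gives the one-invocation-per-directory variant Ken suggested first.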

