tika-dev mailing list archives

From "Keith R. Bennett" <kbenn...@bbsinc.biz>
Subject Moving Functionality from CLI to ParseUtils
Date Sat, 04 Jul 2009 19:56:46 GMT

Hi, all.  Long time no talk.  I had been working part time and on a kind of
sabbatical, during which I abandoned Java in favor of studying Ruby and
attending and organizing BarCamps.

About three months ago, I started a new job, working with Java again.  The
need to extract structured data from Excel spreadsheets arose, and I wrote a
script that called Tika to manage the parsing.

In the process, I think I identified some possible improvements to Tika. It
would be nice to simplify one of the simplest use cases, where you want Tika
to parse a document using default configurations and to specify its output
destination.
There is a very general mechanism for parsing in CLI, but it is not possible
to override the output stream default (stdout), and it is awkward to call it
from a program rather than on the command line.  I have two suggestions:

1) Make the output destination a configuration option (a command line option
that defaults to stdout, perhaps "-o").  Although it's easy to redirect
output on the command line, it's not quite so simple when that command is
called from a script that itself may be redirected.  Also, when the command
is executed from within another program, there may be issues as well.

2) Move the methods that do the work to ParseUtils, and leave only a thin
command line wrapper around them in CLI.  It would be helpful for scripts and
Java programs to have these easy to use methods available too.  It seems
wasteful to force the caller to construct a command line to do this.
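
As a rough illustration of how the two suggestions could fit together, here
is a minimal sketch.  Everything in it is hypothetical: the class name
OutputOptionSketch, the TextExtractor interface (a stand-in for whatever
Tika parsing call would actually be used), and the methods parseTo and run
are not existing Tika APIs.  The "-o" flag is only the option name floated
above.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class OutputOptionSketch {

    /** Hypothetical stand-in for the parsing step; NOT a Tika interface. */
    interface TextExtractor {
        void extract(InputStream in, Writer out) throws IOException;
    }

    /**
     * Library-style method (suggestion 2): the caller supplies the
     * destination, so no command line has to be constructed.
     */
    static void parseTo(TextExtractor extractor, InputStream in, OutputStream out)
            throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        extractor.extract(in, writer);
        writer.flush(); // flush only; the caller owns and closes the stream
    }

    /**
     * Thin command line wrapper (suggestion 1): reads an optional
     * "-o <file>" argument, falls back to defaultOut (stdout in a real
     * CLI), and delegates the work to parseTo.
     */
    static void run(String[] args, TextExtractor extractor,
                    InputStream in, OutputStream defaultOut) throws IOException {
        String outPath = null;
        for (int i = 0; i < args.length; i++) {
            if ("-o".equals(args[i]) && i + 1 < args.length) {
                outPath = args[++i]; // consume the file name after -o
            }
        }
        if (outPath == null) {
            parseTo(extractor, in, defaultOut);
        } else {
            try (OutputStream fileOut = new FileOutputStream(outPath)) {
                parseTo(extractor, in, fileOut);
            }
        }
    }
}
```

A Java program or script would call parseTo directly with any OutputStream
(a file, a buffer, a socket), while the command line entry point would call
run with System.in and System.out as the defaults.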

What do you think?


View this message in context: http://www.nabble.com/Moving-Functionality-from-CLI-to-ParseUtils-tp24337541p24337541.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
