tika-dev mailing list archives

From "Keith R. Bennett" <kbenn...@bbsinc.biz>
Subject Moving Functionality from CLI to ParseUtils
Date Sat, 04 Jul 2009 19:56:46 GMT

Hi, all.  Long time no talk.  I had been working part time and on a kind of
sabbatical, during which I abandoned Java in favor of studying Ruby and
Clojure, and attending and organizing BarCamps.

About three months ago, I started a new job, working with Java again.  The
need to extract structured data from Excel spreadsheets arose, and I wrote
a JRuby script that called Tika to manage the parsing.

In the process, I think I identified some possible improvements to Tika.
It would be nice to streamline one of the simplest use cases: parsing a
document with the default configuration while specifying the output stream.

There is a very general mechanism for parsing in CLI, but it does not let
you override the default output stream (stdout), and it is awkward to call
from a program rather than from the command line.  I have two suggestions:

1) Make the output destination a configuration option (a command line
parameter, perhaps "-o") that defaults to stdout.  Although it's easy to
redirect output on the command line, it's not quite so simple when the
command is called within a script that itself may be redirected.  There may
also be issues when the command is executed from within another program.

2) Move the methods that do the work to ParseUtils, and leave only a thin
command line wrapper around them in CLI.  It would be helpful for scripts
and Java programs to have these easy-to-use methods available too.  It
seems wasteful to force the caller to construct a command line to do this.
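
To make the two suggestions concrete, here is a rough sketch of the shape I
have in mind.  It is not Tika's actual API: the class name, the Extractor
interface, and the parseTo, outputPath, and run methods are all hypothetical
names made up for illustration, and the extractor in main is a placeholder
that just copies its input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OutputOptionSketch {

    /** Hypothetical stand-in for the real parsing work; not a Tika interface. */
    public interface Extractor {
        void extract(InputStream in, OutputStream out) throws IOException;
    }

    /**
     * Library-style entry point (the role suggested for ParseUtils):
     * the caller, not the method, decides where the output goes.
     */
    public static void parseTo(Extractor extractor, InputStream in, OutputStream out)
            throws IOException {
        extractor.extract(in, out);
        out.flush();
    }

    /** Returns the value following "-o", or null when the option is absent. */
    public static String outputPath(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-o".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Thin command line wrapper: picks the stream, then delegates. */
    public static void run(String[] args, Extractor extractor, InputStream in)
            throws IOException {
        String path = outputPath(args);
        if (path == null) {
            // Default destination is stdout, which we leave open.
            parseTo(extractor, in, System.out);
        } else {
            try (OutputStream out = Files.newOutputStream(Paths.get(path))) {
                parseTo(extractor, in, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: OutputOptionSketch [-o OUTPUT] INPUT");
            return;
        }
        // Placeholder extractor that copies bytes unchanged; a real one
        // would hand the input to a parser and write the extracted text.
        Extractor copy = (in, out) -> {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        };
        // Assumption for this sketch: the input file is the last argument.
        try (InputStream in = Files.newInputStream(Paths.get(args[args.length - 1]))) {
            run(args, copy, in);
        }
    }
}
```

With this split, a script or Java program calls parseTo directly with
whatever OutputStream it likes, and only the command line path ever looks
at "-o".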

What do you think?

Cheers,
Keith

-- 
View this message in context: http://www.nabble.com/Moving-Functionality-from-CLI-to-ParseUtils-tp24337541p24337541.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

