tika-dev mailing list archives

From keithrbennett <keithrbenn...@gmail.com>
Subject Re: Moving Functionality from CLI to ParseUtils
Date Sat, 11 Jul 2009 17:53:51 GMT

Jukka -

Having pluggable parts, as you suggest, is definitely the
way to go for optimum power and flexibility.  However, IMHO,
for the simplest use cases, and for beginning users,
this approach may discourage and complicate Tika's use.
I suggest an alternate simplified interface (see below)
for these uses/users.

Renovating the entrance gate to Tika-land in this way
could result in an increase in the number of
beginning users, who continue on to be advanced users, 
and hopefully developers. A larger installed base could
then result in attracting more resources to the project, human
and otherwise.

* * *

It's been a while since I worked on Tika, and it's evolved in the
meantime, so I'm not very adept at it these days.

As such, let me use this to the project's advantage, and let you know
what I would value in Tika as a new user.

For the simple cases, I would suggest hiding things like parser
implementations, metadata objects, and content handlers.  The simplest
cases with document type autodetection could be handled by:

parse(InputStream inputStream, OutputStream outputStream)

Then, to specify the document type explicitly, we could add a MIME type string:

parse(InputStream inputStream, OutputStream outputStream, 
	String mimeType)

I realize that this approach is not very efficient with multiple
documents, since there is setup work that needs to be done for each
document, but it is probably not an issue for most casual users.

Another question: I used Tika to parse an Excel spreadsheet, and it
created an XML file.  How could I insert a handler for parsing
documents with multiple records (such as an Excel spreadsheet), so
that I could, for example, insert each record into a database instead
of writing XML to a file?  Rather than writing a full-blown XML
content handler, I wonder if we could simplify it to something like

public interface RecordProcessor {
    void processRecord(Object[] fields); // or List
}

... and then have a method like:

parseSpreadsheet(InputStream inputStream, 
	RecordProcessor recordProcessor)
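
To make the callback idea concrete, here is a minimal sketch of how such a
method might hand rows to a RecordProcessor. It is not Tika code: the
class name RecordParsingSketch is made up, the interface takes a List
(the "or List" variant), and tab-separated text stands in for real Excel
parsing, which would be the parser's job.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical callback interface from the proposal; not part of Tika.
interface RecordProcessor {
    void processRecord(List<String> fields);
}

class RecordParsingSketch {

    // Stand-in for the proposed parseSpreadsheet(): reads tab-separated
    // lines (an assumption, not real Excel parsing) and invokes the
    // processor once per non-empty row.
    static void parseSpreadsheet(InputStream inputStream,
            RecordProcessor recordProcessor) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank rows
            }
            // A limit of -1 keeps trailing empty cells.
            recordProcessor.processRecord(Arrays.asList(line.split("\t", -1)));
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream input = new ByteArrayInputStream(
                "id\tname\n1\tAda\n2\tGrace\n".getBytes(StandardCharsets.UTF_8));
        List<List<String>> rows = new ArrayList<>();
        // Collect rows in memory; a database insert could go here instead.
        parseSpreadsheet(input, rows::add);
        System.out.println(rows);
    }
}
```

The caller only supplies the per-record action, so no XML content handler
has to be written for this use case.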

For the above methods, we might also provide convenience methods for
Files, URLs, Strings, etc.
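
As a rough sketch of what such convenience methods could look like, each
overload below just opens the right kind of stream and delegates to the
stream-based method. Everything here is hypothetical: the class name
ConvenienceOverloadsSketch, the parseToString helper, and the reading of
"Strings" as "return the output as a String" are assumptions, and the core
method only copies bytes where a real version would run a Tika parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ConvenienceOverloadsSketch {

    // Placeholder for the core stream-based method. A real version would
    // run a parser here; this stand-in copies the bytes through unchanged.
    static void parse(InputStream inputStream, OutputStream outputStream)
            throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, n);
        }
        outputStream.flush();
    }

    // File convenience: opens and closes both streams for the caller.
    static void parse(File input, File output) throws IOException {
        try (InputStream in = Files.newInputStream(input.toPath());
             OutputStream out = Files.newOutputStream(output.toPath())) {
            parse(in, out);
        }
    }

    // URL convenience: opens the URL's stream and closes it afterwards.
    static void parse(URL input, OutputStream output) throws IOException {
        try (InputStream in = input.openStream()) {
            parse(in, output);
        }
    }

    // String convenience: collects the output in memory and returns it.
    static String parseToString(InputStream input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        parse(input, out);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parseToString(in));
    }
}
```

Keeping all the real work in the single stream-based method means the
overloads stay thin and only differ in how they obtain their streams.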

IMHO, having extremely simple methods like these would make it more
likely for new users to attempt to use Tika, and to succeed in using it.

I realize everyone's busy, and my time is limited too; this is just a
wish list.  Also, to the extent that these suggestions are based on a lack
of understanding of how Tika works, I apologize for that and welcome
any clarification.


Jukka Zitting wrote:
> Instead of a fixed facade like ParseUtils I personally prefer a set of
> components that I can combine in different ways to solve all kinds of
> use cases. For example your case would be easy to solve like this:
>
>     InputStream input = ...; // Where your input is coming from
>     OutputStream output = ...; // Where your output is going to
>     new AutoDetectParser().parse(
>         input, new BodyContentHandler(output), new Metadata());
>
> Of course a static facade method like ParseUtils.parse(File input,
> File output) might be easier for occasional users.
>
> Did you have some specific method signatures in mind?
>
> BR,
> Jukka Zitting

View this message in context: http://www.nabble.com/Moving-Functionality-from-CLI-to-ParseUtils-tp24337541p24442544.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
