tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika use cases
Date Mon, 10 Sep 2007 17:22:22 GMT

On 9/10/07, kbennett <kbennett@bbsinc.biz> wrote:
> It seems to me that options going into the parser are logically different
> from metadata coming out of the parser, and that to maximize the code's
> cohesion (see http://en.wikipedia.org/wiki/Cohesion_%28computer_science%29),
> it would be preferable to express them as two different objects.

There are really two kinds of options that could affect the way a
parser would work. The first kind are generic options like the maximum
amount of memory or time to use, the location of any temporary files
to be used, etc. that don't have any direct relation to the specific
document being parsed. The other kind are parsing hints related to the
parsed document, like the name (and extension) of the file that
contains the document, any MIME headers associated with the document
(for example from an HTTP request or an email body part), etc.

I'd handle the first kind of options separately, as JavaBean
properties (or something similar) on the parser instances. The second
kind, however, is actually more or less accurate metadata about the
document in question, so IMHO it would make perfect sense to pass that
information as part of the metadata argument.
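
To make the split concrete, here is a rough sketch of what I have in
mind. Everything in it is an assumption made up for illustration: the
SketchParser class, the maxBytes property, the plain Map standing in
for a metadata object, and the key names. None of it is existing Tika
code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the Tika code base.
class OptionsVersusHints {

    /** Hypothetical parser whose generic options are bean properties of the instance. */
    static class SketchParser {
        // Generic option: applies to every document this instance parses.
        private int maxBytes = 1024;

        public int getMaxBytes() { return maxBytes; }
        public void setMaxBytes(int maxBytes) { this.maxBytes = maxBytes; }

        /** Reads at most maxBytes and uses, then extends, the caller's metadata. */
        public void parse(InputStream in, Map<String, String> metadata) throws IOException {
            byte[] buffer = new byte[maxBytes];
            int read = in.readNBytes(buffer, 0, maxBytes);

            // Document-specific hint supplied by the caller through the metadata.
            String name = metadata.get("resourceName");
            if (name != null && name.endsWith(".txt")) {
                metadata.put("contentType", "text/plain");
            }
            // Information the parser adds on its own.
            metadata.put("bytesRead", Integer.toString(read));
        }
    }

    /** Configures the parser once, then parses one document with per-document hints. */
    static Map<String, String> demo() {
        SketchParser parser = new SketchParser();
        parser.setMaxBytes(5);                      // option: set on the instance

        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", "notes.txt");  // hint: travels with the document

        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        try {
            parser.parse(in, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the split is that setMaxBytes() is called once per parser
instance, while the "resourceName" entry changes with every document.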

> Also, if the metadata is the only output of the parser (as it appears to be
> in the use case), why not have the parser create the metadata object itself,
> and return it as the return value?  This would seem like a more natural
> interface.

As mentioned above, I think the metadata object could (and should) be
used to pass various parsing "hints" to the parser, and that the
parser can then extend, verify, or correct the given metadata. This
approach also allows one to have a sequence of parsers that
incrementally extract more and more information from the input.
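
As a rough illustration of such a sequence, assuming a made-up Stage
interface and a plain Map in place of a real metadata class (neither
is existing Tika API, and the stage names and keys are hypothetical),
each stage below receives the metadata gathered so far and extends or
corrects it:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the interface and stage names are illustrative only.
class ParserChainSketch {

    /** Hypothetical stage: reads the document and updates the metadata in place. */
    interface Stage {
        void parse(byte[] document, Map<String, String> metadata);
    }

    /** Extends: guesses a type from the file-name hint if none is set yet. */
    static final Stage NAME_GUESSER = (doc, md) -> {
        String name = md.getOrDefault("resourceName", "");
        if (name.endsWith(".txt")) {
            md.putIfAbsent("contentType", "text/plain");
        }
    };

    /** Corrects: overrides the type when the leading bytes say "%PDF". */
    static final Stage MAGIC_CHECKER = (doc, md) -> {
        if (doc.length >= 4 && doc[0] == '%' && doc[1] == 'P'
                && doc[2] == 'D' && doc[3] == 'F') {
            md.put("contentType", "application/pdf");
        }
    };

    /** Extends: records a fact that no caller supplied. */
    static final Stage SIZE_RECORDER =
            (doc, md) -> md.put("length", Integer.toString(doc.length));

    /** Runs every stage in order over the same, growing metadata map. */
    static Map<String, String> run(byte[] document, Map<String, String> hints) {
        Map<String, String> metadata = new LinkedHashMap<>(hints);
        for (Stage stage : List.of(NAME_GUESSER, MAGIC_CHECKER, SIZE_RECORDER)) {
            stage.parse(document, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // A PDF that was saved under a misleading .txt name.
        byte[] doc = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(run(doc, Map.of("resourceName", "report.txt")));
    }
}
```

Here the file name suggests text/plain, the second stage corrects that
to application/pdf based on the content, and the third stage adds the
length. A parser that returned a fresh metadata object could not build
on the caller's hints in this way.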


Jukka Zitting