tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika use cases
Date Fri, 24 Aug 2007 20:56:45 GMT

On 8/24/07, Rida Benjelloun <rida.benjelloun@doculibre.com> wrote:
> I agree with your use case.


I've been thinking about this a bit more. My main concern with the
current Parser classes from Lius Lite is that they always parse the
entire document into an in-memory data structure. This can easily
become a scalability issue, and I'd like to avoid that already at
the design level.

Also, it seems to me that the current regexp and xpath features from
Lius would work better as a layer on top of the parser code instead of
as an integral part of it.

As for my design proposal itself, I think I have a more workable
approach to use cases 1 (extract structured content) and 3 (extract
metadata). It looks like this:

Extract metadata:

    InputStream stream = ...;
    Metadata metadata = new Metadata();
    SomeTikaInterface parser = new SomeTikaClass();
    parser.extractMetadata(stream, metadata);

Extract structured content (and metadata as a side-effect):

    InputStream stream = ...;
    ContentHandler handler = ...; // SAX event handler
    Metadata metadata = new Metadata();
    SomeTikaInterface parser = new SomeTikaClass();
    parser.extractContent(stream, handler, metadata);

In both cases it would be possible to feed existing metadata hints
(like the file name, Content-Type header, or some other similar
information) to the parser through the metadata argument.
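
To make the two calls above more concrete, here is a minimal, self-contained
sketch of how such an interface could look. Everything in it is an assumption
made for illustration: the Parser interface stands in for SomeTikaInterface,
Metadata is reduced to a plain string map, PlainTextParser is a toy
implementation, and the "Content-Type" and "Content-Encoding" keys are
hypothetical hint names. The point is only that content is pushed to the SAX
handler chunk by chunk rather than collected in memory, and that hints placed
in the metadata beforehand can steer the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingParserSketch {

    /** Hypothetical stand-in for the proposed Metadata class: a string map. */
    public static class Metadata {
        private final Map<String, String> values = new HashMap<>();
        public void set(String name, String value) { values.put(name, value); }
        public String get(String name) { return values.get(name); }
    }

    /** Hypothetical shape of "SomeTikaInterface" from the proposal. */
    public interface Parser {
        void extractMetadata(InputStream stream, Metadata metadata) throws IOException;
        void extractContent(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException;
    }

    /** Toy parser for plain text; it never holds more than one buffer of content. */
    public static class PlainTextParser implements Parser {
        @Override
        public void extractMetadata(InputStream stream, Metadata metadata) {
            // Keep a caller-supplied hint if there is one, otherwise fill in a default.
            if (metadata.get("Content-Type") == null) {
                metadata.set("Content-Type", "text/plain");
            }
        }

        @Override
        public void extractContent(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException {
            extractMetadata(stream, metadata); // metadata as a side effect
            // Assumed hint key: a charset name supplied by the caller, UTF-8 otherwise.
            String encoding = metadata.get("Content-Encoding");
            Charset charset = encoding != null ? Charset.forName(encoding) : StandardCharsets.UTF_8;
            Reader reader = new InputStreamReader(stream, charset);
            char[] buffer = new char[4096];
            handler.startDocument();
            int n;
            while ((n = reader.read(buffer)) != -1) {
                handler.characters(buffer, 0, n); // stream each chunk to the handler
            }
            handler.endDocument();
        }
    }

    /** Example handler that counts characters without retaining them. */
    public static class CountingHandler extends DefaultHandler {
        public long count;
        @Override
        public void characters(char[] ch, int start, int length) { count += length; }
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        metadata.set("Content-Encoding", "ISO-8859-1"); // hint fed in by the caller
        InputStream stream =
                new ByteArrayInputStream("Hello, Tika!".getBytes(StandardCharsets.ISO_8859_1));
        CountingHandler handler = new CountingHandler();
        new PlainTextParser().extractContent(stream, handler, metadata);
        System.out.println(handler.count + " characters, type " + metadata.get("Content-Type"));
    }
}
```

Because the handler only ever sees one buffer at a time, features such as the
regexp or XPath extraction could in principle be written as handlers layered
on top of a parser like this, instead of being built into it.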

WDYT? I'd like to start going forward with some code along these
lines, most likely by adapting/refactoring the Lius classes we already


Jukka Zitting
