tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Container Extractor?
Date Wed, 01 Sep 2010 12:16:01 GMT

On Wed, Sep 1, 2010 at 11:54 AM, Nick Burch <nick.burch@alfresco.com> wrote:
> My idea is that you'd pass to this "service" a container file. You'd also
> say if you wanted recursion, and which mime types interest you. The result
> would be say an iterator of input stream, which would probably also let you
> get the filenames and mime types where supported by the container.

The main complexity I see here is what the return values of such a
service would look like, especially if you need to support cases where
the container document is only available as an InputStream (i.e. no
random access). Then you'd either need to use temporary files (or
in-memory buffers) or a callback interface like this one:

    public interface ComponentDocumentHandler {
        // Called once per component document found inside the container.
        void handleComponentDocument(
            InputStream stream, Metadata metadata)
            throws IOException, TikaException;
    }

Such callbacks could be trivially produced by passing a custom Parser
instance through the ParseContext to the package parser. The custom
Parser class should have a parse() method like this:

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Forward the component document to the callback instead of parsing it.
        componentDocumentHandler.handleComponentDocument(stream, metadata);
    }
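
To make the callback idea concrete without depending on Tika itself, here is
a rough, self-contained sketch using only the JDK. It is not Tika code: the
class name ContainerCallbackSketch, the walk(), sampleZip() and collect()
helpers, and the use of a plain Map in place of Tika's Metadata are all
assumptions made for illustration, and the handler is simplified to throw
only IOException. It shows a ZIP container that is available only as a
sequential InputStream being walked entry by entry, with each entry handed
to the handler while the stream is still positioned on it, so no temporary
files or random access are needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ContainerCallbackSketch {

    /** Simplified stand-in for the proposed callback; Map replaces Tika's Metadata. */
    interface ComponentDocumentHandler {
        void handleComponentDocument(InputStream stream, Map<String, String> metadata)
                throws IOException;
    }

    /** Reads the container sequentially and calls the handler once per file entry. */
    static void walk(InputStream container, ComponentDocumentHandler handler)
            throws IOException {
        ZipInputStream zip = new ZipInputStream(container);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", entry.getName());
            // The entry stream is only valid during the callback. Wrap it so
            // that a handler calling close() does not close the whole container.
            InputStream entryStream = new FilterInputStream(zip) {
                @Override
                public void close() {
                    // Intentionally a no-op.
                }
            };
            handler.handleComponentDocument(entryStream, metadata);
        }
    }

    /** Builds a small two-entry ZIP in memory to act as the container. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("alpha".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("docs/b.txt"));
            out.write("beta".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Walks the container and records "name=content" for every component. */
    static List<String> collect(byte[] zipBytes) throws IOException {
        List<String> seen = new ArrayList<>();
        walk(new ByteArrayInputStream(zipBytes), (stream, metadata) -> {
            String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
            seen.add(metadata.get("name") + "=" + text);
        });
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect(sampleZip()));
    }
}
```

The trade-off is visible in walk(): the handler has to consume each stream
before it returns, because the next getNextEntry() call moves on. An
iterator of input streams would need to buffer each entry (in memory or in
a temporary file) to lift that restriction.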

> What do people think? Is this useful? Is this appropriate for Tika? If yes
> to these two, does the rough method signature sound sane?

+1 to having something like this in Tika, as long as we can come up
with a clean API.


Jukka Zitting
