tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika pipelines (was: Tika discussions in Amsterdam)
Date Tue, 08 May 2007 12:47:57 GMT

On 5/4/07, Bertrand Delacretaz <bdelacretaz@apache.org> wrote:
> On 5/3/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> >... * processing pipeline: There was a quick idea on possibly organizing
> > the Tika framework as a pipeline of content detection and extraction
> > components....
> I thought a bit more about that, and maybe a dual-channel pipeline
> structure, with generalized filters, might be interesting [...]

I like the idea, though explicitly passing bytes in a separate channel
might be a bit troublesome and I'm not sure if it's even needed, i.e.
I'm not sure if we will have filtering components that need both the
structured content and the binary stream as a synchronized input.

> By "generalized filters" I mean that the interface to all filters is
> the same, the Tika pipeline doesn't necessarily impose a two-phase
> process, it just chains a series of filters which collaborate to
> analyze the input stream.

Good point. How about a generic interface like this:

    interface Parser {

        void parse(
            InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException;

    }

A parser component would always be given the full input stream as
input. The metadata object would be used both for input and output,
i.e. it could carry the filename and content type information as input
and end up with all sorts of internal metadata as output.

A lightweight component like a content type detector could just fill
in the appropriate metadata fields and produce dummy SAX events (we
could have a baseline class that just produces XHTML meta tags based
on the extracted metadata).
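
To make the baseline class idea concrete, here is a rough sketch. It is
only an illustration under assumptions: Metadata is modelled as a plain
string map, ExtractException as a simple checked exception, and the names
MetadataOnlyParser, ExtensionTypeDetector and the "filename" and
"Content-Type" keys are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-ins for types the interface names but does not define.
class Metadata extends LinkedHashMap<String, String> {}

class ExtractException extends Exception {
    ExtractException(String message) { super(message); }
}

interface Parser {
    void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException;
}

// Baseline class (hypothetical name): a subclass only fills in metadata,
// and the SAX output is a skeleton XHTML document with one <meta>
// element per metadata entry.
abstract class MetadataOnlyParser implements Parser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    protected abstract void fillMetadata(InputStream input, Metadata metadata)
            throws ExtractException, IOException;

    @Override
    public void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException {
        fillMetadata(input, metadata);
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "head", "head", none);
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "name", "name", "CDATA", entry.getKey());
            attrs.addAttribute("", "content", "content", "CDATA", entry.getValue());
            handler.startElement(XHTML, "meta", "meta", attrs);
            handler.endElement(XHTML, "meta", "meta");
        }
        handler.endElement(XHTML, "head", "head");
        handler.startElement(XHTML, "body", "body", none);
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

// Toy detector: derives the content type from the filename that the
// caller passed in as input metadata, without reading the stream.
class ExtensionTypeDetector extends MetadataOnlyParser {
    @Override
    protected void fillMetadata(InputStream input, Metadata metadata) {
        String name = metadata.getOrDefault("filename", "");
        String type = name.endsWith(".txt") ? "text/plain" : "application/octet-stream";
        metadata.put("Content-Type", type);
    }
}
```

A caller would put the filename into the metadata object, call parse()
with any ContentHandler, and read the detected type back from the same
metadata object afterwards.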

Pipeline processing could be implemented by recursively calling other
parsing components. A generic filter component could be used to add
filtering for the binary input stream or the SAX output stream.
Processing could also be branched to alternative parsers until one
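
As a rough sketch of the recursive calling and the SAX output filtering,
something like the following might work. Everything beyond the Parser
signature is an assumption made for illustration: Metadata as a string
map, the "Content-Type" key, and the DelegatingParser,
UpperCaseFilterParser and PlainTextParser classes. XMLFilterImpl is the
standard SAX helper class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-ins for types the interface names but does not define.
class Metadata extends HashMap<String, String> {}

class ExtractException extends Exception {
    ExtractException(String message) { super(message); }
}

interface Parser {
    void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException;
}

// Pipeline step: picks another parser by the content type found in the
// metadata and calls it with the same stream, handler and metadata.
class DelegatingParser implements Parser {
    private final Map<String, Parser> parsersByType = new HashMap<>();

    void register(String contentType, Parser parser) {
        parsersByType.put(contentType, parser);
    }

    @Override
    public void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException {
        String type = metadata.get("Content-Type");
        Parser delegate = parsersByType.get(type);
        if (delegate == null) {
            throw new ExtractException("No parser registered for " + type);
        }
        delegate.parse(input, handler, metadata);
    }
}

// Filter on the SAX output: wraps the caller's handler, then delegates.
// Upper-casing is just a visible toy transformation.
class UpperCaseFilterParser implements Parser {
    private final Parser delegate;

    UpperCaseFilterParser(Parser delegate) {
        this.delegate = delegate;
    }

    @Override
    public void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws ExtractException, IOException, SAXException {
        XMLFilterImpl filter = new XMLFilterImpl() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                char[] upper = new String(ch, start, length)
                        .toUpperCase(Locale.ROOT).toCharArray();
                super.characters(upper, 0, upper.length);
            }
        };
        filter.setContentHandler(handler);
        delegate.parse(input, filter, metadata);
    }
}

// Toy leaf parser: emits the UTF-8 text of the stream as character events.
class PlainTextParser implements Parser {
    @Override
    public void parse(InputStream input, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException {
        handler.startDocument();
        Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8);
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            handler.characters(buffer, 0, n);
        }
        handler.endDocument();
    }
}
```

Since every component shares the one interface, a filter can wrap a
delegating parser just as easily as the other way around.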

We could support multiple passes over a single input stream with the
mark/reset feature.
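
A minimal sketch of the two-pass idea with java.io mark/reset follows.
The TwoPassReader class, the 8-byte sniffing window and the "%PDF-"
check are assumptions picked for the example, not a proposed detection
scheme.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TwoPassReader {
    // Hypothetical sniffing window. The mark is only guaranteed to stay
    // valid while no more than this many bytes are read before reset().
    static final int SNIFF_LIMIT = 8;

    // First pass: peek at the leading bytes, then rewind the stream.
    static String sniffType(InputStream input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(SNIFF_LIMIT);
        try {
            byte[] head = new byte[SNIFF_LIMIT];
            int total = 0;
            int n;
            while (total < head.length
                    && (n = input.read(head, total, head.length - total)) != -1) {
                total += n;
            }
            String prefix = new String(head, 0, total, StandardCharsets.US_ASCII);
            return prefix.startsWith("%PDF-") ? "application/pdf" : "application/octet-stream";
        } finally {
            input.reset();
        }
    }

    // Second pass: consume the whole stream from the beginning.
    static byte[] readAll(InputStream input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = input.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Streams that report markSupported() as false can be wrapped in a
BufferedInputStream first, which does support mark/reset.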


Jukka Zitting
