tika-dev mailing list archives

From "Bertrand Delacretaz" <bdelacre...@apache.org>
Subject Tika pipelines (was: Tika discussions in Amsterdam)
Date Fri, 04 May 2007 09:18:07 GMT
On 5/3/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:

>... * processing pipeline: There was a quick idea on possibly organizing
> the Tika framework as a pipeline of content detection and extraction
> components....

I thought a bit more about that, and maybe a dual-channel pipeline
structure, with generalized filters, would be interesting (ASCII art
below):
         +------------+        +------------+        +------------+
  -------+            +--------+            +--------+            +-------
  -------+    F1      +--------+     F2     +--------+     F3     +-------
         |            |        |            |        |            |
  -------+            +--------+            +--------+            |
         +------------+        +------------+        +------------+

      ----------   Extracted content, events, metadata, filter options

      ----------   Binary data

By dual-channel I mean that each filter outputs extracted content,
metadata and events (language changes, etc.) on one "channel", and *can*
output the binary stream as well, on a (conceptually) separate
channel. The last filter in the chain usually outputs only the first
channel.

I think this might be very useful, for example to chain filters which
have different ways of deciding how to process the input stream, and
to get the aggregated metadata which describes their "decisions"
after they have all examined the input.

Or to insert a filter which only cares about detecting the input
encoding, but doesn't know much about content extraction.

In practice, the dual-channel could be implemented by simply adding a
bytes(...) method to the standard ContentHandler interface - but how
we do it is not too important at this design stage.
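To make the bytes(...) idea concrete, here is a minimal sketch of what such a handler could look like. BinaryAwareHandler and ByteCounter are hypothetical names invented for illustration; only DefaultHandler comes from the standard SAX API, and nothing here is an actual Tika interface:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a ContentHandler variant with a bytes(...) method
// on the binary channel, mirroring characters(...) on the text channel.
abstract class BinaryAwareHandler extends DefaultHandler {
    // Called by a filter that chooses to relay the binary stream downstream.
    public abstract void bytes(byte[] buf, int offset, int length)
            throws SAXException;
}

// Minimal concrete handler: just counts the relayed bytes.
class ByteCounter extends BinaryAwareHandler {
    long count = 0;
    @Override
    public void bytes(byte[] buf, int offset, int length) {
        count += length;
    }
}
```

A filter that relays the binary stream would call bytes(...) on the next handler in the chain, exactly as it would call characters(...) for extracted text.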

By "generalized filters" I mean that the interface to all filters is
the same: the Tika pipeline doesn't necessarily impose a two-phase
process, it just chains a series of filters which collaborate to
analyze the input stream.
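A toy sketch of such a chain, under the assumptions above: every stage implements the same interface, receives the binary channel plus the metadata gathered so far, and may relay the bytes or consume them. All names here (TikaFilter, EncodingDetector, TextExtractor, Pipeline) are invented for illustration and are not Tika APIs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface TikaFilter {
    // Returns the (possibly transformed) binary stream for the next filter,
    // or null if this filter consumes the stream.
    byte[] process(byte[] binary, Map<String, String> metadata,
                   StringBuilder content);
}

// A filter that only cares about detecting the input encoding: it records
// its "decision" as metadata and relays the bytes untouched.
class EncodingDetector implements TikaFilter {
    public byte[] process(byte[] binary, Map<String, String> metadata,
                          StringBuilder content) {
        metadata.put("charset", "UTF-8"); // toy detection, hard-coded
        return binary;
    }
}

// The last filter: uses the aggregated metadata and outputs only the
// first channel (extracted content), ending the binary channel.
class TextExtractor implements TikaFilter {
    public byte[] process(byte[] binary, Map<String, String> metadata,
                          StringBuilder content) {
        String charset = metadata.getOrDefault("charset", "ISO-8859-1");
        content.append(new String(binary, Charset.forName(charset)));
        return null;
    }
}

public class Pipeline {
    static String run(byte[] input, List<TikaFilter> filters) {
        Map<String, String> metadata = new HashMap<>();
        StringBuilder content = new StringBuilder();
        byte[] stream = input;
        for (TikaFilter f : filters) {
            if (stream == null) break; // binary channel already consumed
            stream = f.process(stream, metadata, content);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("héllo".getBytes(StandardCharsets.UTF_8),
                List.of(new EncodingDetector(), new TextExtractor())));
    }
}
```

Note how the EncodingDetector knows nothing about content extraction, yet its decision reaches the TextExtractor through the metadata channel.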

I haven't done many reality checks on this yet, but I think allowing
the binary stream to be relayed to multiple filters in the chain could
help make things more modular, while adding little complexity.

The main idea is to keep as much information as possible available far
into the pipeline, to make filters more independent of each other.

