tika-dev mailing list archives

From "Bertrand Delacretaz" <bdelacre...@apache.org>
Subject [RT] Tika framework usage scenario
Date Wed, 13 Jun 2007 08:40:08 GMT

Hi,

Here are some Random Thoughts about how Tika could be used, mostly
based on (my recollection of) our discussion at ApacheCon.

See also:
http://code.google.com/p/tika/wiki/DesignDiscussion
http://code.google.com/p/tika/wiki/ArchitectureSketch

Comments/flames/etc. are welcome ;-)

Here's my proposed Tika Framework Usage Scenario:

A Pipeline takes an InputStream as input.
(not a Reader, as we might need to try different encodings).

Internally, a Pipeline consists of a series of ContentFilters
connected in a chain.
(details to be defined: encoding and content-type detectors, file
format parsers, etc.).

A Pipeline is created by the PipelineFactory, based on a StreamInfo.

A StreamInfo contains all the relevant info that we have about the
input stream: filename, HTTP headers, encoding, expected language,
configured hints and preferences, etc. In short, everything that can
help the PipelineFactory decide how to set up the Pipeline.

Once its start() method is called, a Pipeline reads the InputStream
and produces ContentEvents.
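
To make the flow above concrete, here is a minimal Java sketch. Everything
in it is an assumption for illustration, not an existing Tika API: the
hint keys ("filename", "encoding"), passing the InputStream and an event
callback to the factory, and the toy rule that only ".txt" streams get a
Pipeline. The proposal itself does not say how events reach the client.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the proposal; none of these types exists in Tika.
class PipelineSketch {

    /** Everything known about the stream before processing starts. */
    static final class StreamInfo {
        final Map<String, String> hints = new HashMap<>();

        StreamInfo with(String key, String value) {
            hints.put(key, value);
            return this;
        }
    }

    /** Marker type for whatever a Pipeline emits. */
    interface ContentEvent {}

    static final class TextEvent implements ContentEvent {
        final String text;

        TextEvent(String text) {
            this.text = text;
        }
    }

    /** Reads its InputStream and emits ContentEvents once started. */
    interface Pipeline {
        void start();
    }

    static final class PipelineFactory {
        /** Chooses a Pipeline based on the StreamInfo (toy rule: plain text only). */
        Pipeline create(StreamInfo info, InputStream in, Consumer<ContentEvent> sink) {
            String filename = info.hints.getOrDefault("filename", "");
            if (!filename.endsWith(".txt")) {
                throw new IllegalArgumentException("no pipeline for: " + filename);
            }
            // The pipeline gets raw bytes, so the encoding hint is applied here
            // rather than being baked into a Reader by the caller.
            Charset charset = Charset.forName(info.hints.getOrDefault("encoding", "UTF-8"));
            return () -> {
                try {
                    sink.accept(new TextEvent(new String(in.readAllBytes(), charset)));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            };
        }
    }
}
```

A client would build a StreamInfo, ask the factory for a Pipeline, and call
start(); a real factory would presumably assemble a chain of ContentFilters
instead of a single decoding step.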

A ContentEvent can be a MetadataEvent, a StreamEvent, a TextEvent or a
TikaInfoEvent.

A MetadataEvent contains extracted metadata (obviously ;-)

The names of metadata properties are standardized, as far as possible
(Dublin Core, etc.).

A StreamEvent encapsulates an InputStream and a StreamInfo, for
example when the original input was a ZIP archive that contains
several binary components. If the client is interested in this event,
it will have to create another Pipeline to process its contents.
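
A rough sketch of that nested processing, under stated assumptions: the
types below are hypothetical stand-ins, the StreamInfo carries only a
filename, and anything that is not a ".zip" is treated as UTF-8 text so
the example stays runnable. Archive reading uses the JDK's
java.util.zip classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; these types are illustrations, not Tika classes.
class NestedStreamSketch {

    static final class StreamInfo {
        final String filename;
        StreamInfo(String filename) { this.filename = filename; }
    }

    interface ContentEvent {}

    static final class TextEvent implements ContentEvent {
        final String text;
        TextEvent(String text) { this.text = text; }
    }

    /** An embedded stream plus what is known about it. */
    static final class StreamEvent implements ContentEvent {
        final InputStream stream;
        final StreamInfo info;
        StreamEvent(InputStream stream, StreamInfo info) {
            this.stream = stream;
            this.info = info;
        }
    }

    interface Pipeline { void start(); }

    /** Toy factory: ZIP archives emit one StreamEvent per entry, the rest is text. */
    static Pipeline create(StreamInfo info, InputStream in, Consumer<ContentEvent> sink) {
        if (info.filename.endsWith(".zip")) {
            return () -> {
                try (ZipInputStream zin = new ZipInputStream(in)) {
                    for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                        byte[] body = zin.readAllBytes(); // reads the current entry only
                        sink.accept(new StreamEvent(
                                new ByteArrayInputStream(body), new StreamInfo(e.getName())));
                    }
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            };
        }
        return () -> {
            try {
                sink.accept(new TextEvent(new String(in.readAllBytes(), StandardCharsets.UTF_8)));
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        };
    }

    /** Client side: on a StreamEvent, create another Pipeline for its contents. */
    static List<String> extractText(StreamInfo info, InputStream in) {
        List<String> out = new ArrayList<>();
        create(info, in, event -> {
            if (event instanceof StreamEvent) {
                StreamEvent se = (StreamEvent) event;
                out.addAll(extractText(se.info, se.stream));
            } else if (event instanceof TextEvent) {
                out.add(((TextEvent) event).text);
            }
        }).start();
        return out;
    }

    /** Helper that builds a one-entry ZIP in memory, for trying the sketch out. */
    static byte[] zipOf(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A client that does not care about embedded content would simply ignore the
StreamEvent; the recursion is the client's choice, not the Pipeline's.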

A TextEvent contains extracted text, location information, etc.

A TikaInfoEvent provides information about the Pipeline execution:
progress, debugging messages, warnings, etc.
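
The four event kinds could be modelled along these lines. This is only a
sketch: the field names, the "dc:title" property name, the character
offset used as location information, and the string-valued level are all
assumptions, and the StreamInfo is reduced to a plain Map to keep the
block self-contained.

```java
import java.io.InputStream;
import java.util.Map;

// Hypothetical sketch of the event types; not an existing Tika API.
class EventSketch {

    interface ContentEvent {}

    /** One extracted metadata property, e.g. a Dublin Core element. */
    static final class MetadataEvent implements ContentEvent {
        final String name;
        final String value;
        MetadataEvent(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** An embedded stream together with what is known about it. */
    static final class StreamEvent implements ContentEvent {
        final InputStream stream;
        final Map<String, String> streamInfo;
        StreamEvent(InputStream stream, Map<String, String> streamInfo) {
            this.stream = stream;
            this.streamInfo = streamInfo;
        }
    }

    /** Extracted text plus (assumed) location information. */
    static final class TextEvent implements ContentEvent {
        final String text;
        final int offset;
        TextEvent(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /** Progress, debugging messages, warnings about the Pipeline run. */
    static final class TikaInfoEvent implements ContentEvent {
        final String level;
        final String message;
        TikaInfoEvent(String level, String message) {
            this.level = level;
            this.message = message;
        }
    }

    /** Example client-side dispatch on the event type. */
    static String describe(ContentEvent e) {
        if (e instanceof MetadataEvent) {
            MetadataEvent m = (MetadataEvent) e;
            return "metadata " + m.name + "=" + m.value;
        } else if (e instanceof StreamEvent) {
            return "stream " + ((StreamEvent) e).streamInfo;
        } else if (e instanceof TextEvent) {
            TextEvent t = (TextEvent) e;
            return "text@" + t.offset + " " + t.text;
        } else if (e instanceof TikaInfoEvent) {
            TikaInfoEvent i = (TikaInfoEvent) e;
            return "info " + i.level + ": " + i.message;
        }
        throw new IllegalArgumentException("unknown event type: " + e.getClass().getName());
    }
}
```

A client interested only in text would handle TextEvent and drop the rest.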

The order in which ContentEvents are produced by the Pipeline is not specified.

WDYT?

-Bertrand
