tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Tika discussions in Amsterdam
Date Thu, 03 May 2007 07:06:02 GMT

Quick summary of the Tika discussions from yesterday's text analysis
BOF at ApacheCon EU. It's the next morning now, so I'm probably
missing a lot of stuff...

* release early: The consensus seemed to be that we should do an early
release based on one of the existing codebases (most likely the Nutch
parser framework or the Lius codebase) instead of trying to come up
with the "perfect design" up front. We should perhaps do some initial
work to avoid excessive API changes later on, but that's secondary to
releasing working code early.

* processing phases: Tika has basically two main phases of operation:
content detection and content extraction. The content detection phase
tries to detect the content type of a document given a binary stream
and optional typing metadata. The main output from this phase is the
content type of the given document, but it is also possible to output
some easily accessible metadata (image size, etc.) already during this
phase. The content extraction phase is given the binary stream and the
detected typing information, and the expected output is the structured
text content and any available metadata from the document.
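
As a rough illustration of the two phases, here's a minimal Java
sketch. The detect/extract method names, the magic-byte check, and the
map-based output are hypothetical choices for this example, not an
agreed API:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-phase API: names here are illustrative only.
public class TwoPhaseSketch {

    /** Phase 1: guess a content type from the stream's leading bytes. */
    static String detect(InputStream in) throws Exception {
        byte[] head = new byte[4];
        int n = in.read(head);
        if (n >= 4 && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        return "text/plain";  // fallback when no magic bytes match
    }

    /** Phase 2: extract text and metadata, given the detected type. */
    static Map<String, String> extract(InputStream in, String type)
            throws Exception {
        Map<String, String> result = new HashMap<>();
        result.put("Content-Type", type);
        // A real extractor would dispatch to a type-specific parser here;
        // this sketch only reads plain text streams verbatim.
        if (type.equals("text/plain")) {
            result.put("content", new String(in.readAllBytes(), "UTF-8"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "hello world".getBytes("UTF-8");
        String type = detect(new ByteArrayInputStream(doc));
        System.out.println(extract(new ByteArrayInputStream(doc), type));
    }
}
```

Keeping the phases separate means a caller that only needs the content
type never pays for full extraction.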

* processing pipeline: There was a brief idea about organizing the
Tika framework as a pipeline of content detection and extraction
steps.

* plugin framework: We should design Tika so that existing plugin
frameworks can easily be used to assemble and configure it.
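
A minimal sketch of what plugin-based assembly could look like. The
ParserRegistry name and its register/parse methods are purely
illustrative; a real setup might let java.util.ServiceLoader or an
OSGi container populate the registry:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical plugin registry: parsers register under the content
// types they handle, so an external plugin framework can assemble
// the full set at startup.
public class ParserRegistry {
    private static final Map<String, Function<byte[], String>> PARSERS =
        new HashMap<>();

    static void register(String type, Function<byte[], String> parser) {
        PARSERS.put(type, parser);
    }

    static String parse(String type, byte[] data) {
        Function<byte[], String> p = PARSERS.get(type);
        if (p == null) {
            throw new IllegalArgumentException("no parser for " + type);
        }
        return p.apply(data);
    }

    public static void main(String[] args) {
        // A trivial "plugin" for plain text, registered by hand here.
        register("text/plain", data -> new String(data));
        System.out.println(parse("text/plain", "hello".getBytes()));
    }
}
```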

* structured text: Ideally we should use XHTML Basic as a SAX event
stream as the "structured text content" extracted from documents.
Namespaces can be used to extend the format with extra metadata like
PDF offsets for highlighting. Other options raised were plain text,
custom annotation or markup mechanisms, full document formats like
ODF, or perhaps a DOM tree.
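
A sketch of what emitting XHTML as a SAX event stream might look like,
using only the standard javax.xml.transform serializer. The
extractAsXhtml helper is a hypothetical name; the namespace URI is the
standard XHTML one:

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Sketch of "structured text as a SAX event stream": the extractor
// fires XHTML-shaped events rather than building a string, and the
// caller decides how to serialize or index them.
public class XhtmlEvents {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static String extractAsXhtml(String text) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));

        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttrs);
        handler.startElement(XHTML, "body", "body", noAttrs);
        handler.startElement(XHTML, "p", "p", noAttrs);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractAsXhtml("hello"));
    }
}
```

Because the consumer is just a ContentHandler, an indexer could count
characters or record element offsets without ever materializing the
markup.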

* additional processing components: It should be possible to embed
additional content extraction tools like thumbnail generators, image
index generators, etc. as plugins in the Tika framework. We might even
want to add support for tools like virus and spam detectors.

* container formats: It should be possible to use Tika recursively to
process container formats like zip files or many of the video formats.
The same mechanism could also be used to handle compressed or
encrypted files.
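
A sketch of the recursive idea, assuming a hypothetical extract()
entry point that the zip handling feeds each entry back into:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch of recursive container handling: a zip "parser" walks the
// entries and feeds each one back through the same extract() entry
// point, so nested archives unwind naturally. Names are illustrative.
public class ContainerSketch {

    static List<String> extract(String name, InputStream in)
            throws Exception {
        List<String> texts = new ArrayList<>();
        if (name.endsWith(".zip")) {
            ZipInputStream zip = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                texts.addAll(extract(entry.getName(), zip));  // recurse
            }
        } else {
            // Non-container content: treat as plain text in this sketch.
            texts.add(new String(in.readAllBytes(), "UTF-8"));
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Build a small zip in memory containing one text entry.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(buf);
        zip.putNextEntry(new ZipEntry("a.txt"));
        zip.write("hello".getBytes("UTF-8"));
        zip.closeEntry();
        zip.close();

        System.out.println(extract("docs.zip",
            new ByteArrayInputStream(buf.toByteArray())));
    }
}
```

The same shape would cover gzip or encrypted wrappers: unwrap one
layer, then hand the inner stream back to the top-level entry point.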

* integration: We should design the output format of Tika so that it
is easy to map to whatever Lucene, Solr, UIMA, and similar projects
expect as input.

* security: Tika should contain some safeguards against
denial-of-service attacks that trick a parser library into spending
excess memory or processing power on a single document.
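
One possible safeguard, sketched as a stream wrapper that enforces a
byte budget so a deliberately expansive document (e.g. a zip bomb)
cannot make a parser read without bound. The class name and limit
policy are illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Caps how many bytes a parser may pull from the underlying stream;
// exceeding the budget fails the parse instead of exhausting memory.
public class BoundedStream extends FilterInputStream {
    private long remaining;

    BoundedStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) throw new IOException("byte limit exceeded");
        int b = super.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) throw new IOException("byte limit exceeded");
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    public static void main(String[] args) {
        InputStream in = new BoundedStream(
            new ByteArrayInputStream(new byte[100]), 10);
        byte[] buf = new byte[64];
        int total = 0, n;
        try {
            while ((n = in.read(buf)) != -1) total += n;
        } catch (IOException e) {
            System.out.println("stopped after " + total + " bytes");
        }
    }
}
```

A CPU-time budget would need a different mechanism (e.g. running the
parse in a watched thread), but the stream wrapper covers the common
"document expands forever" case.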

* parser libraries: We might want to encourage external parser
libraries to join us at the ASF, but Tika by itself should not try to
reimplement or compete with existing parsers.


Jukka Zitting
