tika-dev mailing list archives

From Paul Jakubik <p...@purediscovery.com>
Subject Metadata Discussion Status
Date Mon, 02 Aug 2010 20:36:05 GMT
Hi,

A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page
to the Tika wiki.

Since then, with the help of Jukka Zitting, a solution has been described
for using the current Tika library to capture the metadata of each nested
document and associate it with the text extracted for that nested document.

What hasn't been accomplished is identifying a way to get at both the
metadata and the text of nested documents without the user writing a
ContentHandler.

Here are some possibilities for moving forward:

   - Decide that anyone who wants to identify the text and metadata
   associated with each nested document must write their own ContentHandler and
   ParserDecorator that gather the text and associate it with the corresponding
   metadata.
   - Point out easier ways to accomplish the same thing with the existing
   Tika libraries.
   - Provide a new Parser and ContentHandler combination that gathers
   subdocument text and metadata together and provides a stream of events
   (maybe something other than XHTML) with easier recursive document and
   metadata handling.
   - Come up with a way to add nested metadata to the XHTML stream without
   violating XHTML.
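
To give a feel for the kind of handler the first option asks users to write,
here is a rough sketch that uses only the JDK's SAX classes, not Tika itself.
It assumes, purely for illustration, that each nested document arrives wrapped
in a div with class "embedded" and that its metadata arrives as meta elements
inside that div. That markup, the class name NestedDocumentCollector, and the
collect helper are all hypothetical and are not something Tika emits or
provides. Note that meta elements inside a div are not valid XHTML, which is
the difficulty the last option refers to.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: groups text and metadata per nested document from a SAX stream. */
public class NestedDocumentCollector extends DefaultHandler {

    /** Text and metadata gathered for one (possibly nested) document. */
    public static final class Doc {
        public final Map<String, String> metadata = new LinkedHashMap<>();
        public final StringBuilder text = new StringBuilder();
    }

    // HYPOTHETICAL marker: a div with this class wraps each nested document.
    private static final String EMBEDDED_CLASS = "embedded";

    private final List<Doc> docs = new ArrayList<>();
    // Documents currently open; the innermost one is on top.
    private final Deque<Doc> open = new ArrayDeque<>();
    // One flag per open element: true if that element started a nested document.
    private final Deque<Boolean> startedDoc = new ArrayDeque<>();

    public NestedDocumentCollector() {
        Doc root = new Doc(); // the outermost (container) document
        docs.add(root);
        open.push(root);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        boolean nested = "div".equals(qName)
                && EMBEDDED_CLASS.equals(atts.getValue("class"));
        startedDoc.push(nested);
        if (nested) {
            Doc doc = new Doc();
            docs.add(doc);
            open.push(doc);
        } else if ("meta".equals(qName) && atts.getValue("name") != null) {
            // Attach the metadata entry to the innermost open document.
            open.peek().metadata.put(atts.getValue("name"), atts.getValue("content"));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        // Close the nested document only when its own wrapper element ends.
        if (startedDoc.pop()) {
            open.pop();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text always belongs to the innermost open document.
        open.peek().text.append(ch, start, length);
    }

    public List<Doc> getDocuments() {
        return docs;
    }

    /** Convenience helper: parse an XML string and return the collected documents. */
    public static List<Doc> collect(String xml) throws Exception {
        NestedDocumentCollector handler = new NestedDocumentCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getDocuments();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>Outer text</p>"
                + "<div class=\"embedded\"><meta name=\"title\" content=\"inner\"/>"
                + "<p>Inner text</p></div></body></html>";
        for (Doc doc : collect(xml)) {
            System.out.println(doc.metadata + " -> " + doc.text);
        }
    }
}
```

The handler keeps a stack of open documents so that text and metadata always
land on the innermost one. With real Tika output the hard part is the one
described above: the nested metadata has to be captured separately (for
example by a ParserDecorator) and matched up with the right stretch of text.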

Are there any thoughts on how to move forward? Is it okay if users who want
to extract nested documents with metadata resort to writing their own
content handlers and parser decorators? Or would the Tika team prefer to
offer an easier way for users to extract nested documents with metadata?

Paul
