tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Suggestion to return XML sax events instead of XHTML sax events
Date Thu, 23 Oct 2008 07:49:10 GMT

On Wed, Oct 22, 2008 at 6:05 PM, Stephane Bastian
<stephane_bastian@hotmail.com> wrote:
> 1) Is there a reason that would prevent Tika from returning Xml type events
> as opposed to XHTML events?

The idea behind Tika is to allow client applications to access the
content of a document with little or no knowledge of the internal
structure of that document. Inventing new XML vocabularies in Tika to
better match the underlying document structures would violate this
design goal.

We chose XHTML as the output format instead of a plain character
stream to convey structural information like headings and links to
client applications that need or benefit from such information. For
example, in the MP3 case you mentioned, a full-text indexer could use
the <h1/> tag to boost the importance of the title over the other
extracted text. This would be impossible or at least noticeably harder
if we used custom XML vocabularies.
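
As a rough illustration of that point, here is a minimal sketch of a
SAX handler that keeps <h1/> text apart from the rest of the extracted
text, so an indexer could weight the two differently. The class name
HeadingBoostHandler and its parse() helper are hypothetical and only
serve this example. The helper feeds a hand-written XHTML string
through the JDK's SAX parser. Whether a given Tika parser actually
puts the title inside <h1/> is an assumption taken from the MP3 case
above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: separates <h1> text from all other text.
class HeadingBoostHandler extends DefaultHandler {
    private final StringBuilder headings = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private int headingDepth = 0; // > 0 while inside an <h1> element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (isH1(localName, qName)) {
            headingDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isH1(localName, qName)) {
            headingDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route character data to the heading or the body buffer.
        (headingDepth > 0 ? headings : body).append(ch, start, length);
    }

    // Accept both namespace-aware (localName) and plain (qName) events.
    private static boolean isH1(String localName, String qName) {
        return "h1".equals(localName) || "h1".equals(qName);
    }

    String getHeadings() { return headings.toString().trim(); }
    String getBody() { return body.toString().trim(); }

    // Helper for this sketch only: parse an XHTML string with the JDK parser.
    static HeadingBoostHandler parse(String xhtml) {
        HeadingBoostHandler handler = new HeadingBoostHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler;
    }
}
```

With Tika, such a handler would presumably be passed to parser.parse()
as the ContentHandler argument, and the indexer would then give
getHeadings() a higher weight than getBody().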

> 2) Do you feel XML events would provide substantial improvements over the
> current solution?

No. You should be looking at the extracted metadata for itemized
pieces of information like "title" or "author". For example in the MP3
case you can get the title of the song and the name of the artist like this:

    InputStream stream = ...; // The MP3 stream
    Metadata metadata = new Metadata(); // Capture metadata

    Parser parser = new AutoDetectParser();
    ContentHandler handler = new DefaultHandler(); // Ignore all output
    parser.parse(stream, handler, metadata);

    String title = metadata.get(Metadata.TITLE);
    String artist = metadata.get(Metadata.AUTHOR);


Jukka Zitting
