tika-dev mailing list archives

From Stephane Bastian <stephane_bast...@hotmail.com>
Subject Re: Suggestion to return XML sax events instead of XHTML sax events
Date Thu, 23 Oct 2008 10:19:14 GMT
Hi Jukka,

Thanks for the detailed explanation.

Let me give you more details about how I'm using Tika then:

I've been using
String title = metadata.get(Metadata.TITLE);
and String artist = metadata.get(Metadata.AUTHOR);

to get the title and author of the MP3 file. However, I also need
other information, such as the year and album, which is not
available in the metadata. I then looked around and realized the
contentHandler contained everything I needed. Unfortunately, there
wasn't any reliable way to get the data (year, album) out of the handler
(remember that all the data is inside a P tag).
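
To illustrate the problem, here is a minimal sketch using only the JDK's SAX classes. The ParagraphCollector class and the sample markup are hypothetical; the markup is an assumed shape for the XHTML events, not Tika's actual output. It shows that a handler can collect the text of each P element, but nothing in the events says which paragraph is the year and which is the album.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text content of every <p> element.
public class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName) && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getParagraphs() {
        return paragraphs;
    }

    // Parses an XHTML string with the JDK's default (non-namespace-aware) SAX parser.
    public static List<String> collect(String xhtml) throws Exception {
        ParagraphCollector handler = new ParagraphCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getParagraphs();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample markup, for illustration only.
        String xhtml = "<html><body><h1>Some Title</h1>"
                + "<p>Some Artist</p><p>Some Album</p><p>2008</p></body></html>";
        // The values come back as an unlabeled list; a client can only
        // guess by position which entry is the album and which is the year.
        System.out.println(collect(xhtml)); // prints [Some Artist, Some Album, 2008]
    }
}
```

Any client code built on this would have to rely on the order of the paragraphs, which is exactly the part that is not reliable.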

So maybe the real solution to the problem is to put the year, album and
so on in the metadata, which makes perfect sense since they are all metadata
about the song.
Would you agree with this?
If so, let me know and I can make the modification and submit a patch.

All the best,

Stephane Bastian

Jukka Zitting wrote:
> Hi,
> On Wed, Oct 22, 2008 at 6:05 PM, Stephane Bastian
> <stephane_bastian@hotmail.com> wrote:
> > 1) Is there a reason that would prevent Tika from returning XML type events
> > as opposed to XHTML events?
> The idea behind Tika is to allow client applications to access the
> content of a document with little or no knowledge of the internal
> structure of that document. Inventing new XML vocabularies in Tika to
> better match the underlying document structures would violate this
> design goal.
> We chose XHTML as the output format instead of a plain character
> stream to convey structural information like headings and links to
> client applications that need or benefit from such information. For
> example in the MP3 case you mentioned a full text indexer could use
> the <h1/> tag to boost the importance of the title over the other
> extracted text. This would be impossible or at least noticeably harder
> if we used custom XML vocabularies.
> > 2) Do you feel XML events would provide substantial improvements over the
> > current solution?
> No. You should be looking at the extracted metadata for itemized
> pieces of information like "title" or "author". For example in the MP3
> case you can get the title of the song and the name of the artist like
> this:
>     InputStream stream = ...; // The MP3 stream
>     Metadata metadata = new Metadata(); // Capture metadata
>     Parser parser = new AutoDetectParser();
>     ContentHandler handler = new DefaultHandler(); // Ignore all output
>     parser.parse(stream, handler, metadata);
>     String title = metadata.get(Metadata.TITLE);
>     String artist = metadata.get(Metadata.AUTHOR);
> BR,
> Jukka Zitting
