tika-dev mailing list archives

From Stephane Bastian <stephane.bast...@otrix.com>
Subject Suggestion to return XML sax events instead of XHTML sax events
Date Wed, 22 Oct 2008 16:22:12 GMT
Hello all,

First off, let me say that I'm really impressed with the state of Tika. 
I've been following Tika pretty much since day one and feel that a *lot* 
has been done in such a short period of time, especially looking at the 
fairly small number of people working on it.

Now I've got a couple of comments and ideas for potential improvements, 
but the first one I would like to make is related to the XHTML SAX 
events. I feel that it's currently fairly difficult to use the 
information they are supposed to convey because XHTML-type events are 
returned, which limits the result to the tag names and structure allowed 
in XHTML. For instance, if you look at the MP3 parser, it currently 
returns something like this:

<h1>title of the song</h1>    --> the H1 is clearly just a container for 
the title; it could just as well be a P, a head/title, or something else
<p>name of the artist</p>
<p>year</p>
....

I feel that a more XML-ish set of events would make sense, something 
along these lines:

<title>title of the song</title>
<artist>name of the artist</artist>
<year>year of the song</year>

The above example conveys the same information, but in a way that can be 
more easily leveraged by a Tika user.
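
To illustrate what I mean by "more easily leveraged", here is a minimal sketch of a consumer for such events. It uses only the JDK's SAX classes, not Tika itself: the FieldCollector class, the <song> wrapper element, and the title/artist/year element names are all assumptions made for the example, and the XML string simply stands in for whatever events a parser would emit.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer: maps each leaf element name to its text content.
class FieldCollector extends DefaultHandler {
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A new element starts: discard any text seen before it.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Record non-empty text under the element name, then reset the buffer.
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            fields.put(qName, value);
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the collected name/value pairs.
    static Map<String, String> collect(String xml) throws Exception {
        FieldCollector handler = new FieldCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumed event stream, mirroring the example elements above.
        String xml = "<song><title>title of the song</title>"
                + "<artist>name of the artist</artist><year>2008</year></song>";
        System.out.println(collect(xml));
    }
}
```

With the current XHTML events, the same handler would only see h1 and p, so a user would have to guess which p is the artist and which is the year from its position.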

The same comment goes for most of the parsers (Image, asm, audio, 
Office...), except maybe the HTML parser, in which case it's fine 
because it's already HTML ;)

So here are my questions:
1) Is there a reason that would prevent Tika from returning XML-type 
events as opposed to XHTML events?

2) Do you feel XML events would provide substantial improvements over 
the current solution?


Once again, kudos to the team for the hard work.
All the best,

Stephane Bastian




