tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika discussions in Amsterdam
Date Tue, 08 May 2007 08:11:48 GMT
Hi,

On 5/3/07, Thilo Goetz <twgoetz@gmx.de> wrote:
> One other thing that we discussed was that it would make sense for some
> input formats (such as html) if Tika could produce output that allows
> mapping back to the input.  In other words, it should be possible
> (optionally) to know for each character in the output text where this
> character originated in the input.  This is useful, for example, for
> result highlighting.

I think the best technical solution to this (assuming we use XHTML SAX
events) is to embed the backmapping information as namespaced
attributes in the output event stream. For example, a PDF document
could result in something like this:

    <html xmlns="...xhtml" xmlns:pdf="...tika-pdf-annotations">
      <head>...</head>
      <body>
        <h1 pdf:location="...">...</h1>
        <p pdf:location="...">...</p>
      </body>
    </html>
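
As a rough sketch of how a parser might emit such events, here is a
minimal Java example using the standard SAX and JAXP classes. It is
only an illustration under assumptions: the class name BackmapSketch,
the urn:example:pdf-annotations namespace URI and the
"page=...;offset=..." location format are all made up here, and the
W3C XHTML namespace is used as a stand-in for the elided XHTML URI.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class BackmapSketch {
    // Standard XHTML namespace (assumed stand-in for the elided URI).
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    // Hypothetical annotation namespace, invented for this sketch.
    static final String PDF = "urn:example:pdf-annotations";

    // Emits one XHTML element carrying a namespaced pdf:location attribute.
    static void emit(ContentHandler h, String tag, String location, String text)
            throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute(PDF, "location", "pdf:location", "CDATA", location);
        h.startElement(XHTML, tag, tag, atts);
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        h.endElement(XHTML, tag, tag);
    }

    // Drives a small event stream into a serializer and returns the XML text.
    static String render() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler h = factory.newTransformerHandler();
            h.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            h.setResult(new StreamResult(out));

            AttributesImpl none = new AttributesImpl();
            h.startDocument();
            h.startPrefixMapping("", XHTML);
            h.startPrefixMapping("pdf", PDF);
            h.startElement(XHTML, "html", "html", none);
            h.startElement(XHTML, "body", "body", none);
            // The location values below are placeholders, not a real format.
            emit(h, "h1", "page=1;offset=0", "Title");
            emit(h, "p", "page=1;offset=42", "First paragraph");
            h.endElement(XHTML, "body", "body");
            h.endElement(XHTML, "html", "html");
            h.endPrefixMapping("pdf");
            h.endPrefixMapping("");
            h.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize SAX events", e);
        }
    }
}
```

Calling BackmapSketch.render() returns the serialized document, so the
annotated markup can be inspected as a string.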

If more granularity is needed, the parser component could produce
extra <span/> elements, for example one for each line or even each
word in the source document:

    ...<span pdf:location="...">...</span>...

> This may not be something for the early releases, but it would be good
> if we could keep this option in the back of our heads when designing the
> interfaces.

Agreed. I think a namespaced annotation mechanism like the one
suggested above would be an easy and forward-compatible way to add
such functionality.
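
To illustrate the consumer side, here is a small Java sketch of a SAX
handler that collects the output text and remembers at which output
offset each annotated element starts, so a highlighter could look up
the source location for a given character. The class name
BackmapReader, the urn:example:pdf-annotations namespace URI and the
location values are assumptions made for this example. A handler that
does not know the namespace would simply never ask for the attribute.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BackmapReader extends DefaultHandler {
    // Hypothetical annotation namespace, invented for this sketch.
    static final String PDF = "urn:example:pdf-annotations";

    // Extracted plain text of the document.
    final StringBuilder text = new StringBuilder();
    // Output offset at which an annotated element starts -> its location value.
    final TreeMap<Integer, String> starts = new TreeMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String location = atts.getValue(PDF, "location");
        if (location != null) {
            starts.put(text.length(), location);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Returns the location of the most recently started annotated element at
    // or before the given output offset. Simplification: text that follows a
    // nested annotated element is attributed to that element, not its parent.
    String locationOf(int offset) {
        Map.Entry<Integer, String> entry = starts.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    // Parses an XML string with a namespace-aware SAX parser.
    static BackmapReader read(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            BackmapReader reader = new BackmapReader();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), reader);
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse annotated XHTML", e);
        }
    }
}
```

After BackmapReader.read(xml), the text field holds the extracted
characters and locationOf(offset) gives the annotation covering a given
output character, or null if none has started yet.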

BR,

Jukka Zitting
