tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: [jira] Commented: (TIKA-105) Excel parser implementation based on POI's Event API
Date Wed, 26 Dec 2007 21:16:17 GMT
Hi,

On Dec 26, 2007 9:38 PM, Niall Pemberton <niall.pemberton@gmail.com> wrote:
> On Dec 26, 2007 7:19 PM, Keith R. Bennett <kbennett@bbsinc.biz> wrote:
> > When you say it includes the sheet name, you mean the name of each sheet
> > (tab) in the Excel file, right? Does it come out as bare text, or is it
> > encoded in a way that can be parsed (e.g. "{[Sheet: MySheet1]}")?  Or is
> > this configurable?
>
> Just plain text and not configurable ATM.

Having to use yet another parser on Tika output is something that we
should IMHO avoid as much as possible. A more reasonable way to make
the sheet structure available to clients that need it would be to use
the features of the XHTML output serialization.

How about something like this:

    <div class="sheet">
        <h1 class="sheet-title">....</h1>
        <p>...</p>
    </div>

or, if one wants to match Excel's screen representation more closely
(IMHO not a goal for Tika):

    <div class="sheet">
        <table>...</table>
        <p class="sheet-title">....</p>
    </div>

A client that needs the sheet content as structured data can then use
XPath queries like //div[@class='sheet'] or //*[@class='sheet-title']
to selectively extract the content of entire sheets or just their
titles.
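
As a rough illustration of those two queries, here is a minimal sketch
in Python using the standard library's ElementTree, which supports
this subset of XPath. The sample markup, sheet names, cell text and
helper names (sheet_titles, sheets_by_title) are all made up for the
example, and it assumes the elements carry no XML namespace. Real
XHTML output would normally be in the XHTML namespace, in which case
the queries need a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical output following the first proposal above.
# Sheet names and cell text are invented for illustration.
SAMPLE = """
<body>
  <div class="sheet">
    <h1 class="sheet-title">Budget</h1>
    <p>Rent 1200</p>
  </div>
  <div class="sheet">
    <h1 class="sheet-title">Contacts</h1>
    <p>Alice</p>
    <p>Bob</p>
  </div>
</body>
"""


def sheet_titles(root):
    """Return the text of every element marked class="sheet-title"."""
    # ElementTree spelling of //*[@class='sheet-title']
    return [el.text for el in root.findall(".//*[@class='sheet-title']")]


def sheets_by_title(root):
    """Map each sheet title to the list of its paragraph texts."""
    result = {}
    # ElementTree spelling of //div[@class='sheet']
    for div in root.findall(".//div[@class='sheet']"):
        title = div.find("*[@class='sheet-title']")
        result[title.text] = [p.text for p in div.findall("p")]
    return result
```

A client would parse the serialized output once, for example with
ET.fromstring(SAMPLE), and then call either helper depending on
whether it needs whole sheets or only their titles.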

> > We have a need to read Excel files with more structure than the usual
> > unstructured text document.  At minimum, it would be great to be
> > able to know where one sheet ends and the next begins.  Is this something
> > that would be appropriate to support, or does that go beyond the generic
> > unstructured text parsing mission of Tika?
>
> I'll leave that for the Tika devs to comment on.

One of the stated goals for Tika is to support not only unstructured
but also structured text extraction. This goal was discussed at the
search roundtable in Amsterdam (see the followup thread at
http://markmail.org/message/ggihw2cns53t6ayl) and implemented on the
Parser API level by making the parsers output XHTML SAX events instead
of character streams (see TIKA-53).

Note however that the goal here is not to make Tika replace the native
parser APIs, just to produce output structured enough to satisfy the
needs of typical Tika clients.

I think Keith's need to distinguish sheet boundaries is within the
scope of Tika, but someone who wants, for example, detailed cell
formatting information should instead look at the underlying POI
APIs.

BR,

Jukka Zitting
