tika-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: The ODF toolkit
Date Sat, 15 Nov 2008 13:39:02 GMT

I am really interested in helping with TIKA development. I really like the
TIKA design, which is built on SAX events!

> Hi,
> Check out the new ODF toolkit project [1]. Especially the ODFDOM
> library [2] seems like something we could use in Tika to better
> extract stuff from OpenDocument files.
> [1] http://odftoolkit.org/
> [2] http://odftoolkit.org/projects/odftoolkit/pages/ODFDOM
> BR,
> Jukka Zitting

I have seen this project, too. The problem is that it only provides mappings
of the object definitions to customized DOM objects, and that does not really
help you when importing the text.

TIKA's big advantage is the ability to use SAX events when importing XML
formats. I am currently working on a patch for the ODF importer that maps
the tags in content.xml to XHTML tags. This can be done very simply with a
new SAX filter: TagMappingContentHandler.
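
To illustrate the idea, here is a minimal sketch of such a tag-mapping filter.
It is not the actual patch: the class names (TagMappingSketch, MappingHandler,
Recorder) and the mapping table are assumptions made up for this example, and
mapping every text:h to h1 is a simplification that ignores the heading level.
Elements without a mapping are dropped, but their text is still passed on.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TagMappingSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Hypothetical filter: renames mapped elements, drops unmapped tags, keeps all text. */
    static class MappingHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final Map<String, String> mapping; // "{namespace}local" -> XHTML local name

        MappingHandler(ContentHandler delegate, Map<String, String> mapping) {
            this.delegate = delegate;
            this.mapping = mapping;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            String target = mapping.get("{" + uri + "}" + local);
            if (target != null) {
                // Source attributes are not copied in this sketch.
                delegate.startElement(XHTML_NS, target, target, new AttributesImpl());
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            String target = mapping.get("{" + uri + "}" + local);
            if (target != null) {
                delegate.endElement(XHTML_NS, target, target);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            delegate.characters(ch, start, length);
        }
    }

    /** Demo sink: records the mapped events as a string (no escaping of text). */
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(local).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(local).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static String convert(String odfXml) throws Exception {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("{" + TEXT_NS + "}p", "p");
        mapping.put("{" + TEXT_NS + "}h", "h1"); // simplification: heading level ignored
        mapping.put("{" + TEXT_NS + "}span", "span");

        Recorder recorder = new Recorder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to get namespace URIs and local names
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new MappingHandler(recorder, mapping));
        reader.parse(new InputSource(new StringReader(odfXml)));
        return recorder.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:text"
                + " xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
                + " xmlns:text='" + TEXT_NS + "'>"
                + "<text:h>Title</text:h><text:p>Hello</text:p></office:text>";
        System.out.println(convert(xml));
    }
}
```

Because the filter only forwards events, the document is never held in memory
as a DOM tree, which is the point of doing this with SAX.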

I am preparing to post two patches to TIKA's issue management system, which:

a) import ODF documents as structured XHTML, as mentioned above.

b) improve the conversion of XHTML SAX streams to plain text (doing better
than only reading characters() events). The problem here is the difference
between HTML block and span elements: just reading the element contents
creates whitespace issues...
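
A small sketch of the block/span distinction, under my own assumptions: the
class name BlockAwareTextSketch and the list of block elements are invented
for this example and are not taken from the patch. The handler appends a line
break when a block element ends and nothing when an inline element ends.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class BlockAwareTextSketch {
    // Assumed, incomplete list of XHTML block-level elements.
    static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "br"));

    /** Collects text and separates block elements with a single newline. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            boolean endsWithNewline =
                    out.length() > 0 && out.charAt(out.length() - 1) == '\n';
            // Inline elements (b, i, span, ...) add nothing; blocks end the line.
            if (BLOCK.contains(local) && out.length() > 0 && !endsWithNewline) {
                out.append('\n');
            }
        }
    }

    static String toText(String xhtml) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that local names are reported
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<p>Hello <b>wor</b>ld</p><p>Next</p></body></html>";
        System.out.print(toText(xhtml));
    }
}
```

With characters() alone, the two paragraphs in the example would run together
as "Hello worldNext"; treating p as a block keeps them on separate lines, while
the inline b element does not split the word "world".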

The same technique could be used for Open XML (Office 2007) documents. Using
the new classes of POI is a pain (the same problem: thousands of new objects
from a really big JAR file that contains only DOM, not SAX, mappings for
Open XML objects). A clean SAX solution would be preferable.

Just give me two more days to finish my patches!


Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
eMail: uwe@thetaphi.de
