tika-dev mailing list archives

From Hanssens Bart <Bart.Hanss...@fedict.be>
Subject improving odf / general questions on forms and deleted text
Date Sat, 25 Sep 2010 14:56:34 GMT
Hi,

I'm planning to further improve the ODF support in Tika. A few questions though,
that might also be useful for other formats:

Should Tika parse deleted text? XHTML has INS and DEL, but those are meant to be
used at the point where content was removed or inserted, whereas ODF stores removed
content at the very beginning of the document (so "fixing" this will hurt
performance, and I'm not sure that's worth it).
It can also be very confusing for the end user to get a result for "removed" text;
then again, that text is somewhere in the document...
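
To make the trade-off concrete, here is a minimal sketch (in Python, not Tika code) of the cheap option: a streaming SAX handler that simply skips the block of removed content at the start of the document unless asked to keep it. It assumes the removed content sits inside a text:tracked-changes element, as in ODF text documents; the names SkipTrackedChanges, extract_text, include_deleted and the SAMPLE fragment are hypothetical and the sample is heavily simplified.

```python
from io import BytesIO
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Simplified, hypothetical fragment: the deleted paragraph is stored up front,
# and text:change only marks the position where it used to be.
SAMPLE = (
    '<office:text xmlns:office="' + OFFICE_NS + '" xmlns:text="' + TEXT_NS + '">'
    '<text:tracked-changes><text:changed-region text:id="ct1">'
    '<text:deletion><text:p>removed</text:p></text:deletion>'
    '</text:changed-region></text:tracked-changes>'
    '<text:p>kept<text:change text:change-id="ct1"/> text</text:p>'
    '</office:text>'
).encode("utf-8")


class SkipTrackedChanges(ContentHandler):
    """Collects character data, optionally ignoring text:tracked-changes."""

    def __init__(self, include_deleted=False):
        super().__init__()
        self.include_deleted = include_deleted
        self.depth = 0      # > 0 while inside text:tracked-changes
        self.chunks = []

    def startElementNS(self, name, qname, attrs):
        # Start counting at text:tracked-changes, keep counting for its children.
        if self.depth or name == (TEXT_NS, "tracked-changes"):
            self.depth += 1

    def endElementNS(self, name, qname):
        if self.depth:
            self.depth -= 1

    def characters(self, content):
        if self.include_deleted or self.depth == 0:
            self.chunks.append(content)


def extract_text(xml_bytes, include_deleted=False):
    handler = SkipTrackedChanges(include_deleted)
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(BytesIO(xml_bytes))
    return "".join(handler.chunks)
```

Skipping stays a single streaming pass. Emitting the deleted text at its original position instead would mean buffering the changed regions and replaying them at each text:change marker, which is where the extra cost would come from.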

Forms: most form elements in ODF can be mapped to their HTML counterparts,
although I still have to check whether the result is always valid HTML (i.e., when
both the ODF parent and the form element are mapped to HTML, is the HTML form
element still allowed within the mapped parent?).
Should they be mapped to HTML forms in the first place? Or just to div / span?
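
The two options can be sketched as a small lookup (in Python, not Tika code). The table below is an assumption for illustration: it covers only a handful of elements from the ODF form namespace, the chosen HTML counterparts are one plausible mapping rather than an agreed one, and the names FORM_TO_HTML, map_form_element and use_html_forms are hypothetical.

```python
# Local names of some ODF form elements and a plausible HTML counterpart.
# Illustrative subset only; not an exhaustive or authoritative mapping.
FORM_TO_HTML = {
    "form": ("form", {}),
    "text": ("input", {"type": "text"}),
    "password": ("input", {"type": "password"}),
    "textarea": ("textarea", {}),
    "checkbox": ("input", {"type": "checkbox"}),
    "radio": ("input", {"type": "radio"}),
    "listbox": ("select", {}),
    "button": ("button", {}),
}


def map_form_element(local_name, use_html_forms=True):
    """Return (html_tag, attributes) for an ODF form element's local name.

    With use_html_forms=False, or for elements missing from the table,
    fall back to a neutral span that records the ODF name as its class.
    """
    if use_html_forms and local_name in FORM_TO_HTML:
        tag, attrs = FORM_TO_HTML[local_name]
        return tag, dict(attrs)  # copy so callers cannot mutate the table
    return "span", {"class": local_name}
```

The span fallback sidesteps the validity question entirely, at the price of losing the form semantics in the output.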

Best regards

Bart