tika-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Invisible text displayed for headings in doc files
Date Wed, 06 Apr 2011 14:30:31 GMT
Hi guys,

We are currently getting duplicated text for headings extracted from .doc files:

<p class="index_Heading"><b>29. No Partnership or Agency</b><b> XE "29. No Partnership or Agency" </b></p>

XE seems to be a flag in MS Word (see
http://taxonomist.tripod.com/indexing/wordflags.html), but I don't think it
should be displayed.
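
As a stopgap, the extracted text could be post-processed to drop these entries. The following is only a rough sketch in plain Java, not something Tika provides: the class XeFieldFilter and the method stripXeFields are hypothetical names, and the regex assumes the entry always shows up as the literal XE followed by a single double-quoted string, as in the output above.

```java
import java.util.regex.Pattern;

public class XeFieldFilter {
    // Assumption: an index entry appears in the extracted text as the literal
    // "XE", some whitespace (possibly a line break), then a double-quoted string.
    private static final Pattern XE_FIELD =
            Pattern.compile("\\s*\\bXE\\s+\"[^\"]*\"\\s*");

    // Hypothetical helper: removes every XE "..." run from the given text.
    public static String stripXeFields(String text) {
        return XE_FIELD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Text content of the heading paragraph shown above, tags removed.
        String extracted =
                "29. No Partnership or Agency XE \"29. No\nPartnership or Agency\" ";
        System.out.println(stripXeFields(extracted));
    }
}
```

This would also remove legitimate body text that happens to read XE "...", and it does not handle entries with embedded quotes, so it is no substitute for a fix in the parser.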

Have I missed a parameter somewhere that could be used to hide these things,
or shall I open a JIRA issue?

BTW, does the class name vary from one user to another (depending on the
stylesheet), or is it consistent?



Open Source Solutions for Text Engineering

