tika-dev mailing list archives

From "Denis Kildishev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1144) Changes in styling mechanism, inner table support and list support for Word Extractor
Date Wed, 03 Jul 2013 11:58:21 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Kildishev updated TIKA-1144:
----------------------------------

    Attachment: word_style.patch

The current version of the Word Extractor can be improved by adding an intermediate layer
for tag handling. This layer reduces the number of style tags that are opened and closed.
It also keeps track of the open tags, which is useful, for example, when handling lists.
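
To illustrate how tracking open tags can cut down on style tags, here is a minimal sketch. It is not the code from the patch: the class StyleTagTracker and its methods are hypothetical, tags are written as plain strings with no escaping, and each character run is assumed to arrive as its text plus the list of style tags it needs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: keeps style tags open across runs instead of reopening them. */
class StyleTagTracker {
    private final StringBuilder out = new StringBuilder();
    // Innermost open tag is at the head of the deque.
    private final Deque<String> open = new ArrayDeque<>();

    /** Writes one character run, reusing tags that are already open. */
    void run(String text, List<String> wanted) {
        // Close tags, innermost first, until every tag still open is wanted.
        while (!wanted.containsAll(open)) {
            close();
        }
        // Open only the tags that are not open yet.
        for (String tag : wanted) {
            if (!open.contains(tag)) {
                open.push(tag);
                out.append('<').append(tag).append('>');
            }
        }
        out.append(text); // no escaping in this sketch
    }

    /** Closes whatever is still open at the end of the paragraph. */
    void finish() {
        while (!open.isEmpty()) {
            close();
        }
    }

    String result() {
        return out.toString();
    }

    private void close() {
        out.append("</").append(open.pop()).append('>');
    }
}
```

With three runs styled bold, bold+italic, and bold, this sketch opens the b tag once and closes it once, where wrapping each run separately would open and close it three times.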
When the layer is backed by a buffer, it can be used in situations where tags are generated
somewhere other than the place where they have to be written. Examples of such situations
are inner tables and the handling of special character runs.
This patch adds XHTMLWriteBuffer, a class for buffered XHTML writing that maintains a list
of open tags and the active indents. A basic XHTMLWriteBuffer passes tags directly to
XHTMLContentHandler and only maintains the list of open tags and indents. If an instance of
this class is created with a buffer of XHTML events (as is done for inner tables), it saves
all commands for XHTMLContentHandler in the buffer so they can be written later by calling the apply method.
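
The two modes described above can be sketched roughly as follows. This is only an illustration under assumptions, not the XHTMLWriteBuffer from the patch: TagSink is a hypothetical stand-in for XHTMLContentHandler, StringSink exists only to make the output visible, and indents are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the XHTML handler; not Tika's API. */
interface TagSink {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

/** Renders events to a string so the result can be inspected. */
class StringSink implements TagSink {
    final StringBuilder out = new StringBuilder();
    public void startElement(String name) { out.append('<').append(name).append('>'); }
    public void characters(String text) { out.append(text); }
    public void endElement(String name) { out.append("</").append(name).append('>'); }
}

/** Sketch of a write buffer: tracks open tags and optionally defers events. */
class WriteBufferSketch implements TagSink {
    private final TagSink target;
    private final boolean buffered;
    private final Deque<String> openTags = new ArrayDeque<>();
    private final List<Runnable> pending = new ArrayList<>();

    WriteBufferSketch(TagSink target, boolean buffered) {
        this.target = target;
        this.buffered = buffered;
    }

    public void startElement(String name) {
        openTags.push(name);
        emit(() -> target.startElement(name));
    }

    public void characters(String text) {
        emit(() -> target.characters(text));
    }

    public void endElement(String name) {
        if (!name.equals(openTags.peek())) {
            throw new IllegalStateException("Not the innermost open tag: " + name);
        }
        openTags.pop();
        emit(() -> target.endElement(name));
    }

    /** Closes every open tag, innermost first. */
    void closeAll() {
        while (!openTags.isEmpty()) {
            endElement(openTags.peek());
        }
    }

    /** Replays the buffered events on the target; a no-op in direct mode. */
    void apply() {
        pending.forEach(Runnable::run);
        pending.clear();
    }

    // Direct mode forwards immediately; buffered mode records the event for apply().
    private void emit(Runnable event) {
        if (buffered) {
            pending.add(event);
        } else {
            event.run();
        }
    }
}
```

For an inner table, the caller could fill a buffered instance while reading the nested table, then open the outer cell on the real handler, call apply(), and close the cell.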
As for indent handling, it would be better to represent document classes as CSS styles.
This kind of style handling can reduce the amount of generated data. The code in this
patch generates a lot of information that could be moved to CSS. Also, this version of
indent handling does not account for some unknown factors, so it may not be adequate.
The list handling in this patch is also not ideal. It tries to map Word list classes
to a small set of HTML list types. Another problem is multilevel lists, which Word can
number like this: "1.1.1.1.". HTML's default mechanisms cannot produce such numbering.
It is possible to display such a list via CSS, but that mechanism is not included
in this patch.
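
As an alternative to CSS, and not something the patch does, the extractor could compute the multilevel label text itself and write it as plain text. A minimal sketch, assuming list items arrive in document order with a zero-based nesting level (the class MultiLevelNumbering is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds labels such as "1.2.1." from per-level counters. */
class MultiLevelNumbering {
    // counters.get(i) is the current item number at nesting level i.
    private final List<Integer> counters = new ArrayList<>();

    /** Returns the label for the next item at the given zero-based level. */
    String next(int level) {
        if (level < 0) {
            throw new IllegalArgumentException("level must be >= 0");
        }
        // Deeper levels restart when a shallower item appears.
        while (counters.size() > level + 1) {
            counters.remove(counters.size() - 1);
        }
        // Assumption: parent levels that were skipped are numbered 1.
        while (counters.size() < level) {
            counters.add(1);
        }
        if (counters.size() == level) {
            counters.add(0);
        }
        counters.set(level, counters.get(level) + 1);

        StringBuilder label = new StringBuilder();
        for (int n : counters) {
            label.append(n).append('.');
        }
        return label.toString();
    }
}
```

Calling next(0), next(1), next(2), next(3) in that order ends with the label "1.1.1.1.". This sketch ignores Word's other numbering formats (letters, Roman numerals, custom start values).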

PS. Sorry for the delay, I had some problems with svn diff.
                
> Changes in styling mechanism, inner table support and list support for Word Extractor
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1144
>                 URL: https://issues.apache.org/jira/browse/TIKA-1144
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Denis Kildishev
>            Priority: Minor
>         Attachments: word_style.patch
>
>
> The current version of POI provides mechanisms that can be used to support different kinds
> of styling and list handling. At the moment, Tika supports styling of separate character runs,
> but this approach is not ideal and can lead to visual glitches in the form of pseudo spaces.
> Another option is lists. Information about them can already be obtained from the POI
> representation, but it is not used in the current version of the Word Extractor.
> Another problem that can be solved now is inner tables. It is not closely related to the
> two problems above, but its solution is based on the same mechanism. An example of
> incorrect handling is a file with a table that contains another table in its first cell.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
