tika-dev mailing list archives

From "Denis Kildishev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1144) Changes in styling mechanism, inner table support and list support for Word Extractor
Date Wed, 03 Jul 2013 12:20:23 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Kildishev updated TIKA-1144:
----------------------------------

    Attachment: word_style.patch

Added a better version of the patch.
                
> Changes in styling mechanism, inner table support and list support for Word Extractor
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1144
>                 URL: https://issues.apache.org/jira/browse/TIKA-1144
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Denis Kildishev
>            Priority: Minor
>         Attachments: word_style.patch
>
>
> The mechanisms in the current version of POI can be used to support more kinds of styling
> and list handling. At the moment, Tika styles each Character Run separately, but this
> approach is not ideal and can lead to visual glitches in the form of pseudo-spaces.
> Lists are another case. Information about them can already be obtained from the POI
> representation, but this mechanism is not used in the current version of the Word Extractor.
> Nested (inner) tables can also be handled now. This problem is not closely related to the
> two above, but its solution is based on the same mechanism. An example of a file that is
> handled incorrectly is one with a table that contains another table in its first cell.
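
To illustrate the nested-table case described above, here is a minimal sketch. It is not taken from word_style.patch. It assumes each paragraph carries a table nesting depth (POI's HWPF Paragraph exposes one through getTableLevel()), and it uses a hypothetical Para class as a stand-in so the example runs without POI. The class name NestedTableSketch and the render method are likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedTableSketch {

    /** Hypothetical stand-in for a Word paragraph: text plus table nesting depth (0 = not in a table). */
    static final class Para {
        final String text;
        final int tableLevel;

        Para(String text, int tableLevel) {
            this.text = text;
            this.tableLevel = tableLevel;
        }
    }

    /** Emits open/close table markers whenever the nesting depth changes between paragraphs. */
    static List<String> render(List<Para> paras) {
        List<String> out = new ArrayList<>();
        int depth = 0;
        for (Para p : paras) {
            // Open tables until we reach this paragraph's depth.
            while (depth < p.tableLevel) {
                out.add("<table>");
                depth++;
            }
            // Close tables if this paragraph is shallower than the previous one.
            while (depth > p.tableLevel) {
                out.add("</table>");
                depth--;
            }
            out.add(p.text);
        }
        // Close anything still open at the end of the document.
        while (depth > 0) {
            out.add("</table>");
            depth--;
        }
        return out;
    }

    public static void main(String[] args) {
        // A table whose first cell contains another table.
        List<Para> doc = List.of(
                new Para("intro", 0),
                new Para("inner cell", 2),
                new Para("outer cell", 1),
                new Para("outro", 0));
        System.out.println(String.join(" ", render(doc)));
    }
}
```

Tracking depth instead of a single in-table flag is what lets the inner table open and close inside the outer one. A real extractor would also need row and cell boundaries, which this sketch leaves out.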

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
