tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
Date Fri, 19 Aug 2011 13:08:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087694#comment-13087694 ]

Michael McCandless commented on TIKA-683:
-----------------------------------------

bq. Right now I'm trying to figure out if I can add that behavior by subclassing RTFEditorKit/RTFReader.


Ooh that sounds interesting!  Does it have enough hooks so a subclass
can "tag along" to know what font is in-use and then intercept the
\'XX hex escapes?

Poaching either Harmony's parser or maybe OpenOffice's (C, but we
could port the parts we poach to Java) seems like a good way to go?

Either that or we make our own simple tokenizer?  The RTF spec looks
[relatively] simple enough, and Tika only needs to get the text out
(at least for today?), so we need not do heavy parsing of all
formatting / document structure.  A simple tokenizer that just decoded
the control words we care about (document charset, default font, font
charset, font table) should work well and be robust to parser bugs /
small errors in the doc.
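
To make the tokenizer idea concrete, here is a rough sketch in Java. It
is only an illustration under loud assumptions, not a proposed patch:
the class name RtfTextSketch and its extract method are hypothetical,
only \ansicpg is consulted for the charset (per-font \fcharset and \uN
Unicode escapes are not handled), and the code-page-to-Charset mapping
is naive. It buffers consecutive \'XX bytes so that multi-byte
encodings are decoded as one run.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: pulls plain text out of simple RTF input.
final class RtfTextSketch {

    static String extract(String rtf) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream(); // raw \'XX bytes
        Charset charset = Charset.forName("windows-1252");           // assumed default
        Deque<Boolean> skipStack = new ArrayDeque<>();               // skip flag per open group
        boolean skip = false;
        int i = 0, n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i);
            // \'XX hex escape: collect the byte, decode later as part of a run
            if (c == '\\' && i + 3 < n && rtf.charAt(i + 1) == '\'') {
                int hi = Character.digit(rtf.charAt(i + 2), 16);
                int lo = Character.digit(rtf.charAt(i + 3), 16);
                if (hi >= 0 && lo >= 0 && !skip) pending.write(hi * 16 + lo);
                i += 4;
                continue;
            }
            flush(pending, charset, out); // any other token ends a run of hex bytes
            if (c == '{') {
                skipStack.push(skip);
                i++;
            } else if (c == '}') {
                if (!skipStack.isEmpty()) skip = skipStack.pop();
                i++;
            } else if (c == '\\') {
                i++;
                if (i >= n) break;
                char d = rtf.charAt(i);
                if (!isAsciiLetter(d)) {
                    // control symbol: \* marks an ignorable destination
                    if (d == '*') skip = true;
                    else if (!skip && (d == '\\' || d == '{' || d == '}')) out.append(d);
                    i++;
                    continue;
                }
                int start = i;
                while (i < n && isAsciiLetter(rtf.charAt(i))) i++;
                String word = rtf.substring(start, i);
                int pStart = i;
                if (i < n && rtf.charAt(i) == '-') i++;
                while (i < n && rtf.charAt(i) >= '0' && rtf.charAt(i) <= '9') i++;
                String param = rtf.substring(pStart, i);
                if (i < n && rtf.charAt(i) == ' ') i++; // one space delimits the control word
                switch (word) {
                    case "ansicpg": if (!param.isEmpty()) charset = lookup(param); break;
                    case "par": if (!skip) out.append('\n'); break;
                    case "tab": if (!skip) out.append('\t'); break;
                    case "fonttbl": case "colortbl": case "stylesheet": case "info":
                        skip = true; break; // non-text destinations
                    default: break;         // ignore all other formatting
                }
            } else {
                if (!skip && c != '\r' && c != '\n') out.append(c);
                i++;
            }
        }
        flush(pending, charset, out);
        return out.toString();
    }

    private static boolean isAsciiLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    private static void flush(ByteArrayOutputStream pending, Charset cs, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }

    // Naive assumption: code page N maps to a Java charset named "windows-N".
    private static Charset lookup(String codePage) {
        try {
            return Charset.forName("windows-" + codePage);
        } catch (IllegalArgumentException e) {
            return Charset.forName("windows-1252");
        }
    }
}
```

Unknown control words are simply ignored, which is what should keep
something like this tolerant of small errors in the doc. A real version
would also need the font table's per-font charsets and \uN handling.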

I'm also worried about the test coverage of our RTF
parsing... would be nice to find (or somehow randomly generate) some
biggish collection of RTF + "expected text" test cases.  Maybe we can
poach tests from OpenOffice....

I noticed some tests allow for / expect extra whitespace to be
inserted in the returned text, but that makes me nervous... I think
(ideally) Tika shouldn't insert extra whitespace if we can help it.
Though, some cases likely need it, eg text from adjacent table cells.


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
