tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
Date Wed, 17 Aug 2011 15:28:29 GMT

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086384#comment-13086384 ]

Michael McCandless commented on TIKA-683:
-----------------------------------------

Thanks Chris!

Actually both Christian's patch and mine are test cases.

Christian's test case fails (showing this issue); we don't yet have a patch to fix this issue
(but we know what's wrong -- we have to handle the \ucN control codes).
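
For readers unfamiliar with the control word: in RTF, \ucN declares how many
fallback characters follow each \uN Unicode escape, and a Unicode-aware reader
is expected to discard that many after emitting the Unicode character. Below is
a minimal sketch of that skipping logic, not Tika's actual parser code. The
class and method names are hypothetical, and it assumes a simplified input:
no group scoping of \ucN, \'hh bytes decoded as Latin-1 rather than through
the document code page, and all other control words simply dropped.

```java
class UcSkipSketch {
    // Hypothetical helper for illustration only; this is not Tika's RTF parser.
    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int ucSkip = 1;      // RTF default: one fallback character per Unicode escape
        int pendingSkip = 0; // fallback characters still to be discarded
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 3 < rtf.length() && rtf.charAt(i + 1) == '\'') {
                // A \'hh hex escape counts as ONE fallback character.
                if (pendingSkip > 0) {
                    pendingSkip--;
                } else {
                    // Simplification: treat the byte as Latin-1.
                    out.append((char) Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                }
                i += 4;
            } else if (c == '\\') {
                // Control word: ASCII letters, optional signed decimal parameter,
                // then an optional single space that acts as a delimiter.
                int j = i + 1;
                while (j < rtf.length() && rtf.charAt(j) >= 'a' && rtf.charAt(j) <= 'z') j++;
                String word = rtf.substring(i + 1, j);
                int k = j;
                if (k < rtf.length() && rtf.charAt(k) == '-') k++;
                while (k < rtf.length() && rtf.charAt(k) >= '0' && rtf.charAt(k) <= '9') k++;
                boolean hasNum = k > j && rtf.charAt(k - 1) != '-';
                int num = hasNum ? Integer.parseInt(rtf.substring(j, k)) : 0;
                if (k < rtf.length() && rtf.charAt(k) == ' ') k++;
                if (word.equals("uc") && hasNum) {
                    ucSkip = num; // remember how many fallback characters to drop
                } else if (word.equals("u") && hasNum) {
                    // The parameter is a signed 16-bit value; negatives wrap around.
                    out.append((char) (num < 0 ? num + 65536 : num));
                    pendingSkip = ucSkip;
                }
                // Simplification: every other control word is ignored.
                i = k;
            } else {
                if (pendingSkip > 0) pendingSkip--; else out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, the input `\uc2\u26085\'93\'fa` decodes to the single
character U+65E5, because both fallback bytes are skipped. Without the leading
`\uc2`, only the first byte is skipped and the second leaks into the output as
a stray extra character, which is the kind of symptom a reader that ignores
\ucN would show.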

My test case (TIKA-683-unicode-testcase.patch) passes and can be committed right away -- it's
testing another aspect of RTF+Unicode which (happily) seems to be working correctly.

I also attached a new test case, passing, on TIKA-422, so if you could commit that one also
that'd be great!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-European characters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira