tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
Date Fri, 02 Sep 2011 10:58:09 GMT

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095904#comment-13095904 ]

Jukka Zitting commented on TIKA-683:
------------------------------------

+1, I'm eager to see us replace the javax.swing dependency with something we can directly fix
and improve.

The org.apache.tika.sax.SafeContentHandler class already does some sanitization of SAX events,
so that might be a good place to also check that tags are correctly nested. Though as Uwe
said, ideally the generator of the SAX events would already take care of producing valid output.
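
As a rough sketch of the nesting check suggested above (this is not Tika code: NestingCheckHandler is a
hypothetical name, and the example relies only on the JDK's org.xml.sax classes rather than any Tika
class), a decorator could keep a stack of open elements and reject an end tag that does not match the
most recently opened one:

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical decorator (assumption, not an existing Tika class) that
 * forwards SAX events to a downstream handler and fails fast when
 * start/end element events are not correctly nested.
 */
class NestingCheckHandler extends XMLFilterImpl {

    // Qualified names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<String>();

    NestingCheckHandler(ContentHandler downstream) {
        // XMLFilterImpl forwards every event to this handler.
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        open.push(qName);
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String expected = open.peek();
        if (!qName.equals(expected)) {
            // Leave the stack untouched so the caller can still recover.
            throw new SAXException("Mismatched end tag </" + qName + ">, expected "
                    + (expected == null ? "no open element" : "</" + expected + ">"));
        }
        open.pop();
        super.endElement(uri, localName, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed element <" + open.peek() + ">");
        }
        super.endDocument();
    }
}
```

A sanitizing handler might prefer to repair the stream (for example by closing the inner element first)
instead of throwing; throwing is used here only because it keeps the sketch short.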

PS. I'd rather use a separate .java file for the ExtractRTFText class than have it as a static
inner class inside RTFParser. We can keep it package-private if we don't want to expose it
directly to downstream clients.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, TIKA-683.patch,
>                      TIKA-683.patch, testRTFJapanese.rtf,
>                      testUnicodeUCNControlWordCharacterDoubling.rtf,
>                      testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-European characters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
