tika-dev mailing list archives

From "Cristian Vat (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
Date Thu, 18 Aug 2011 22:54:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087367#comment-13087367 ]

Cristian Vat commented on TIKA-683:
-----------------------------------

Thanks Mike for looking into the issues. I also know very little about RTF :)

Yes, the skipping is basically "skip N ANSI chars" after each Unicode character.
Actually, the JDK's RTFEditorKit/RTFReader already does this, and does it well as far as I
could see.
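
To illustrate the skipping rule, here is a minimal sketch, assuming the usual RTF convention
that \ucN sets how many fallback characters follow each \uN and that a "\'xx" escape counts
as one character. The class and method names (UnicodeSkipSketch, decode) are hypothetical;
this is neither Tika nor JDK code, and it ignores groups, binary data and everything else a
real parser needs.

```java
final class UnicodeSkipSketch {
    /** Decodes plain text, hex escapes and the "u"/"uc" control words only (no groups). */
    static String decode(String rtf) {
        StringBuilder out = new StringBuilder();
        int len = rtf.length();
        int ucSkip = 1; // assumed RTF default: one fallback character per Unicode character
        int toSkip = 0; // fallback characters still to discard
        int i = 0;
        while (i < len) {
            char c = rtf.charAt(i);
            if (c != '\\') {
                // Plain character: either a fallback to discard or real text.
                if (toSkip > 0) toSkip--; else out.append(c);
                i++;
            } else if (rtf.startsWith("\\'", i)) {
                if (i + 3 >= len) break; // truncated hex escape, stop
                if (toSkip > 0) {
                    toSkip--; // a hex escape counts as one skippable character
                } else {
                    // Placeholder only: a real parser would decode this byte
                    // with the current font charset, not as Latin-1.
                    out.append((char) Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                }
                i += 4;
            } else {
                int j = i + 1;
                while (j < len && isAsciiLetter(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                if (word.isEmpty()) {
                    // Control symbol (escaped backslash or brace): one literal character.
                    if (j < len) {
                        if (toSkip > 0) toSkip--; else out.append(rtf.charAt(j));
                        j++;
                    }
                    i = j;
                    continue;
                }
                // Optional signed decimal parameter.
                int numStart = j;
                if (j < len && rtf.charAt(j) == '-') j++;
                while (j < len && Character.isDigit(rtf.charAt(j))) j++;
                String num = rtf.substring(numStart, j);
                boolean hasNum = !num.isEmpty() && !num.equals("-");
                if (!hasNum) j = numStart;
                if (j < len && rtf.charAt(j) == ' ') j++; // a single space ends the control word
                if (word.equals("uc") && hasNum) {
                    ucSkip = Integer.parseInt(num);
                } else if (word.equals("u") && hasNum) {
                    int code = Integer.parseInt(num);
                    if (code < 0) code += 65536; // the parameter is a signed 16-bit value
                    out.append((char) code);
                    toSkip = ucSkip; // discard the fallback characters that follow
                } else if (toSkip > 0) {
                    toSkip--; // any other control word counts as one skipped character
                }
                i = j;
            }
        }
        return out.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        // Two Unicode characters, each followed by a one-character '?' fallback.
        String decoded = decode("\\uc1\\u26085?\\u26412?");
        System.out.println(decoded.length()); // 2: both fallbacks were skipped
    }
}
```

A reader that emits the fallback characters instead of skipping them would output each
character twice, which may be one way the doubling reported in this issue can arise.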

There are also other flaws in the filtering we currently do. For example, skipping of binary
data sequences is not handled correctly...

I went through all the classes in or used by RTFEditorKit, and it appears to handle most
things correctly except the "\'xx" escape, where it uses a default translation table that
does not take the current font's charset into account.
Right now I'm trying to figure out whether I can add that behavior by subclassing
RTFEditorKit/RTFReader. I think that would be the best solution to this issue and other
related ones. It might also avoid temporary files and improve performance.
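
For illustration, the charset-aware handling of "\'xx" could look roughly like the sketch
below. It assumes the charset of the current font is already known and passed in; the names
HexEscapeSketch and decode are hypothetical, and this is not how RTFEditorKit or Tika
implement it. Consecutive escapes are collected into one byte run before decoding, so that
multi-byte charsets have a chance of decoding correctly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class HexEscapeSketch {
    /** Decodes runs of \'xx escapes with the given charset; all other text is copied as-is. */
    static String decode(String rtf, Charset fontCharset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtf.length()) {
            if (rtf.startsWith("\\'", i) && i + 3 < rtf.length()) {
                // Collect the byte; throws NumberFormatException on malformed hex digits.
                pending.write(Integer.parseInt(rtf.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, fontCharset);
                out.append(rtf.charAt(i));
                i++;
            }
        }
        flush(pending, out, fontCharset);
        return out.toString();
    }

    /** Decodes the collected bytes as one run and appends the result. */
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset cs) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        String rtf = "\\'cf\\'f0\\'e8";
        // The same bytes give different text depending on the charset used.
        String cyrillic = decode(rtf, Charset.forName("windows-1251"));
        String western = decode(rtf, Charset.forName("windows-1252"));
        System.out.println(cyrillic.equals(western)); // false
    }
}
```

The same three bytes decode to Cyrillic letters under windows-1251 and to accented Latin
letters under windows-1252, which is why a single default translation table is not enough.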

Just in case it can't be done with subclassing, does anybody know what the licensing
restrictions on the JDK classes are (mainly RTFEditorKit and RTFReader)? It may be doable
by modifying them a little...

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-European characters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
