tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
Date Mon, 15 Aug 2011 10:55:31 GMT

    [ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085021#comment-13085021 ]

Michael McCandless commented on TIKA-683:
-----------------------------------------


NOTE: I know very little about RTF!  So please forgive/correct any
confusions below:

It looks like we need a stack to record the \ucN control words we've
encountered at each group depth, and we must then skip N ANSI chars
after each \uXXXX we see?  (Similar to how we track the charset with
charsetQueue now).

I.e., on seeing \uXXXX (possibly followed by a trailing space, which
does not count toward the skip count), we parse and keep that XXXX
Unicode character, re-emitting the \uXXXX in our output data, but then
we remove the following N ANSI chars.
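A minimal sketch of that per-group \ucN stack, against a deliberately
simplified tokenizer (this is hypothetical illustration code, not
Tika's actual RTFParser; it ignores control symbols like \' outside
the skip loop and negative \u parameters):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UcSkipSketch {
    public static String extract(String rtf) {
        Deque<Integer> stack = new ArrayDeque<>();
        int ucSkip = 1;  // RTF default skip count when no \ucN has been seen
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '{') {              // group start: save current skip count
                stack.push(ucSkip);
                i++;
            } else if (c == '}') {       // group end: restore enclosing skip count
                if (!stack.isEmpty()) ucSkip = stack.pop();
                i++;
            } else if (c == '\\') {
                int j = i + 1;
                while (j < rtf.length() && Character.isLetter(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                int k = j;
                while (k < rtf.length() && Character.isDigit(rtf.charAt(k))) k++;
                int param = (k > j) ? Integer.parseInt(rtf.substring(j, k)) : 0;
                // a single trailing space is the control-word delimiter,
                // not content, and does not count toward the skip count
                if (k < rtf.length() && rtf.charAt(k) == ' ') k++;
                if (word.equals("uc")) {
                    ucSkip = param;
                } else if (word.equals("u")) {
                    out.append((char) param);
                    // skip ucSkip fallback chars; a \'xx escape counts as one
                    for (int s = 0; s < ucSkip && k < rtf.length(); s++) {
                        if (rtf.charAt(k) == '\\' && k + 3 < rtf.length()
                                && rtf.charAt(k + 1) == '\'') {
                            k += 4;
                        } else {
                            k++;
                        }
                    }
                }
                i = k;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this, `extract("{\\uc1 \\u26085 ?day}")` keeps the Unicode char and
drops the single '?' fallback, and a group end restores the enclosing
skip count, so an inner \uc0 does not leak out of its group.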

Some other things I noticed in RTFParser.java; I'm not sure if they
are really a problem in practice:

  * I'm worried about how we replace \cell with \u0020\cell --
    depending on the last \ucN control word, this could mean we
    incorrectly skip some number of ANSI chars?  Changing to
    {\u20}\cell would be safer, since on group end the pending skip
    chars are reset to 0.

  * But then I also wonder if all the additional groups we are
    creating (because we surround each \uXXXX with { }) are somehow
    costly, eg if it causes RTFEditorKit to use more RAM / be slower /
    something.

  * When we look for the \ansicpgNNNN control word, I noticed we then
    look up the NNNN in the FONTSET_MAP -- is that wrong?  E.g., when
    I look at the possible values for NNNN (at
    http://latex2rtf.sourceforge.net/rtfspec_6.html) I see a bunch of
    numbers that aren't in the FONTSET_MAP.  We also use FONTSET_MAP
    for \fcharsetNNN, but the values for that control word look
    correct.

  * We don't seem to handle the opening charset in the RTF header (ie,
    \ansi, \mac, \pc, \pca)?
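Handling those header keywords could start from a small lookup table
along these lines (a hypothetical sketch; the keyword-to-charset
pairings follow the RTF spec, but the Java charset names are my
assumption of the closest match, and RTFParser has no such table
today):

```java
import java.util.Map;

public class RtfHeaderCharsets {
    // RTF header charset keywords (\ansi, \mac, \pc, \pca) mapped to
    // assumed Java charset names -- these names are my guess, unverified
    // against Tika's existing charset handling.
    static final Map<String, String> HEADER_CHARSETS = Map.of(
            "ansi", "windows-1252",
            "mac",  "MacRoman",
            "pc",   "cp437",
            "pca",  "cp850");
}
```

The idea would be to consult this table when one of the four keywords
appears right after the opening {\rtf1, before any \ansicpgNNNN
override.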


> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: TIKA-683.patch, testRTFJapanese.rtf, testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
