tika-dev mailing list archives

From Oleg Tikhonov <olegtikho...@gmail.com>
Subject Re: [jira] Commented: (TIKA-422) Wrong charset conversion in some RTF documents.
Date Wed, 12 May 2010 13:52:10 GMT
Hi Jukka.
Here are two possibilities:
1. The RTF parse package from Nutch:
http://www.docjar.com/docs/api/org/apache/nutch/parse/rtf/package-index.html

2. The OpenOffice Writer Java API:
http://wiki.services.openoffice.org/wiki/API/Samples/Java/Writer/TextDocumentStructure

Oleg.


On Wed, May 12, 2010 at 4:06 PM, Jukka Zitting (JIRA) <jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866559#action_12866559]
>
> Jukka Zitting commented on TIKA-422:
> ------------------------------------
>
> Does anyone know an alternative RTF parser in Java with a friendly license
> [1]? It looks like there's little we can do about this as long as we're
> stuck with the Swing RTF parser.
>
> [1] http://www.apache.org/legal/resolved.html
>
>
> > Wrong charset conversion in some RTF documents.
> > -----------------------------------------------
> >
> >                 Key: TIKA-422
> >                 URL: https://issues.apache.org/jira/browse/TIKA-422
> >             Project: Tika
> >          Issue Type: Bug
> >          Components: parser
> >    Affects Versions: 0.7
> >            Reporter: Piotr B.
> >         Attachments: test-windows-1250.rtf
> >
> >
> > RTF parser uses javax.swing.text.rtf, but it sucks.
> > It doesn't support '\ansicpg' tag (cite from RTF file format
> > specification:
> > "This keyword represents the default ANSI code page used to perform the
> > Unicode to ANSI conversion when writing RTF text").
> > Unfortunately Windows WordPad saves nonascii characters using \ansicpg
> > instead of supported by javax.swing.text.rtf unicode characters.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
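
To make the \ansicpg problem from the quoted issue concrete, here is a rough
sketch of what honouring that keyword could look like. It is not code from
Tika, Swing, or Nutch. The class and method names (AnsiCpgSketch, charsetFor,
decodeHexEscapes) are made up for illustration, and it assumes the code page
is one of the Windows 125x pages, which Java exposes as "windows-125x"
charsets. Anything else falls back to windows-1252. It only handles the \'hh
hex escapes and ignores every other RTF control word.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnsiCpgSketch {
    // Matches the \ansicpgN keyword in an RTF header.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d{1,5})");

    // Hypothetical helper: pick a Java charset from the \ansicpg keyword.
    // Assumption: only code pages 1250-1258 are mapped; others use windows-1252.
    static Charset charsetFor(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        if (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            if (cp >= 1250 && cp <= 1258) {
                return Charset.forName("windows-" + cp);
            }
        }
        return Charset.forName("windows-1252");
    }

    // Hypothetical helper: decode \'hh escapes as bytes in the given charset.
    // All other characters are copied through unchanged.
    static String decodeHexEscapes(String text, Charset cs) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\'", i) && i + 4 <= text.length()) {
                // Two hex digits after \' form one byte in the ANSI code page.
                pending.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, cs);
                out.append(text.charAt(i));
                i++;
            }
        }
        flush(pending, out, cs);
        return out.toString();
    }

    // Convert any buffered bytes to characters and append them to the output.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset cs) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        String header = "{\\rtf1\\ansi\\ansicpg1250\\deff0";
        Charset cs = charsetFor(header);
        System.out.println(cs.name());
        System.out.println(decodeHexEscapes("\\'b3\\'f3d\\'9f", cs));
    }
}
```

The point of the sketch is only that the byte 0xB3 decodes to a different
character under windows-1250 than under windows-1252, so a parser that
ignores \ansicpg will garble such text.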


-- 
Best regards, Oleg.
