tika-dev mailing list archives

From "Cristian Vat (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-642) Few of RTF files not extracting properly
Date Sat, 06 Aug 2011 20:51:27 GMT

    [ https://issues.apache.org/jira/browse/TIKA-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080459#comment-13080459
] 

Cristian Vat commented on TIKA-642:
-----------------------------------

An RTF file starts with "{\rtf1" and ends when a "}" closes that original opening block.
So the attached file is not really valid according to the RTF spec.

Keeping a count of open blocks could work, but it's not quite trivial, since we also need
to properly handle binary entities such as embedded images.
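
To make the block-counting idea concrete, here is a minimal sketch in Java. It is not the
Tika or Swing implementation, and the names RtfGroupCounter and endOfFirstGroup are
hypothetical. It assumes the usual RTF rules: "\{", "\}" and "\\" are escaped characters,
and "\binN" is followed by N raw bytes that must be skipped without interpretation. It
returns the offset just past the "}" that closes the first "{", so a caller could
truncate the input there before handing it to a parser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class RtfGroupCounter {

    /** Returns the index just past the "}" closing the first "{", or -1 if none. */
    static int endOfFirstGroup(byte[] data) {
        int depth = 0;
        int i = 0;
        while (i < data.length) {
            int c = data[i] & 0xFF;
            if (c == '{') {
                depth++;
                i++;
            } else if (c == '}') {
                depth--;
                i++;
                if (depth == 0) {
                    return i;          // the original opening block is closed here
                }
                if (depth < 0) {
                    return -1;         // a close with no matching open
                }
            } else if (c == '\\') {
                i = skipControl(data, i);
            } else {
                i++;                   // plain text
            }
        }
        return -1;                     // the first block was never closed
    }

    /** Skips one control word or control symbol starting at the backslash. */
    private static int skipControl(byte[] data, int start) {
        int i = start + 1;
        if (i >= data.length) {
            return i;
        }
        if (!isLetter(data[i])) {
            return i + 1;              // control symbol such as \{ \} \\
        }
        int wordStart = i;
        while (i < data.length && isLetter(data[i])) {
            i++;
        }
        String word = new String(data, wordStart, i - wordStart, StandardCharsets.US_ASCII);

        // Optional numeric parameter; a "-" counts only if a digit follows it.
        boolean negative = false;
        if (i + 1 < data.length && data[i] == '-' && isDigit(data[i + 1])) {
            negative = true;
            i++;
        }
        long param = 0;
        boolean hasParam = false;
        while (i < data.length && isDigit(data[i])) {
            param = Math.min(param * 10 + (data[i] - '0'), Integer.MAX_VALUE);
            hasParam = true;
            i++;
        }
        if (i < data.length && data[i] == ' ') {
            i++;                       // a single space delimiter belongs to the control word
        }
        if (word.equals("bin") && hasParam && !negative) {
            // The next `param` bytes are raw binary data; braces inside do not count.
            i = (int) Math.min((long) data.length, i + param);
        }
        return i;
    }

    private static boolean isLetter(byte b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }
}
```

For an input such as "{\rtf1 a}}junk" this returns the offset right after the first
"}", so the stray close and anything after it could be dropped. Whether truncating is the
right recovery for files like the attached one is a separate question.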

The CompareRite software that seems to have generated this RTF has been retired since
2002, so it is unlikely that anybody will fix how it generates RTF.
Note that this can also cause problems for other software that uses the Swing RTF parser,
and for any other software, Java or not, that doesn't know how to handle this special case.

> Few of RTF files not extracting properly
> ----------------------------------------
>
>                 Key: TIKA-642
>                 URL: https://issues.apache.org/jira/browse/TIKA-642
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9, 1.0
>         Environment: All
>            Reporter: Manish
>         Attachments: FIRM GAS GTC B RED.DOC
>
>
> A few of the RTF files don't get extracted properly.
> This is the stack trace: 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.rtf.RTFParser@616d071a
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> Caused by: java.io.IOException: Too many close-groups in RTF text
> at javax.swing.text.rtf.RTFParser.write(RTFParser.java:156)
> at javax.swing.text.rtf.RTFParser.writeSpecial(RTFParser.java:101)
> at javax.swing.text.rtf.AbstractFilter.write(AbstractFilter.java:158)
> at javax.swing.text.rtf.AbstractFilter.readFromStream(AbstractFilter.java:88)
> at javax.swing.text.rtf.RTFEditorKit.read(RTFEditorKit.java:65)
> at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:112)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
