tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
Date Mon, 01 Feb 2016 14:46:40 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126317#comment-15126317 ]

Nick Burch commented on TIKA-1845:
----------------------------------

Near the top of the JIRA page are some buttons. Please hit "More", then "Attach Files", and
upload the smallest file you have that triggers the issue. We can then use that for
investigating, testing and (hopefully!) later unit testing of fixes.

> Unable to extract content from certain RTFs using tika-server versions since 1.5 
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1845
>                 URL: https://issues.apache.org/jira/browse/TIKA-1845
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.6, 1.9, 1.11
>         Environment: Windows
>            Reporter: Ian Williams
>
> I have some patient letters that are RTF documents.  When I extract the text from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing.  I'm not sure how to attach files to this issue so here is a link to an Evernote note containing an example RTF that fails:
> https://www.evernote.com/shard/s66/sh/4a003611-2400-4959-a1cc-2be5b3efe2cf/284a6f2dd3e0a290
> I wondered whether the error might be related to the following change that was introduced in 1.6?:
>   * Made RTFParser's list handling slightly more robust against corrupt
>     list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> ====================
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
>         at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
>         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:370)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>         at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>         at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NullPointerException
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
>         at org.apache.tika.parser.rtf.RTFEmbObjHandler.extractObj(RTFEmbObjHandler.java:230)
>         at org.apache.tika.parser.rtf.RTFEmbObjHandler.handleCompletedObject(RTFEmbObjHandler.java:198)
>         at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1357)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:456)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>         ... 34 more
> Feb 01, 2016 2:26:25 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
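
For anyone who wants to repeat step 1 of the quoted report without curl, the same request can probably be issued from Java. This is only a rough sketch, assuming Java 11 or later and a tika-server already listening on localhost:9998 as in the log above. The class name TikaPutRequestSketch and the helper buildRequest are made up for illustration and are not part of Tika; the URL, the two headers and the file name are copied from the curl command in the report.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika: mirrors the curl command from the report.
public class TikaPutRequestSketch {

    // Builds a PUT request carrying the raw RTF bytes, with the same
    // Content-Type and Accept headers that the curl command sets.
    static HttpRequest buildRequest(String endpoint, byte[] rtfBytes) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/rtf")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(rtfBytes))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            // e.g. java TikaPutRequestSketch.java test-anonymised-letter.rtf
            System.err.println("usage: TikaPutRequestSketch <path-to-rtf>");
            return;
        }
        byte[] rtf = Files.readAllBytes(Path.of(args[0]));
        // Endpoint taken from the report; adjust if the server runs elsewhere.
        HttpRequest request = buildRequest("http://localhost:9998/tika", rtf);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Print the status and whatever body the server returned.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Running it once against tika-server-1.5.jar and once against a later jar should make it easy to compare the status code and body that each version returns for the same file.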



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
