tika-dev mailing list archives

From "Ian Williams (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
Date Mon, 01 Feb 2016 16:16:39 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Williams updated TIKA-1845:
-------------------------------
    Attachment: example-that-fails.rtf

> Unable to extract content from certain RTFs using tika-server versions since 1.5 
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1845
>                 URL: https://issues.apache.org/jira/browse/TIKA-1845
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.6, 1.9, 1.11
>         Environment: Windows
>            Reporter: Ian Williams
>         Attachments: example-that-fails.rtf
>
>
> I have some patient letters that are RTF documents.  When I extract the text from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing. 
> I wonder whether the error might be related to the following change, which was introduced in 1.6:
>   * Made RTFParser's list handling slightly more robust against corrupt
>     list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> ====================
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
>         at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
>         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:370)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>         at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>         at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NullPointerException
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
>         at org.apache.tika.parser.rtf.RTFEmbObjHandler.extractObj(RTFEmbObjHandler.java:230)
>         at org.apache.tika.parser.rtf.RTFEmbObjHandler.handleCompletedObject(RTFEmbObjHandler.java:198)
>         at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1357)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:456)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>         ... 34 more
> Feb 01, 2016 2:26:25 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
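
For anyone who wants to script the reproduction step rather than run curl by hand, the PUT request in step 1 could be sketched in Python roughly as follows. This is only an illustration, not part of the original report: the URL and the two headers are copied from the curl command, while the function names build_tika_request and extract_text are made up for this sketch, and decoding the response as UTF-8 is an assumption.

```python
import urllib.request

# Endpoint taken from the curl command in the report.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(rtf_bytes, url=TIKA_URL):
    """Build the same PUT request that the curl command sends.

    Hypothetical helper for illustration; not part of Tika.
    """
    return urllib.request.Request(
        url,
        data=rtf_bytes,  # raw file bytes, like curl's --data-binary
        method="PUT",
        headers={
            "Content-Type": "application/rtf",
            "Accept": "text/plain",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send an RTF file to a running tika-server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        # Assumption: the text/plain response body is UTF-8 encoded.
        return response.read().decode("utf-8")
```

With a tika-server instance listening on port 9998, calling extract_text("test-anonymised-letter.rtf") against the 1.5 jar and then against a later jar should make it easy to compare the two behaviours described above.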



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
