tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1953) tika-server NullPointerException while processing rtfs
Date Tue, 19 Apr 2016 10:59:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247552#comment-15247552
] 

Tim Allison edited comment on TIKA-1953 at 4/19/16 10:58 AM:
-------------------------------------------------------------

[~chrismattmann], your instincts are correct.  I'm able to reproduce this in pure Java in
a unit test.  This isn't a tika-server issue or a python issue.  

The problem is that the RTF parser opens and closes a list roughly where the list markers
show up in the file.  If the list markers are corrupt, the RTFParser passes them through as
is.  So, if you're using the ToXMLContentHandler, it'll throw the NPE when there's a closing
</ul> with no opening <ul>.  If you use the html, text or body handler, there's no problem.
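
A toy model of the failure mode described above, in Python since that is what the reporter
used. None of this is Tika code: StrictXMLWriter, TextWriter, replay and CORRUPT_EVENTS are
hypothetical names, and the model only assumes that an XML-serializing handler keeps track
of open elements while a text handler does not.

```python
class StrictXMLWriter(object):
    """Toy stand-in for an XML-serializing handler that tracks open elements."""

    def __init__(self):
        self.open_elements = []
        self.out = []

    def start_element(self, name):
        self.open_elements.append(name)
        self.out.append("<%s>" % name)

    def end_element(self, name):
        # This toy does not check that `name` matches an open element.
        # pop() on an empty list raises IndexError, which plays the
        # role of the NullPointerException in this model.
        self.open_elements.pop()
        self.out.append("</%s>" % name)

    def characters(self, text):
        self.out.append(text)


class TextWriter(object):
    """Toy stand-in for a text handler: markup events are ignored."""

    def __init__(self):
        self.out = []

    def start_element(self, name):
        pass

    def end_element(self, name):
        pass

    def characters(self, text):
        self.out.append(text)


def replay(events, handler):
    """Feed (kind, value) events to a handler, the way a parser would."""
    for kind, value in events:
        if kind == "start":
            handler.start_element(value)
        elif kind == "end":
            handler.end_element(value)
        else:
            handler.characters(value)


# Hypothetical event stream for a corrupt list: a </ul> with no <ul>.
CORRUPT_EVENTS = [
    ("start", "html"), ("start", "body"), ("chars", "item one"),
    ("end", "ul"), ("end", "body"), ("end", "html"),
]
```

Replaying CORRUPT_EVENTS into a TextWriter succeeds, while a StrictXMLWriter fails. In this
toy the stray close consumes one level of nesting, so the error only surfaces at the final
close. That is at least consistent with the trace in the issue, where the NPE is raised
under XHTMLContentHandler.endDocument, but the real handler's internals are not modeled here.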

As [~nicholasc] pointed out in the comment on TIKA-1513, we need to make the RTFParser more
robust to corrupt lists in RTF files.  This will take some time to get right.
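
One shape "more robust" could take, sketched in Python over a toy stream of (kind, value)
events. This is not the fix that went into Tika, only an illustration of balancing open and
close events before they reach a strict handler; the function name balance and the event
tuples are hypothetical.

```python
def balance(events):
    """Yield events with stray closes dropped and open elements closed.

    `events` is an iterable of (kind, value) tuples where kind is
    "start", "end" or "chars". This is a hypothetical model, not a Tika API.
    """
    open_elements = []
    for kind, value in events:
        if kind == "start":
            open_elements.append(value)
            yield kind, value
        elif kind == "end":
            if value not in open_elements:
                # Stray close with no matching open: drop it.
                continue
            # Close anything still open inside the element first.
            while open_elements:
                top = open_elements.pop()
                yield "end", top
                if top == value:
                    break
        else:
            yield kind, value
    # Close whatever the input left open.
    while open_elements:
        yield "end", open_elements.pop()
```

For example, list(balance([("start", "body"), ("end", "ul"), ("end", "body")])) keeps the
body events and drops the unmatched ("end", "ul"). A real fix would more likely live inside
the parser's list handling than in a filter like this.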





> tika-server NullPointerException while processing rtfs
> ------------------------------------------------------
>
>                 Key: TIKA-1953
>                 URL: https://issues.apache.org/jira/browse/TIKA-1953
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.12
>         Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>            Reporter: Ravi
>            Assignee: Tim Allison
>              Labels: newbie, rtf, tika-python, tika-server, xmlContent
>             Fix For: 1.13
>
>         Attachments: officeinstallations3.rtf
>
>
> Looks like the xmlContent=True flag causes "tika.py: Warn: Tika server returned status: 422".
> I start the tika server and then run the following code in the Python kernel at bash:
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get: tika.py: Warn: Tika server returned status: 422
> Note: The parser seems to work fine without the xmlContent=True flag set. I get the right
> output, but setting this flag causes the NullPointerException below.
> Looking at the tika-server log, I get the following dump:
> ------------------------------------------------------------------------------
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@21f0dbb9
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>         at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
>         at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
>         at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>         at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>         at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>         at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>         at org.eclipse.jetty.server.Server.handle(Server.java:370)
>         at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>         at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>         at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>         at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>         at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>         at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>         at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.access$000(ToXMLContentHandler.java:38)
>         at org.apache.tika.sax.ToXMLContentHandler.endElement(ToXMLContentHandler.java:195)
>         at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>         at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>         at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>         at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>         at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>         at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
>         at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:226)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:478)
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:87)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 38 more
> ------------------------------------------------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
