tika-dev mailing list archives

From "Gaurav (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2107) Old MS Word files give error while indexing
Date Wed, 05 Oct 2016 03:57:20 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gaurav updated TIKA-2107:
-------------------------
    Affects Version/s: 1.8
          Description: 
Error while indexing old MS Word files.

A screenshot from Tika 2.0 is attached.

Log from Tika 1.8:

INFO: meta (application/msword)
Oct 04, 2016 6:42:30 PM org.apache.tika.server.resource.TikaResource parse
WARNING: meta: Text extraction failed
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@7260e439
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238)
	at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:134)
	at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
	at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
	at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
	at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
	at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
	at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
	at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
	at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:370)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x04094031002DA5DB, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:167)
	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:291)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:166)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	... 38 more

  was:error while indexing old MS word files


> Old MS Word files give error while indexing
> -------------------------------------------
>
>                 Key: TIKA-2107
>                 URL: https://issues.apache.org/jira/browse/TIKA-2107
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-batch
>    Affects Versions: 1.8, 2.0
>         Environment: ubuntu
>            Reporter: Gaurav
>              Labels: patch
>         Attachments: plen281.doc
>
>
> Error while indexing old MS Word files.
> A screenshot from Tika 2.0 is attached.
> Log from Tika 1.8:
> INFO: meta (application/msword)
> Oct 04, 2016 6:42:30 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: meta: Text extraction failed
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@7260e439
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
> 	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238)
> 	at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:134)
> 	at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:67)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
> 	at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
> 	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> 	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> 	at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> 	at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> 	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> 	at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> 	at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> 	at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> 	at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> 	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> 	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> 	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> 	at org.eclipse.jetty.server.Server.handle(Server.java:370)
> 	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> 	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
> 	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
> 	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
> 	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> 	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
> 	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
> 	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
> 	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> 	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x04094031002DA5DB, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:167)
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
> 	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:291)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:166)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> 	... 38 more
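
The "Invalid header signature" message above compares two 64-bit values. The sketch below shows one way to turn those values back into the first eight bytes of the file, so they can be compared with a hex dump of the attachment. It assumes the signature is decoded as a little-endian long, which matches the well-known OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1. The class and method names (Ole2SignatureCheck, toFileBytes, readSignature, isOle2, hex) are hypothetical and are not part of the POI or Tika API; only the JDK is used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI or Tika.
final class Ole2SignatureCheck {
    // Expected value copied from the log message.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Returns the 8 on-disk bytes that decode to the given long,
    // assuming little-endian byte order.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Decodes the first 8 bytes of a file header as a little-endian long.
    static long readSignature(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        return ByteBuffer.wrap(header, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    // True when the header starts with the OLE2 signature.
    static boolean isOle2(byte[] header) {
        return header.length >= 8 && readSignature(header) == OLE2_SIGNATURE;
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long readFromLog = 0x04094031002DA5DBL; // "read" value from the log
        System.out.println("expected: " + hex(toFileBytes(OLE2_SIGNATURE)));
        System.out.println("read:     " + hex(toFileBytes(readFromLog)));
        System.out.println("is OLE2:  " + isOle2(toFileBytes(readFromLog)));
    }
}
```

Under that byte-order assumption, the value read from plen281.doc corresponds to a file starting with DB A5 2D 00 31 40 09 04, so the file is not an OLE2 container even though it was detected as application/msword.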



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
