lucene-solr-user mailing list archives

From Gytis Mikuciunas <gyt...@gmail.com>
Subject Re: Solr 6.4. Can't index MS Visio vsdx files
Date Mon, 03 Jul 2017 13:14:35 GMT
Hi,

So I'm back from my long vacation :)

I'm trying to bring up a fresh Solr 6.6 standalone instance on a Windows
2012 R2 server.

Replaced:

poi-*3.15-beta1 ---> poi-*3.16
tika-*1.13 ---> tika-*1.15
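
When swapping jars by hand it is easy to leave an old version sitting next
to the new one. A minimal sketch of one way to spot that, assuming the jars
follow the usual artifact-version.jar naming. The function names and the
directory passed in are hypothetical and not part of Solr, POI or Tika:

```python
import re
from collections import defaultdict
from pathlib import Path

# Splits "poi-ooxml-3.16.jar" into artifact "poi-ooxml" and version "3.16".
# Assumption: the version starts at the first "-<digit>" in the file name.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def versions_by_artifact(lib_dir):
    """Map each artifact name to the sorted list of versions found in lib_dir."""
    found = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_NAME.match(jar.name)
        if match:
            found[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(versions) for artifact, versions in found.items()}


def duplicated_artifacts(lib_dir):
    """Return only the artifacts that are present in more than one version."""
    return {a: v for a, v in versions_by_artifact(lib_dir).items() if len(v) > 1}
```

Pointing duplicated_artifacts() at the directory that holds the extraction
jars (its location depends on the install) would list any artifact, such as
poi or tika-core, that is still present in two versions.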


With the POI and Tika jars that come out of the box, a single txt file
indexes without errors. After the replacement, indexing the same file gives:


SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/v20170703xxx/update/extract. Reason:
<pre>    Server Error</pre></p><h3>Caused
by:</h3><pre>java.lang.NoClassDefFoundError:
org/apache/commons/compress/archivers/ArchiveStreamProvider
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(Unknown Source)
        at java.security.SecureClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.access$100(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at
org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
        at
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
        at
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
        at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
        at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
        at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException:
org.apache.commons.compress.archivers.ArchiveStreamProvider
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 51 more
</pre>
<h3>Caused by:</h3><pre>java.lang.ClassNotFoundException:
org.apache.commons.compress.archivers.ArchiveStreamProvider
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(Unknown Source)
        at java.security.SecureClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.access$100(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at
org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
        at
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
        at
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
        at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
        at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
        at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Unknown Source)
</pre>

</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 500 for URL:
http://localhost:80/solr/v20170703xxx/update/extract?resource.name=xxxxxx
1 files indexed.
COMMITting Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350
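
The NoClassDefFoundError above says that
org.apache.commons.compress.archivers.ArchiveStreamProvider could not be
loaded when Tika's ZipContainerDetector asked for it, so presumably none of
the jars on the classpath contains that class. A rough sketch of one way to
check which jars, if any, provide a given class. The function names are
hypothetical and the directory to scan is an assumption about the install:

```python
import zipfile
from pathlib import Path


def class_entry(class_name):
    """Convert a dotted class name into the entry path used inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(lib_dir, class_name):
    """Return the names of jars under lib_dir that contain class_name."""
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not readable zip archives.
            continue
    return hits
```

An empty result for the class named in the stack trace would suggest that
the commons-compress jar in place is older than what the new Tika jars
expect, and that it needs to be upgraded along with POI and Tika.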



On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the upgrade
> in Solr to Tika 1.15.  Please chime in on that issue.
>
> You should be able to swap in POI 3.16 (final) wherever you had earlier
> versions.  Make sure to include: poi, poi-scratchpad, poi-ooxml,
> poi-ooxml-schemas.  And make sure to include tika-parsers (1.15),
> tika-core, tika-java7, tika-xmp.  Also, include commons-collections4 (which
> is new in POI w Tika 1.14).  (I assume you have already added curvesapi?)
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> Sent: Saturday, June 3, 2017 5:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Great Tim.
>
> What do I need to do to integrate it on my current installation?
>
>
> On May 31, 2017 16:24, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
> Apache Tika 1.15 is now available.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:tallison@mitre.org]
> Sent: Tuesday, May 9, 2017 7:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Probably better to ask on the Tika list.  We'll push the release asap
> after PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for
> PDFBox this Friday.  Tika will probably have an RC by Monday 5/15, with the
> release happening later in the week...That's if there are no surprises...[2]
>
> You can get a recent build if you'd like to test [1].
>
> Best,
>
>           Tim
>
> [1] https://builds.apache.org/view/Tika/job/Tika-trunk/
> [2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and
> 2.0.6-SNAPSHOT on ~500k pdfs, see:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> Sent: Tuesday, May 9, 2017 7:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Is there any news regarding Tika 1.15? Maybe it's already available for
> download somewhere?
>
> G.
>
> On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <tallison@mitre.org>
> wrote:
>
> > The release candidate for POI was just cut...unfortunately, I think
> > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> > opening that!
> >
> > That'll be done within a week unless there are surprises.  Once that's
> > out, I have to update a few things, but I'd think we'd have a
> > candidate for Tika a week later, then a week for release.
> >
> > You can get nightly builds here: https://builds.apache.org/
> >
> > Please ask on the POI or Tika users lists for how to get the
> > latest/latest running, and thank you, again, for opening the issue on
> > POI's Bugzilla.
> >
> > Best,
> >
> >            Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> > Sent: Wednesday, April 12, 2017 1:00 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > When will 1.15 be released? Maybe you have a beta version that I
> > could test :)
> >
> > SAX sounds interesting, and from what I found on Google it could
> > solve my issues.
> >
> > On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> > <tallison@mitre.org>
> > wrote:
> >
> > > It depends.  We've been trying to make parsers more, erm, flexible,
> > > but there are some problems from which we cannot recover.
> > >
> > > Tl;dr there isn't a short answer.  :(
> > >
> > > My sense is that DIH/ExtractingRequestHandler is intended to get
> > > people up and running with Solr easily but it is not really a great
> > > idea for production.  See Erick's gem:
> > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > >
> > > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > > the ingesting process to crash.  At most, it should fail at the file
> > > level and not cause greater havoc.  In practice, if you're
> > > processing millions of files from the wild, you'll run into bad
> > > behavior and need to defend against permanent hangs, oom, memory leaks.
> > >
> > > Also, at the least, if there's an exception with an embedded file,
> > > Tika should catch it and keep going with the rest of the file.  If
> > > this doesn't happen let us know!  We are aware that some types of
> > > embedded file stream problems were causing parse failures on the
> > > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't
> > > > let them percolate up through the parent file (they're reported in
> > > > the metadata though).
> > >
> > > Specifically for your stack traces:
> > >
> > > For your initial problem with the missing class exceptions -- I
> > > thought we used to catch those in docx and log them.  I haven't been
> > > able to track this down, though.  I can look more if you have a need.
> > >
> > > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > > name 'PolylineTo' ", this problem might go away if we implemented a
> > > pure SAX parser for vsdx.  We just did this for docx and pptx
> > > (coming in 1.15) and these are more robust to variation because they
> > > aren't requiring a match with the ooxml schema.  I haven't looked
> > > much at vsdx, but that _might_ help.
> > >
> > > For "TODO Support v5 Pointers", this isn't supported and would
> > > require contributions.  However, I agree that POI shouldn't throw a
> > > Runtime exception.  Perhaps open an issue in POI, or maybe we should
> > > catch this special example at the Tika level?
> > >
> > > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI
> > > team _might_ be able to modify the parser to ignore a stream if
> > > there's an exception, but that's often a sign that something needs
> > > to be fixed with the parser.  In short, the solution will come from
> > > > POI.
> > >
> > > Best,
> > >
> > >              Tim
> > >
> > > -----Original Message-----
> > > From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> > > Sent: Tuesday, April 11, 2017 1:56 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
> > >
> > > Thanks for your responses.
> > > > Is there any possibility to ignore parsing errors and continue
> > > > indexing? Right now Solr/Tika stops parsing the whole document if it
> > > > hits any exception.
> > >
> > > On Apr 11, 2017 19:51, "Allison, Timothy B." <tallison@mitre.org>
> > > > wrote:
> > >
> > > > You might want to drop a note to the dev or user's list on Apache
> > > > > POI.
> > > >
> > > > I'm not extremely familiar with the vsd(x) portion of our code base.
> > > >
> > > > The first item ("PolylineTo") may be caused by a mismatch btwn
> > > > your doc and the ooxml spec.
> > > >
> > > > The second item appears to be an unsupported feature.
> > > >
> > > > The third item may be an area for improvement within our
> > > > codebase...I can't tell just from the stacktrace.
> > > >
> > > > You'll probably get more helpful answers over on POI.  Sorry, I
> > > > can't help with this...
> > > >
> > > > Best,
> > > >
> > > >            Tim
> > > >
> > > > P.S.
> > > > >  3.1. ooxml-schemas-1.3.jar instead of
> > > > > poi-ooxml-schemas-3.15.jar
> > > >
> > > > > You shouldn't need both. ooxml-schemas-1.3.jar should be a
> > > > > superset of poi-ooxml-schemas-3.15.jar
> > > >
> > > >
> > > >
> > >
> >
>
