manifoldcf-user mailing list archives

From Ronny Heylen <securaqbere...@gmail.com>
Subject Re: Error in Manifoldcf, what's the first step?
Date Wed, 06 Nov 2013 12:13:25 GMT
Thanks, Adrian, for the link; it has solved our problem with the JPG files. We hope
that the fix will be included in Solr 5.0, as announced in the bug report.
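
For anyone who hits the same NoClassDefFoundError quoted below, a quick way to
confirm whether the Adobe XMP classes that Tika's JPEG parser needs are visible
to Solr is a check along these lines. This is only an illustrative sketch: the
class name in the Class.forName call comes from the stack trace, while the
CheckXmp class itself and the run command are placeholders, not part of the thread.

    // Illustrative sketch: the JPEG failures below boil down to a missing
    // com.adobe.xmp.XMPException class. Running this with the same classpath
    // as the Solr webapp shows whether the XMP core jar is actually present.
    public class CheckXmp {
        public static void main(String[] args) {
            try {
                Class.forName("com.adobe.xmp.XMPException");
                System.out.println("XMP core classes found on the classpath");
            } catch (ClassNotFoundException e) {
                System.out.println("XMP core classes missing; see SOLR-4645 for the workaround");
            }
        }
    }

Compile it and run it against the jars the Solr webapp loads, for example
java -cp ".:<path to Solr's WEB-INF/lib>/*" CheckXmp; the exact lib location
depends on how Solr is deployed, so treat that path as a placeholder.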


On Tue, Oct 29, 2013 at 4:50 PM, Adrian Conlon <Adrian.Conlon@arup.com> wrote:

>  Looks like a Solr/Tika issue with JPEG file metadata extraction:
>
>
>
> https://issues.apache.org/jira/browse/SOLR-4645
>
>
>
> The JIRA issue contains a workaround which looks reasonable.  I should
> note that I haven’t tried this…
>
>
>
> Adrian
>
>
>
> *From:* Ronny Heylen [mailto:securaqbereusr@gmail.com]
> *Sent:* 29 October 2013 15:35
> *To:* Karl Wright; Adrian Conlon
> *Cc:* user@manifoldcf.apache.org
> *Subject:* Re: Error in Manifoldcf, what's the first step?
>
>
>
> The help on file size was great; now we still have the problem with small
> JPG files.
> solr.log contains:
>
> ERROR - 2013-10-29 15:47:19.815; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>     at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>     at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>     at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>     at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>     at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>     at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
>     ... 16 more
> Caused by: java.lang.ClassNotFoundException: com.adobe.xmp.XMPException
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(Unknown Source)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     ... 30 more
>
>
>
> On Tue, Oct 29, 2013 at 1:25 PM, Ronny Heylen <securaqbereusr@gmail.com>
> wrote:
>
>   That was a very good suggestion!
>
> Setting the maximum size has solved the problem for the first subfolder we
> are testing on.
>
> Now we will retry on the full drive and let you know the result.
>
>
>
> On Tue, Oct 29, 2013 at 12:12 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>   Based on the error message, Adrian is correct and this is once again a
> Solr-side problem.  Since Solr puts all documents into memory, my guess is
> that you are attempting to index some very large documents and those are
> causing Solr to run out of memory.  Either exclude these from the crawl or
> set a reasonable maximum length.
>
> Karl
>
> Sent from my Windows Phone
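
As a concrete illustration of Karl's suggestion to exclude oversized documents,
the share can be scanned for large files before re-running the crawl. This is a
rough sketch, not something from the thread; the 100 MB threshold and the
FindLargeFiles class name are arbitrary placeholders.

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    // Walks the crawled directory tree and prints files above a size
    // threshold, so they can be excluded from the job or used to choose a
    // sensible maximum document length.
    public class FindLargeFiles {
        static final long THRESHOLD = 100L * 1024 * 1024; // 100 MB, arbitrary

        public static void main(String[] args) throws IOException {
            Path root = Paths.get(args.length > 0 ? args[0] : ".");
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    if (attrs.size() > THRESHOLD) {
                        System.out.println(attrs.size() + " bytes  " + file);
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    return FileVisitResult.CONTINUE; // skip unreadable entries
                }
            });
        }
    }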
>   ------------------------------
>
> *From: *Ronny Heylen
> *Sent: *10/29/2013 6:52 AM
>
>
> *To: *user@manifoldcf.apache.org
> *Subject: *Error in Manifoldcf, what's the first step?
>
> Hi,
>
>
> Solr is 4.4, ManifoldCF is 1.3.
>
>
> We are indexing a shared Windows network drive, filtering on *.doc*,
> *.xls*, *.pdf ... with about 650,000 files to index, giving a Solr index
> 35 GB in size.
>
>
> The result is great, except that the ManifoldCF job crashes before the end.
>
> Note that:
> - ignoreTikaException is true in solrconfig.xml (otherwise the ManifoldCF
> job stops very early).
>
> - Tomcat has been given 24 GB of memory (it uses 15 GB).
>
> - there are 8 cores.
>
>
> Message in http://localhost:8080/mcf-crawler-ui/showjobstatus.jsp is:
> Error: Repeated service interruptions - failure processing document:
> Server at http://localhost:8080/solr/collection1 returned non ok
> status:500, message:Internal Server Error
>
> Then, instead of indexing the full drive in one job, we have defined one
> job for each subfolder.
>
> Almost all "subfolder" jobs end successfully; only for 2 or 3 do we receive
> the same message, and for 2 or 3 other ones a different message:
>
> Error: Repeated service interruptions - failure processing document: Read
> timed out
>
> If we try to go further (defining one job for each subfolder of a
> subfolder in error), the same happens: success for almost all subfolders
> except 1 or 2.
>
> What is the first step to take to solve this problem?
>
> Thanks.
>
>
>
>
>
