lucene-solr-user mailing list archives

From Brandon Waterloo <Brandon.Water...@matrix.msu.edu>
Subject Problems indexing very large set of documents
Date Mon, 04 Apr 2011 18:00:53 GMT
Hey everybody,

I've been running into some issues indexing a very large set of documents.  There are about
4000 PDF files, ranging in size from 10KB to 160MB.  Obviously this is a big task for Solr.
I have a PHP script that iterates over the directory and uses PHP cURL to send each file to
Solr for indexing.  For now, commit is set to false to speed up the indexing, and I'm assuming
that Solr is auto-committing as necessary.  I'm using the default solrconfig.xml file included
in apache-solr-1.4.1\example\solr\conf.  Once all the documents have been sent, the PHP
script tells Solr to commit.
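
For reference, the flow described above can be sketched roughly as follows. This is a minimal
Python sketch, not the actual PHP script: the base URL, the use of the file name as the
document id, and the helper names (`extract_url`, `index_directory`) are all assumptions made
for illustration. It posts each PDF to the extracting handler with commit=false, collects
failures instead of stopping, and sends a single commit at the end.

```python
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

# Assumed location of the example Solr instance; adjust as needed.
SOLR_URL = "http://localhost:8983/solr"


def extract_url(base, doc_id, commit=False):
    """Build the /update/extract URL for one document (hypothetical helper)."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return base + "/update/extract?" + urllib.parse.urlencode(params)


def index_directory(directory, base=SOLR_URL):
    """Post every PDF in `directory` without committing, then commit once.

    Returns a list of (file name, error text) pairs for files Solr rejected.
    """
    failed = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        request = urllib.request.Request(
            extract_url(base, pdf.name),
            data=pdf.read_bytes(),  # reads the whole file into memory
            headers={"Content-Type": "application/pdf"},
        )
        try:
            urllib.request.urlopen(request).close()
        except urllib.error.URLError as exc:  # also covers HTTP error statuses
            failed.append((pdf.name, str(exc)))

    # One explicit commit after all files have been sent.
    commit = urllib.request.Request(
        base + "/update",
        data=b"<commit/>",
        headers={"Content-Type": "text/xml"},
    )
    urllib.request.urlopen(commit).close()
    return failed
```

Collecting the failures in a list makes it possible to see afterwards exactly which files
were rejected, rather than only how many.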

The main problem is that after a few thousand documents (around 2000 last time I tried), nearly
every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198:
Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt=''
org.pdfbox.io.PushBackInputStream@b19bfc
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 25 more
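
When reading a nested trace like this one, the innermost "Caused by" line is usually the most
informative. A minimal Python sketch for pulling it out of a saved log follows; the helper
name `root_cause` is hypothetical and the sample is an abbreviated copy of the trace above.

```python
def root_cause(trace: str) -> str:
    """Return the text of the last 'Caused by:' line, or '' if there is none."""
    prefix = "Caused by: "
    causes = [
        line[len(prefix):].strip()
        for line in trace.splitlines()
        if line.startswith(prefix)
    ]
    return causes[-1] if causes else ""


# Abbreviated copy of the trace above (most stack frames trimmed).
SAMPLE_TRACE = """\
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198:
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt=''
org.pdfbox.io.PushBackInputStream@b19bfc
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
"""

print(root_cause(SAMPLE_TRACE))
```

Here the innermost cause is the java.io.IOException raised inside the PDFBox parser
(PDFParser.parseObject), so that is the frame to look at first for any given failing file.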

As far as I know there's nothing special about these documents, so I'm wondering whether Solr
is failing to autocommit properly.  What would be appropriate autocommit settings in
solrconfig.xml for this particular application?  I'd like it to autocommit as soon as it needs
to, but no more often than that, for the sake of efficiency.  It already takes long enough to
index 4000 documents, and there's no reason to make it take longer.  Thanks for your help!

~Brandon Waterloo
