lucene-solr-user mailing list archives

From Anuj Kumar <anujs...@gmail.com>
Subject Re: Problems indexing very large set of documents
Date Mon, 04 Apr 2011 18:48:18 GMT
Can you tell from the log messages which file it fails on? It looks like
Tika is unable to parse one of your PDF files. We need to hunt that one
down.
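
If the Solr log does not name the file, one way to find it is to record the result of each request on the client side. Here is a minimal sketch in Python rather than your PHP script. It assumes the stock example server at http://localhost:8983/solr with the extracting handler mapped at /update/extract, and a unique key field called "id"; the function names are made up for illustration.

```python
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: stock Solr example server, extracting handler at /update/extract.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def post_pdf(path, url=SOLR_EXTRACT_URL):
    """Send one PDF to Solr and return (ok, detail)."""
    # Assumption: the schema's unique key field is "id".
    params = urllib.parse.urlencode({"literal.id": path.name, "commit": "false"})
    request = urllib.request.Request(
        f"{url}?{params}",
        data=path.read_bytes(),  # supplying a body makes this a POST
        headers={"Content-Type": "application/pdf"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return True, str(response.status)
    except urllib.error.HTTPError as err:
        # Solr answers with an HTTP error status when extraction fails.
        return False, f"HTTP {err.code}"
    except urllib.error.URLError as err:
        # Server unreachable, connection refused, and so on.
        return False, str(err.reason)


def find_failing_files(directory, post=post_pdf):
    """Post every *.pdf in directory; return [(file name, detail)] for failures."""
    failures = []
    for path in sorted(Path(directory).glob("*.pdf")):
        ok, detail = post(path)
        if not ok:
            failures.append((path.name, detail))
    return failures
```

Calling find_failing_files on the PDF directory returns the names of the files Solr rejected, which can then be opened individually to see whether they are damaged.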

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
> > Hey everybody,
> >
> > I've been running into some issues indexing a very large set of documents.
> > There are about 4000 PDF files, ranging in size from 10KB to 160MB.
> > Obviously this is a big task for Solr. I have a PHP script that iterates
> > over the directory and uses PHP cURL to query Solr to index the files. For
> > now, commit is set to false to speed up the indexing, and I'm assuming that
> > Solr should be auto-committing as necessary. I'm using the default
> > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once
> > all the documents have been finished, the PHP script queries Solr to commit.
> >
> > The main problem is that after a few thousand documents (around 2000 last
> > time I tried), nearly every document begins causing Java exceptions in Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >        at org.mortbay.jetty.Server.handle(Server.java:285)
> >        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >        ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >        ... 25 more
> >
> > As far as I know there's nothing special about these documents so I'm
> > wondering if it's not properly autocommitting. What would be appropriate
> > settings in solrconfig.xml for this particular application? I'd like it to
> > autocommit as soon as it needs to but no more often than that for the sake
> > of efficiency. Obviously it takes long enough to index 4000 documents and
> > there's no reason to make it take longer. Thanks for your help!
> >
> > ~Brandon Waterloo
> >
>
