lucene-solr-user mailing list archives

From Brandon Waterloo <Brandon.Water...@matrix.msu.edu>
Subject RE: Problems indexing very large set of documents
Date Fri, 08 Apr 2011 16:44:50 GMT
I think I've finally found the problem.  The files that work are PDF version 1.6.  The files
that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF
1.6.
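
For anyone who wants to sort a directory the same way, here is a minimal sketch that groups PDFs by the version declared in the file header (the `%PDF-1.x` marker at the start of the file). The function names and the 1024-byte read window are illustrative assumptions, not anything from Solr or Tika. A document catalog can also carry a `/Version` entry that overrides the header, so treat the result as a first pass only.

```python
import re
from pathlib import Path
from typing import Dict, List, Optional

# A PDF normally begins with a header such as "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(path) -> Optional[str]:
    """Return the version declared in the PDF header, or None if no header is found."""
    with open(path, "rb") as fh:
        head = fh.read(1024)  # assumption: the header sits within the first 1024 bytes
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def group_by_version(directory) -> Dict[Optional[str], List[str]]:
    """Map each header version to the sorted list of *.pdf filenames declaring it."""
    groups: Dict[Optional[str], List[str]] = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        groups.setdefault(pdf_version(pdf), []).append(pdf.name)
    return groups
```

Calling `group_by_version` on the document directory gives a quick check of whether the failing files really do share one header version before converting anything.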

Thanks everyone!

~Brandon Waterloo
________________________________
From: Ezequiel Calderara [ezechico@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version...

See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Brandon.Waterloo@matrix.msu.edu> wrote:
A second test has revealed that it is something to do with the contents, and not the literal
filenames, of the second set of files.  I renamed one of the second-format files and tested
it and Solr still failed.  However, the problem still only applies to those files of the second
naming format.
________________________________________
From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, it appears Solr
is tripping up over the filename.  These are strictly examples, but Solr handles this filename
fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As
there are about 1700 files whose filenames are similar to the second format, it is simply not
possible to change their filenames.  In addition, they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until
the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
       ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

..the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss



--
______
Ezequiel.

Http://www.ironicnet.com
