lucene-solr-user mailing list archives

From Lance Norskog <>
Subject Re: Indexing Best Practice
Date Tue, 12 Apr 2011 04:01:12 GMT
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource.
This means that you can read the database and PDFs separately. You
could index all of the PDF content in one DIH script. Then, when
there's a database update, a separate DIH script reads the old
document from Solr, pulls the already-extracted PDF text from it, and
re-indexes the whole thing with the new database values. This cuts
out the need to re-parse the PDF.
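
A minimal sketch of that merge step, in plain Python rather than a DIH
config. The function name and the field names ("id", "pdf_text") are
hypothetical, and it assumes the PDF text field is stored in Solr
(stored="true"), since a field that is only indexed cannot be read back.

```python
def merge_for_reindex(stored_doc, db_row, text_field="pdf_text"):
    """Build a replacement Solr document from an existing one plus a fresh DB row.

    stored_doc: the document as previously returned by Solr (a dict).
    db_row:     the current database values for the same record (a dict).
    text_field: hypothetical name of the stored field holding the extracted PDF text.
    """
    if text_field not in stored_doc:
        # Without the stored text the PDF would have to be parsed again.
        raise KeyError("stored document has no %r field" % text_field)

    # Start from the fresh DB values so stale permissions are dropped.
    merged = dict(db_row)
    # Keep the same unique key so the new document overwrites the old one.
    merged["id"] = stored_doc["id"]
    # Carry the extracted text over unchanged instead of re-parsing the PDF.
    merged[text_field] = stored_doc[text_field]
    return merged
```

The merged dict would then be sent to Solr as a normal add, which replaces
the old document that has the same unique key.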


On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
<> wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing.  My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
> I am a newbie so perhaps there are more sophisticated approaches.
> Hope that helps.
> Shaun
> On 11 April 2011 07:20, Darx Oman <> wrote:
>> Hi guys
>> I'm wondering how to best configure solr to fulfill my requirements.
>> I'm indexing data from 2 data sources:
>> 1- Database
>> 2- PDF files (password encrypted)
>> Every file has related information stored in the database.  Both the file
>> content and the related database fields must be indexed as one document in
>> solr.  Among the DB data is *per-user* permissions for every document.
>> The file contents almost never change. The DB data, on the other hand,
>> and especially the permissions, change very frequently, which requires me
>> to re-index everything for every modified document.
>> My problem is the process of decrypting the PDF files before re-indexing
>> them, which takes too much time for a large number of documents; a full
>> re-index could take days.
>> What I'm trying to accomplish is eliminating the need to re-index the PDF
>> content if not changed even if the DB data changed.  I know this is not
>> possible in solr, because solr doesn't update documents.
>> So how to best accomplish this:
>> Can I use 2 indexes one for PDF contents and the other for DB data and have
>> a common id field for both as a link between them, *and results are treated
>> as one Document*?
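
The timestamp comparison Shaun describes above could look roughly like
this. It is only a sketch: the function name and the one-text-file-per-PDF
layout are assumptions, not details from his setup.

```python
import os


def needs_extraction(pdf_path, text_path):
    """Return True if the PDF must be (re)processed into its text file.

    pdf_path:  path to the source PDF.
    text_path: path to the text file holding the previously extracted content.
    """
    # No text file yet: the PDF has never been processed.
    if not os.path.exists(text_path):
        return True
    # Re-process only when the PDF was modified after the text file was written.
    return os.path.getmtime(pdf_path) > os.path.getmtime(text_path)
```

Only PDFs for which this returns True need to go through the slow
decrypt-and-extract step; the rest can be indexed from their text files.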

Lance Norskog
