lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Indexing Best Practice
Date Tue, 12 Apr 2011 04:01:12 GMT
SOLR-1499 is a plug-in for the DIH (DataImportHandler) that uses Solr as
a DataSource. This means that you can read the database and the PDFs
separately. You could index all of the PDF content in one DIH script.
Then, when there's a database update, a separate DIH script reads the
old document from Solr, pulls the already-stripped PDF text from it, and
re-indexes the whole thing with the new database values. This would cut
out the need to reparse the PDF.
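
As a rough illustration of that second step, here is a minimal Python
sketch of the merge. It is not SOLR-1499 itself, which is configured in
the DIH, and it makes no network calls. The field names (id, pdf_text,
title, permissions) and both helper functions are hypothetical. It also
assumes the PDF text field is stored in Solr, since only stored fields
can be read back from the index.

```python
import json

# Hypothetical field names; adjust to the actual schema.
ID_FIELD = "id"
PDF_TEXT_FIELD = "pdf_text"


def merge_for_reindex(stored_doc, db_row):
    """Combine the PDF text already stored in Solr with a fresh database row."""
    if str(stored_doc[ID_FIELD]) != str(db_row[ID_FIELD]):
        raise ValueError("stored document and database row have different ids")
    merged = dict(db_row)  # fresh DB fields, including the permissions
    # Reuse the previously extracted text so the PDF is not decrypted again.
    merged[PDF_TEXT_FIELD] = stored_doc[PDF_TEXT_FIELD]
    return merged


def build_add_command(doc):
    """Serialize a document as a JSON 'add' command for Solr's update handler."""
    return json.dumps({"add": {"doc": doc}}, sort_keys=True)


# Example: the permissions changed in the database, the PDF did not.
stored = {"id": "42", "pdf_text": "decrypted text from the PDF",
          "permissions": ["alice"]}
row = {"id": "42", "title": "Contract", "permissions": ["alice", "bob"]}

merged = merge_for_reindex(stored, row)
payload = build_add_command(merged)
```

The payload would then be posted to the update handler and committed.
The whole document is still replaced, but only the database values are
recomputed.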

Lance

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
<campbell.shaun@gmail.com> wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing.  My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
>
> I am a newbie so perhaps there are more sophisticated approaches.
>
> Hope that helps.
> Shaun
>
> On 11 April 2011 07:20, Darx Oman <darxoman@gmail.com> wrote:
>
>> Hi guys
>>
>> I'm wondering how best to configure Solr to fulfill my requirements.
>>
>> I'm indexing data from 2 data sources:
>> 1- Database
>> 2- PDF files (password encrypted)
>>
>> Every file has related information stored in the database.  Both the file
>> content and the related database fields must be indexed as one document in
>> Solr.  Among the DB data are *per-user* permissions for every document.
>>
>> The file contents almost never change.  The DB data, on the other hand,
>> and especially the permissions, change very frequently, which requires
>> me to re-index everything for every modified document.
>>
>> My problem is the process of decrypting the PDF files before re-indexing
>> them, which takes too much time for a large number of documents; a full
>> re-index could take days.
>>
>> What I'm trying to accomplish is eliminating the need to re-index the PDF
>> content if it has not changed, even if the DB data has.  I know this is
>> not possible in Solr, because Solr doesn't update documents.
>>
>> So how can I best accomplish this?
>>
>> Can I use 2 indexes, one for PDF contents and the other for DB data, with
>> a common id field for both as a link between them, *so that results are
>> treated as one Document*?
>>
>



-- 
Lance Norskog
goksron@gmail.com
