lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Sat, 27 Jul 2013 06:08:07 GMT
On a related note, inspired by what you said, Shawn, an auto-increment id
seems perfect here. Yet I found no such support in Solr. A UUID only
guarantees uniqueness, not ordering.
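
One possible workaround, sketched below, is to assign increasing ids on the
client before indexing. This is only an illustration under stated
assumptions: make_id_generator is a hypothetical helper, not a Solr feature,
it assumes a single writer process, and it does not persist the counter
across restarts. The ids are zero-padded because a string id field sorts
lexicographically, so "10" would otherwise sort before "9".

```python
import itertools
from typing import Callable


def make_id_generator(start: int = 1, width: int = 12) -> Callable[[], str]:
    """Return a function that yields zero-padded, strictly increasing ids.

    Hypothetical helper: the counter lives in memory only, so a real setup
    would need to restore `start` from the highest id already indexed.
    """
    counter = itertools.count(start)

    def next_id() -> str:
        # Zero-padding keeps string order identical to numeric order.
        return f"{next(counter):0{width}d}"

    return next_id
```

Each call to the returned function gives the next id, which can be stored in
the uniqueKey field of the document being added.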


On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang <smartagent@gmail.com> wrote:

> Thanks for your kind reply, Shawn.
>
> On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey <solr@elyograg.org> wrote:
>
>> On 7/26/2013 11:02 PM, Joe Zhang wrote:
>> > I have an ever-growing solr repository, and I need to process every single
>> > document to extract statistics. What would be a reasonable process that
>> > satisfies the following properties:
>> >
>> > - Exhaustive: I have to traverse every single document
>> > - Incremental: in other words, it has to allow me to divide and conquer ---
>> > if I have processed the first 20k docs, next time I can start with 20001.
>>
>> If your index isn't very big, a *:* query with rows and start parameters
>> is perfectly acceptable.  Performance is terrible for this method when
>> the index gets huge, though.
>>
>
> ==> Essentially we are doing pagination here, right? If performance is
> not the concern, given that the index is dynamic, does the order of
> entries remain stable over time?
>
>
>
>> If "id" is your uniqueKey field, here's how you can do it.  If that's
>> not your uniqueKey field, substitute your uniqueKey field for id.  This
>> method doesn't work properly if you don't use a field with values that
>> are guaranteed to be unique.
>>
>> For the first query, send a query with these parameters, where NNNNNN is
>> the number of docs you want to retrieve at once:
>> q=*:*&rows=NNNNNN&sort=id asc
>>
>> For each subsequent query, use the following parameters, where XXX is
>> the highest id value seen in the previous query:
>> q={XXX TO *}&rows=NNNNNN&sort=id asc
>>
> ==> This approach seems to require that the id field is numerical, right?
> I have a text-based id that is unique.
>
> ==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't the
> query be matched against the default search field, which could be
> "content", for example? How would that do the job?
>
>
>> As soon as you see a numFound value less than NNNNNN, you will know that
>> there's no more data.
>>
>> Generally speaking, you'd want to avoid updating the index while doing
>> these queries.  If you never replace existing documents and you can
>> guarantee that the value in the uniqueKey field for new documents will
>> always be higher than any previous value, then you could continue
>> updating the index.  A database autoincrement field would qualify for
>> that condition.
>>
>> Thanks,
>> Shawn
>>
>>
>
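
The paging scheme Shawn describes in the quoted text can be sketched roughly
as follows. This is a sketch under assumptions, not a tested client:
fetch_page is a hypothetical callable that sends the given parameters to
Solr's select handler and returns the list of documents; the range is
prefixed with the field name (id:) so that it does not fall through to the
default search field; and the previous id is wrapped in double quotes on the
assumption that ids may contain spaces or query syntax characters. A range
query on a string field compares values lexicographically, so a unique
text-based id should work as well as a numeric one. The loop stops when a
page comes back with fewer than `rows` documents, which corresponds to
numFound dropping below the page size.

```python
from typing import Callable, Dict, Iterator, List, Optional

Doc = Dict[str, str]


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def page_params(last_id: Optional[str], rows: int, key: str = "id") -> Dict[str, str]:
    """Build query parameters for the first page or for a follow-up page."""
    if last_id is None:
        q = "*:*"  # first query: match everything
    else:
        # Exclusive lower bound: only ids strictly greater than the last one seen.
        q = f"{key}:{{{quote_term(last_id)} TO *}}"
    return {"q": q, "rows": str(rows), "sort": f"{key} asc"}


def iterate_all(
    fetch_page: Callable[[Dict[str, str]], List[Doc]],
    rows: int,
    key: str = "id",
) -> Iterator[Doc]:
    """Yield every document, one page at a time, in uniqueKey order."""
    last_id: Optional[str] = None
    while True:
        docs = fetch_page(page_params(last_id, rows, key))
        yield from docs
        if len(docs) < rows:
            return  # short (or empty) page: nothing left to read
        last_id = docs[-1][key]
```

To resume later, it should be enough to remember the last id processed and
start again from page_params(last_id, rows), subject to Shawn's caveat about
updating the index between queries.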
