lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Sat, 27 Jul 2013 05:50:41 GMT
Thanks for your kind reply, Shawn.

On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey <solr@elyograg.org> wrote:

> On 7/26/2013 11:02 PM, Joe Zhang wrote:
> > I have an ever-growing solr repository, and I need to process every single
> > document to extract statistics. What would be a reasonable process that
> > satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer:
> > if I have processed the first 20k docs, next time I can start with 20001.
>
> If your index isn't very big, a *:* query with rows and start parameters
> is perfectly acceptable.  Performance is terrible for this method when
> the index gets huge, though.
>

==> Essentially we are doing pagination here, right? If performance is not
a concern, given that the index is dynamic, does the order of
entries remain stable over time?
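
For concreteness, the rows/start approach quoted above could be sketched as
below. This is only an illustration: the helper name page_params is
hypothetical, and the explicit "sort=id asc" is an assumption added so that
the page order is at least deterministic for a fixed index. It does not by
itself keep pages stable while documents are being added or replaced.

```python
from urllib.parse import urlencode


def page_params(page, rows, key="id"):
    """Build the parameters for one page of a *:* scan.

    `page` is zero-based; `key` is assumed to be the uniqueKey field.
    """
    return {
        "q": "*:*",
        "rows": rows,
        "start": page * rows,    # offset of the first document on this page
        "sort": f"{key} asc",    # assumption: explicit sort for a fixed order
    }


def page_query_string(page, rows, key="id"):
    """URL-encode the parameters so they can be appended to a select URL."""
    return urlencode(page_params(page, rows, key))
```

Resuming after the first 20k docs with rows=1000 would then mean starting at
page 20, i.e. start=20000.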



> If "id" is your uniqueKey field, here's how you can do it.  If that's
> not your uniqueKey field, substitute your uniqueKey field for id.  This
> method doesn't work properly if you don't use a field with values that
> are guaranteed to be unique.
>
> For the first query, send a query with these parameters, where NNNNNN is
> the number of docs you want to retrieve at once:
> q=*:*&rows=NNNNNN&sort=id asc
>
> For each subsequent query, use the following parameters, where XXX is
> the highest id value seen in the previous query:
> q={XXX TO *}&rows=NNNNNN&sort=id asc
>
==> This approach seems to require that the id field is numerical, right?
I have a text-based id that is unique.

==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't the query
be matched against the default search field, which could be "content", for
example? How would that do the job?
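
To make the question concrete, here is a rough sketch of the loop as I read
it. It assumes the range is meant to be qualified with the uniqueKey field,
i.e. "id:{XXX TO *}"; that qualification is my assumption, not something
stated in the quoted text. The names next_params and scan_all are
hypothetical, and fetch stands for any caller-supplied function that sends
the parameters to Solr and returns the list of documents in the response.
The loop stops when a page comes back with fewer than rows documents.

```python
def next_params(last_id, rows, key="id"):
    """Parameters for the next batch, resuming after `last_id`.

    The curly braces make the range exclusive, so `last_id` itself is
    not returned again. Ids containing whitespace or query-syntax
    characters would need escaping, which this sketch does not do.
    """
    if last_id is None:
        q = "*:*"                              # first query: everything
    else:
        q = f"{key}:{{{last_id} TO *}}"        # e.g. id:{doc-0042 TO *}
    return {"q": q, "rows": rows, "sort": f"{key} asc"}


def scan_all(fetch, rows, key="id"):
    """Yield every document, one batch of `rows` at a time.

    `fetch(params)` is assumed to return a list of dicts, each
    containing the `key` field, sorted ascending by that field.
    """
    last_id = None
    while True:
        docs = fetch(next_params(last_id, rows, key))
        for doc in docs:
            yield doc
        if len(docs) < rows:                   # short batch: nothing left
            return
        last_id = docs[-1][key]                # highest id seen so far
```

Since only the last id has to be remembered between runs, this would also
cover the "incremental" requirement, provided the sort order on the id field
is consistent with the range comparison.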


> As soon as you see a numFound value less than NNNNNN, you will know that
> there's no more data.
>
> Generally speaking, you'd want to avoid updating the index while doing
> these queries.  If you never replace existing documents and you can
> guarantee that the value in the uniqueKey field for new documents will
> always be higher than any previous value, then you could continue
> updating the index.  A database autoincrement field would qualify for
> that condition.
>
> Thanks,
> Shawn
>
>
