lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Sat, 27 Jul 2013 17:17:32 GMT
Thanks for sharing, Roman. I'll look into your code.

One more thought on your suggestion, Shawn. In fact, for the id, we need
more than "unique" and "rangeable"; we also need some sense of atomic
values. Your approach might run into trouble with a text-based id field.
Say the id/key field has values 'a', 'c', 'f', 'g', and our page size is 2.
Your suggestion works fine for those existing documents. But there is no
guarantee that a newly added document won't use the key value 'b', and once
the paging has moved past 'c', that new document would be missed by your
algorithm, right?
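To make the scenario concrete, here is a quick pure-Python sketch of the
range-based paging I understand you to be describing (an in-memory set
stands in for the index, and all names are illustrative, not Solr API):

```python
# Simulate id-range paging over a sorted text key field.
# 'index' stands in for the Solr collection; names are illustrative.

def fetch_page(index, last_id, page_size):
    """Return up to page_size ids strictly greater than last_id, sorted."""
    return sorted(i for i in index if i > last_id)[:page_size]

index = {"a", "c", "f", "g"}
seen = []
cursor = ""            # start before the smallest possible key

# Process the first page: 'a' and 'c'.
page = fetch_page(index, cursor, 2)
seen.extend(page)
cursor = page[-1]      # cursor has now advanced past 'c'

# A new document arrives with key 'b' -- *behind* the cursor.
index.add("b")

# Continue paging until exhausted.
while True:
    page = fetch_page(index, cursor, 2)
    if not page:
        break
    seen.extend(page)
    cursor = page[-1]

print(seen)            # ['a', 'c', 'f', 'g'] -- 'b' is never processed
```

The traversal terminates believing it was exhaustive, yet 'b' is silently
skipped, which is exactly the risk with non-atomic, text-valued keys.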


On Sat, Jul 27, 2013 at 5:32 AM, Roman Chyla <roman.chyla@gmail.com> wrote:

> Dear list,
> I've written a special processor exactly for this kind of operation
>
>
> https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch
>
> This is how we use it
> http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch
>
> It is capable of processing an index of 200 GB in a few minutes;
> copying/streaming large amounts of data is normal.
>
> If there is general interest, we can create a JIRA issue - but given my
> current workload, it will take longer, and somebody else will also
> *have to* invest their time and energy in testing it, reporting, etc. Of
> course, feel free to create the JIRA yourself or reuse the code -
> hopefully, you will improve it and let me know ;-)
>
> Roman
> On 27 Jul 2013 01:03, "Joe Zhang" <smartagent@gmail.com> wrote:
>
> > Dear list:
> >
> > I have an ever-growing solr repository, and I need to process every
> single
> > document to extract statistics. What would be a reasonable process that
> > satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer
> ---
> > if I have processed the first 20k docs, next time I can start with 20001.
> >
> > A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> > fact, given that the processing will take very long and the repository
> > keeps growing, it is not even clear that exhaustiveness can be achieved.
> >
> > I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability
> > yet. But I guess the same issues would still hold even in a SolrCloud
> > environment, right, say within each shard?
> >
> > Any help would be greatly appreciated.
> >
> > Joe
> >
>
