lucene-solr-user mailing list archives

From Roman Chyla <>
Subject Re: processing documents in solr
Date Sat, 27 Jul 2013 12:32:42 GMT
Dear list,
I've written a special processor exactly for this kind of operation.

This is how we use it

It is capable of processing a 200 GB index in a few minutes;
copying/streaming large amounts of data is normal for it.

If there is general interest, we can create a JIRA issue - but given my
current workload, it will take a while, and somebody else will *have to*
invest their time and energy in testing it, reporting issues, etc. Of
course, feel free to create the JIRA yourself or reuse the code -
hopefully, you will improve it and let me know ;-)

On 27 Jul 2013 01:03, "Joe Zhang" <> wrote:

> Dear list:
> I have an ever-growing Solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
> if I have processed the first 20k docs, next time I can start with 20001.
> A simple "*:*" query would satisfy the 1st property but not the 2nd. In
> fact, given that the processing will take very long and the repository
> keeps growing, it is not even clear that exhaustiveness can be achieved.
> I'm running Solr 3.6.2 in a single-machine setting; no Hadoop capability
> yet. But I guess the same issues would still hold in a SolrCloud
> environment, right, say within each shard?
> Any help would be greatly appreciated.
> Joe
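A common answer to Joe's two requirements on Solr 3.x (which predates the cursorMark deep-paging feature) is to sort on a unique key and page with exclusive range queries: record the last id you processed, and resume with `id:{lastid TO *}`. This is a minimal sketch of that pattern, not Roman's processor; the field name `id`, the page size, and the in-memory fake index used in place of a live Solr instance are all assumptions for illustration.

```python
# Sketch: exhaustive + incremental traversal of a Solr index by range-
# querying a unique, sortable key. Solr 3.6 predates cursorMark, so
# "sort by unique key, filter id > last seen" is the usual substitute.
# The fetch function is injected so the paging logic runs without Solr.

def next_page_params(last_id, rows=1000, uniq="id"):
    """Build Solr query params for the page after `last_id`.

    Uses an exclusive lower bound ({ ... TO *}) so already-processed
    docs are skipped; restarting means passing the last id recorded.
    NOTE: real ids with special characters would need query escaping.
    """
    q = "*:*" if last_id is None else "%s:{%s TO *}" % (uniq, last_id)
    return {"q": q, "sort": "%s asc" % uniq, "rows": rows}

def traverse(fetch, start_after=None, rows=1000):
    """Yield every document, in unique-key order, resumably."""
    last = start_after
    while True:
        docs = fetch(next_page_params(last, rows=rows))
        if not docs:
            return  # exhausted the index
        for doc in docs:
            yield doc
        last = docs[-1]["id"]  # checkpoint: persist this to resume later

# --- usage with a fake in-memory "index" standing in for Solr ---
INDEX = ["doc%02d" % i for i in range(1, 8)]  # sorted unique keys

def fake_fetch(params):
    """Simulate a Solr handler: apply the range filter, cap at rows."""
    q = params["q"]
    last = None if q == "*:*" else q[len("id:{"):q.index(" TO")]
    pool = [i for i in INDEX if last is None or i > last]
    return [{"id": i} for i in pool[: params["rows"]]]

full_run = [d["id"] for d in traverse(fake_fetch, rows=3)]
resumed  = [d["id"] for d in traverse(fake_fetch, start_after="doc03", rows=3)]
```

Here `full_run` visits all seven documents, and `resumed` picks up after `doc03` without revisiting anything, which is exactly the divide-and-conquer property asked for. Newly added documents sort after the checkpoint only if the unique key is monotonically increasing; otherwise a final sweep is still needed to catch late arrivals.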
