lucene-solr-user mailing list archives

From Aditya <findbestopensou...@gmail.com>
Subject Re: processing documents in solr
Date Mon, 29 Jul 2013 06:51:38 GMT
Hi,

The easiest solution would be to have the timestamp indexed. Is there any
issue with re-indexing?
If you want to process records in batches, then you need an ordered list and
a bookmark: a field to sort on, plus a counter / last id that you keep as
the bookmark. This is mandatory to solve your problem.
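
To make the ordered-list-plus-bookmark idea concrete, here is a rough Python sketch. It is only an illustration: the in-memory list stands in for the Solr index, and `make_fetcher`, `process_batch` and the `counter` field are names I made up for the example. Against a real index, `fetch_batch` would be a query on the indexed sort field, sorted ascending and restricted to values above the bookmark.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, int]


def make_fetcher(index: List[Doc], sort_field: str) -> Callable[[Optional[int], int], List[Doc]]:
    """Hypothetical stand-in for a sorted query against the index."""

    def fetch_batch(after: Optional[int], rows: int) -> List[Doc]:
        # Re-sort on every call so documents added later are picked up.
        ordered = sorted(index, key=lambda d: d[sort_field])
        newer = [d for d in ordered if after is None or d[sort_field] > after]
        return newer[:rows]

    return fetch_batch


def process_batch(fetch_batch, sort_field: str, bookmark: Optional[int],
                  rows: int, handle: Callable[[Doc], None]) -> Optional[int]:
    """Process one batch and return the new bookmark (unchanged if nothing is new)."""
    for doc in fetch_batch(bookmark, rows):
        handle(doc)
        bookmark = doc[sort_field]  # persist this value between runs
    return bookmark


# Usage: two runs over a growing index, resuming from the saved bookmark.
index = [{"id": i, "counter": i} for i in range(1, 6)]
fetch = make_fetcher(index, "counter")
seen: List[Doc] = []

bookmark = process_batch(fetch, "counter", None, 3, seen.append)      # first 3 docs
index.append({"id": 6, "counter": 6})                                 # index keeps growing
bookmark = process_batch(fetch, "counter", bookmark, 3, seen.append)  # next 3 docs
```

The sketch assumes the sort field is unique per document. If several documents can share a value (as with a coarse timestamp), a batch boundary that falls between them would skip some, so you would need a tie-breaker such as the id.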

If you don't want to re-index, then you need to keep track of which
documents you have already visited. Have a database / Solr core that
maintains a list of the IDs already processed. Fetch records from Solr and,
for each record, check that store to see whether it has already been processed.
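
A minimal sketch of that second approach, assuming a small SQLite table as the "database" of processed IDs. The table name `processed` and the helpers `open_store` and `process_unvisited` are invented for the example, and the `docs` list stands in for whatever you fetch from Solr.

```python
import sqlite3
from typing import Callable, Dict, Iterable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of already-processed document IDs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)")
    return conn


def process_unvisited(conn: sqlite3.Connection, docs: Iterable[Dict[str, str]],
                      handle: Callable[[Dict[str, str]], None]) -> int:
    """Handle every doc whose ID is not in the store yet; return how many were handled."""
    count = 0
    for doc in docs:
        doc_id = str(doc["id"])
        row = conn.execute("SELECT 1 FROM processed WHERE id = ?", (doc_id,)).fetchone()
        if row is not None:
            continue  # already processed in an earlier run
        handle(doc)
        # Record the ID only after the doc was handled successfully.
        conn.execute("INSERT INTO processed (id) VALUES (?)", (doc_id,))
        conn.commit()
        count += 1
    return count


# Usage: the second pass only touches the document added in between.
conn = open_store()
docs = [{"id": "a"}, {"id": "b"}]
handled = []
first = process_unvisited(conn, docs, handled.append)
docs.append({"id": "c"})
second = process_unvisited(conn, docs, handled.append)
```

Note that this still reads every record from Solr on each pass and only skips the processing step, so it trades re-indexing for one lookup per document.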

Regards
Aditya
www.findbestopensource.com





On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang <smartagent@gmail.com> wrote:

> Basically, I was thinking about running a range query like Shawn suggested
> on the tstamp field, but unfortunately it was not indexed. Range queries
> only work on indexed fields, right?
>
>
> On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang <smartagent@gmail.com> wrote:
>
> > I've been thinking about the tstamp solution in the past few days, but too
> > bad, the field is available but not indexed...
> >
> > I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
> > counter value. If yes, that would be equivalent to an autoincrement id. I'm
> > indexing from Nutch though; don't know how to feed in such counter...
> >
> >
> > On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> Why wouldn't a simple timestamp work for the ordering? Although
> >> I guess "simple timestamp" isn't really simple if the time settings
> >> change.
> >>
> >> So how about a simple counter field in your documents? Assuming
> >> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
> >> Take the counter from the first document returned. Increment for
> >> each doc for the life of the indexing run. Now you've got, for all intents
> >> and purposes, an identity field albeit manually maintained.
> >>
> >> Then use your counter field as Shawn suggests for pulling all the
> >> data out.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
> >> <mcucchiara@apache.org> wrote:
> >> > In both cases, for better performance, I'd first load just the IDs,
> >> > and then load each document during processing.
> >> > As for the incremental requirement, it should not be difficult to
> >> > write a hash function which maps a non-numerical ID to a value.
> >> >  On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartagent@gmail.com> wrote:
> >> >
> >> >> Dear list:
> >> >>
> >> >> I have an ever-growing solr repository, and I need to process every single
> >> >> document to extract statistics. What would be a reasonable process that
> >> >> satisfies the following properties:
> >> >>
> >> >> - Exhaustive: I have to traverse every single document
> >> >> - Incremental: in other words, it has to allow me to divide and conquer ---
> >> >> if I have processed the first 20k docs, next time I can start with 20001.
> >> >>
> >> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> >> >> fact, given that the processing will take very long, and the repository
> >> >> keeps growing, it is not even clear that the exhaustiveness is achieved.
> >> >>
> >> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> >> >> yet. But I guess the same issues still hold even if I have the solr cloud
> >> >> environment, right, say in each shard?
> >> >>
> >> >> Any help would be greatly appreciated.
> >> >>
> >> >> Joe
> >> >>
> >>
> >
> >
>
