lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Mon, 29 Jul 2013 04:49:06 GMT
I've been thinking about the tstamp solution for the past few days, but too
bad, the field is available but not indexed...

I'm not familiar with SolrJ. Again, it sounds like SolrJ is providing the
counter value. If so, that would be equivalent to an autoincrement id. I'm
indexing from Nutch though; I don't know how to feed in such a counter...
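
A minimal sketch of the counter bookkeeping Erick describes below, written in
Python rather than SolrJ and without contacting a server. The field name
"counter" comes from his message; the helper names and the sample documents
are hypothetical, and how to hook this into a Nutch indexing run is left open.

```python
from urllib.parse import urlencode


def max_counter_query():
    """Query-string parameters that ask Solr for the highest counter so far."""
    return urlencode({"q": "*:*", "sort": "counter desc", "rows": 1, "fl": "counter"})


def assign_counters(docs, last_counter):
    """Return copies of docs numbered consecutively after last_counter."""
    numbered = []
    for offset, doc in enumerate(docs, start=1):
        copy = dict(doc)  # leave the caller's documents untouched
        copy["counter"] = last_counter + offset
        numbered.append(copy)
    return numbered


# Hypothetical run: the highest counter already in the index is 20000.
batch = assign_counters([{"id": "a"}, {"id": "b"}], last_counter=20000)
```

The counter only stays unique if a single indexing run assigns values at a
time, which is the "manually maintained" caveat in Erick's message.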


On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> Why wouldn't a simple timestamp work for the ordering? Although
> I guess "simple timestamp" isn't really simple if the time settings
> change.
>
> So how about a simple counter field in your documents? Assuming
> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
> Take the counter from the first document returned. Increment for
> each doc for the life of the indexing run. Now you've got, for all intents
> and purposes, an identity field albeit manually maintained.
>
> Then use your counter field as Shawn suggests for pulling all the
> data out.
>
> FWIW,
> Erick
>
> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
> <mcucchiara@apache.org> wrote:
> > In both cases, for better performance, I'd first load just the IDs,
> > then load each document during processing.
> > As for the incremental requirement, it should not be difficult to
> > write a hash function which maps a non-numerical ID to a value.
> > On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartagent@gmail.com> wrote:
> >
> >> Dear list:
> >>
> >> I have an ever-growing solr repository, and I need to process every
> >> single document to extract statistics. What would be a reasonable
> >> process that satisfies the following properties:
> >>
> >> - Exhaustive: I have to traverse every single document
> >> - Incremental: in other words, it has to allow me to divide and conquer
> >> --- if I have processed the first 20k docs, next time I can start with
> >> 20001.
> >>
> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> >> fact, given that the processing will take very long, and the repository
> >> keeps growing, it is not even clear that the exhaustiveness is achieved.
> >>
> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> >> yet. But I guess the same issues still hold even if I have the solr
> >> cloud environment, right, say in each shard?
> >>
> >> Any help would be greatly appreciated.
> >>
> >> Joe
> >>
>
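
One plausible way to use such a counter for the exhaustive and incremental
traversal asked about in the original question: record the highest counter at
the start of a run, then walk fixed-size inclusive ranges from the last
checkpoint up to that value. This is only a sketch under those assumptions.
The function names and the numbers are hypothetical, and the filter strings
would be sent to Solr as fq (or q) values by whatever client is in use.

```python
def windows(checkpoint, max_counter, batch_size):
    """Yield inclusive (start, end) counter ranges covering checkpoint+1..max_counter."""
    start = checkpoint + 1
    while start <= max_counter:
        end = min(start + batch_size - 1, max_counter)
        yield start, end
        start = end + 1


def range_filter(start, end):
    """Solr range-query syntax on the counter field, both bounds inclusive."""
    return "counter:[{} TO {}]".format(start, end)


# Hypothetical run: 20000 docs were processed earlier; the index now holds 45000.
filters = [range_filter(s, e) for s, e in windows(20000, 45000, 10000)]
# filters == ["counter:[20001 TO 30000]",
#             "counter:[30001 TO 40000]",
#             "counter:[40001 TO 45000]"]
```

Documents added while a run is in progress get counters above the recorded
maximum, so they fall into the next run instead of being skipped.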
