lucene-solr-user mailing list archives

From Joe Zhang <smartag...@gmail.com>
Subject Re: processing documents in solr
Date Mon, 29 Jul 2013 16:43:43 GMT
I'll try reindexing the timestamp.

The id-creation approach suggested by Erick sounds attractive, but the
nutch/solr integration seems rather tight. I don't know where to break in to
insert the id into solr.


On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> No, SolrJ doesn't provide this automatically. You'd be providing the
> counter by inserting it into the document as you created new docs.
>
> You could do this with any kind of document creation you are
> using.
>
> Best
> Erick
>
> On Mon, Jul 29, 2013 at 2:51 AM, Aditya <findbestopensource@gmail.com>
> wrote:
> > Hi,
> >
> > The easiest solution would be to have the timestamp indexed. Is there any
> > issue in doing re-indexing?
> > If you want to process records in batches then you need an ordered list and
> > a bookmark. You require a field to sort on, and you maintain a counter /
> > last id as the bookmark. This is mandatory to solve your problem.
> >
> > If you don't want to re-index, then you need to maintain information
> > related to visited nodes. Have a database / Solr core which maintains a
> > list of IDs which are already processed. Fetch records from Solr and, for
> > each record, check the new DB to see if the record is already processed.
> >
> > Regards
> > Aditya
> > www.findbestopensource.com
> >
> >
> >
> >
> >
> > On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang <smartagent@gmail.com> wrote:
> >
> >> Basically, I was thinking about running a range query like Shawn suggested
> >> on the tstamp field, but unfortunately it was not indexed. Range queries
> >> only work on indexed fields, right?
> >>
> >>
> >> On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang <smartagent@gmail.com> wrote:
> >>
> >> > I've been thinking about the tstamp solution in the past few days, but
> >> > too bad, the field is available but not indexed...
> >> >
> >> > I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
> >> > counter value. If yes, that would be equivalent to an autoincrement id.
> >> > I'm indexing from Nutch though; don't know how to feed in such a counter...
> >> >
> >> >
> >> > On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >> >
> >> >> Why wouldn't a simple timestamp work for the ordering? Although
> >> >> I guess "simple timestamp" isn't really simple if the time settings
> >> >> change.
> >> >>
> >> >> So how about a simple counter field in your documents? Assuming
> >> >> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
> >> >> Take the counter from the first document returned. Increment for
> >> >> each doc for the life of the indexing run. Now you've got, for all
> >> >> intents and purposes, an identity field, albeit manually maintained.
> >> >>
> >> >> Then use your counter field as Shawn suggests for pulling all the
> >> >> data out.
> >> >>
> >> >> FWIW,
> >> >> Erick
> >> >>
> >> >> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
> >> >> <mcucchiara@apache.org> wrote:
> >> >> > In both cases, for better performance, I'd first load just all the IDs,
> >> >> > then load each document during processing.
> >> >> > As for the incremental requirement, it should not be difficult to
> >> >> > write a hash function which maps a non-numerical ID to a value.
> >> >> >  On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartagent@gmail.com> wrote:
> >> >> >
> >> >> >> Dear list:
> >> >> >>
> >> >> >> I have an ever-growing solr repository, and I need to process every
> >> >> >> single document to extract statistics. What would be a reasonable
> >> >> >> process that satisfies the following properties:
> >> >> >>
> >> >> >> - Exhaustive: I have to traverse every single document
> >> >> >> - Incremental: in other words, it has to allow me to divide and
> >> >> >> conquer --- if I have processed the first 20k docs, next time I can
> >> >> >> start with 20001.
> >> >> >>
> >> >> >> A simple "*:*" query would satisfy the 1st but not the 2nd property.
> >> >> >> In fact, given that the processing will take very long, and the
> >> >> >> repository keeps growing, it is not even clear that the exhaustiveness
> >> >> >> is achieved.
> >> >> >>
> >> >> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop
> >> >> >> capability yet. But I guess the same issues still hold even if I have
> >> >> >> the solr cloud environment, right, say in each shard?
> >> >> >>
> >> >> >> Any help would be greatly appreciated.
> >> >> >>
> >> >> >> Joe
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>
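
The "ordered list plus bookmark" idea discussed in this thread could be sketched roughly as below. This is only an illustration under assumptions, not anything posted by the participants: it assumes an indexed integer field named `counter` (a hypothetical name) that increases as documents are added, the standard `/select` handler with `wt=json`, and a placeholder `base_url`. The function names are invented for the example. The inclusive range form `[N TO *]` is used so the query avoids relying on mixed-bracket range syntax.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_batch_params(last_counter, rows=1000, field="counter"):
    """Build query parameters for the batch that follows the bookmark.

    `field` is an assumed indexed integer field; `last_counter` is the
    highest value already processed (the bookmark).
    """
    return {
        "q": f"{field}:[{last_counter + 1} TO *]",  # inclusive lower bound
        "sort": f"{field} asc",                     # stable, ordered traversal
        "rows": rows,
        "fl": f"id,{field}",
        "wt": "json",
    }


def solr_fetch(base_url):
    """Return a fetch function bound to a Solr core URL (hypothetical URL)."""
    def fetch(params):
        with urlopen(f"{base_url}/select?{urlencode(params)}") as resp:
            return json.load(resp)["response"]["docs"]
    return fetch


def process_incrementally(fetch, handle, last_counter=0, rows=1000,
                          field="counter"):
    """Process batches in counter order until none remain.

    Returns the new bookmark, which the caller should persist so that the
    next run resumes after the last processed document.
    """
    while True:
        docs = fetch(build_batch_params(last_counter, rows, field))
        if not docs:
            return last_counter
        for doc in docs:
            handle(doc)
            last_counter = doc[field]  # advance bookmark after each doc
```

A run might look like `bookmark = process_incrementally(solr_fetch(base_url), handle_doc, last_counter=saved_bookmark)`, where `handle_doc` and `saved_bookmark` are supplied by the caller. Documents added while the loop is running receive higher counter values, so they are picked up by a later batch or a later run.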
