lucene-solr-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: solr keep old docs
Date Wed, 28 Dec 2011 11:46:15 GMT
The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old
docs. I have already tried it.

Best Regards
Alexander Aristov


On 28 December 2011 13:04, Lance Norskog <goksron@gmail.com> wrote:

> The SignatureUpdateProcessor is for exactly this problem:
>
>
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>
> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> <alexander.aristov@gmail.com> wrote:
> > I get docs from external sources and the only place I keep them is the Solr
> > index. I have no database or other means to track indexed docs (my
> > personal opinion is that it might be a huge headache).
> >
> > Some docs might change slightly in their original sources, but I don't need
> > those changes. In fact, I need the original data only.
> >
> > So I have no other option but to either check whether a document is already
> > in the index before I put it into the SolrJ array (read: query Solr), or
> > develop my own update chain processor, implement the ID check there, and
> > skip such docs.
> >
> > Maybe this is the wrong place to argue, and it has probably been discussed
> > before, but I wonder why the simple overwrite parameter doesn't work here.
> >
> > In my opinion it suits this case perfectly. In combination with a unique ID
> > it can cover all possible variants.
> >
> > Cases:
> >
> > 1. overwrite=true and the uniqueID exists: the newer doc should overwrite
> > the old one.
> >
> > 2. overwrite=false and the uniqueID exists: the newer doc must be skipped,
> > since the old one exists.
> >
> > 3. The uniqueID doesn't exist: the newer doc just gets added, regardless of
> > whether an old one exists or not.
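
A minimal sketch of the three cases above, written as a plain-Java in-memory model rather than Solr code. The class OverwritePolicyModel and its methods are hypothetical names used only for illustration, and case 3 is read here as "the document carries no unique ID", which is an assumption about the wording. It shows the behaviour being asked for, not what Solr 3.3 does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the three proposed cases; not Solr code. */
class OverwritePolicyModel {
    // Stands in for the index: unique ID -> stored document body.
    private final Map<String, String> byId = new LinkedHashMap<>();
    // Documents without a unique ID cannot be matched, so they are appended.
    private final List<String> withoutId = new ArrayList<>();

    /** Returns true if the document was written, false if it was skipped. */
    boolean add(String id, String body, boolean overwrite) {
        if (id == null) {
            withoutId.add(body);      // case 3: no unique ID, always add
            return true;
        }
        if (byId.containsKey(id) && !overwrite) {
            return false;             // case 2: keep the old document
        }
        byId.put(id, body);           // case 1, or a first-time add
        return true;
    }

    /** Returns the stored body for an ID, or null if the ID is unknown. */
    String get(String id) {
        return byId.get(id);
    }

    /** Total number of stored documents, with and without IDs. */
    int size() {
        return byId.size() + withoutId.size();
    }
}
```

With this model, adding ID "a" twice with overwrite=false keeps the first body, while overwrite=true replaces it.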
> >
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 27 December 2011 22:53, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> Mikhail is right as far as I know: the assumption built into Solr is that
> >> duplicate IDs (when <uniqueKey> is defined) should trigger the old
> >> document to be replaced.
> >>
> >> What is your system-of-record? By that I mean what does your SolrJ
> >> program do to send data to Solr? Is there any way you could just
> >> *not* send documents that are already in the Solr index based on,
> >> for instance, any timestamp associated with your system-of-record
> >> and the last time you did an incremental index?
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> <alexander.aristov@gmail.com> wrote:
> >> > Hi
> >> >
> >> > I am not using a database. All the data I need is in the Solr index,
> >> > which is why I want to skip excessive checks.
> >> >
> >> > I will check DIH, but I'm not sure it helps.
> >> >
> >> > I am fluent in Java and it's not a problem for me to write a class or so,
> >> > but I want to check first whether there are any ways (workarounds) to
> >> > make it work without coding, just by playing around with configuration
> >> > and params. I don't want to move away from the default Solr
> >> > implementation.
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> > On 27 December 2011 09:33, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:
> >> >
> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >> alexander.aristov@gmail.com> wrote:
> >> >>
> >> >> > Hi people,
> >> >> >
> >> >> > I urgently need your help!
> >> >> >
> >> >> > I have Solr 3.3 configured and running. I do incremental indexing 4
> >> >> > times a day using bulk updates. Some documents are identical to some
> >> >> > extent and I wish to skip them rather than index them.
> >> >> > The problem is that I could not find a way to tell Solr to ignore new
> >> >> > duplicate docs and keep the old indexed docs. I don't care that a doc
> >> >> > is new. Solr should just determine by ID that the document is already
> >> >> > in the index, and that's it.
> >> >> >
> >> >> > I use SolrJ for indexing. I have tried setting overwrite=false and the
> >> >> > dedupe approach, but nothing helped. Either the newer doc overwrites
> >> >> > the old one or I get a duplicate.
> >> >> >
> >> >> > I think it's a very simple and basic feature and it must exist. What
> >> >> > did I do wrong or fail to do?
> >> >> >
> >> >>
> >> >> I guess that's because the mainstream approach is delta-import, where
> >> >> you have "updated" timestamps in your DB and a "last-import" timestamp
> >> >> stored somewhere. You can check how it works in DIH.
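
The timestamp comparison behind a delta import can be sketched in plain Java as below. This is only an illustration of the idea, not DIH itself: the class DeltaSelection, the method changedSince, and the map of per-record "updated" timestamps are all hypothetical, and it assumes a system-of-record that exposes such timestamps.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating delta selection by timestamp. */
class DeltaSelection {
    /**
     * Returns the IDs whose "updated" timestamp is strictly after the
     * stored "last-import" timestamp; only these would be re-sent.
     */
    static List<String> changedSince(Map<String, Instant> updatedById,
                                     Instant lastImport) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : updatedById.entrySet()) {
            if (entry.getValue().isAfter(lastImport)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```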
> >> >>
> >> >>
> >> >> >
> >> >> > I tried Google but couldn't find a solution there, although many
> >> >> > people have encountered this problem.
> >> >> >
> >> >> >
> >> >> It definitely can be done by overriding
> >> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
> >> >> suggest starting by implementing your own
> >> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK
> >> >> and bypass the chain call if it's found. Then, if you hit performance
> >> >> issues querying your PKs one by one (but only after that), you can batch
> >> >> your searches; there are a couple of optimization techniques for huge
> >> >> disjunction queries like PK:(2 OR 4 OR 5 OR 6).
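
A rough sketch of the batching step, in plain Java with no Solr dependency. The class PkBatchCheck and both method names are hypothetical. It assumes the IDs contain no characters that need query escaping (SolrJ's ClientUtils.escapeQueryChars could be applied otherwise), and that each batch stays below the maxBooleanClauses limit, which defaults to 1024. Running the query and collecting the IDs it returns is left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical helper for checking a batch of primary keys in one query. */
class PkBatchCheck {
    /**
     * Builds a disjunction query such as PK:(2 OR 4 OR 5 OR 6).
     * Assumes the IDs need no escaping and the collection is not empty.
     */
    static String disjunctionQuery(String field, Collection<String> ids) {
        StringJoiner joiner = new StringJoiner(" OR ", field + ":(", ")");
        for (String id : ids) {
            joiner.add(id);
        }
        return joiner.toString();
    }

    /**
     * Keeps only the IDs from the batch that the lookup did not find,
     * preserving the batch order. These are the docs still to be added.
     */
    static List<String> notYetIndexed(List<String> batchIds, Set<String> foundIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : batchIds) {
            if (!foundIds.contains(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

The query string would be sent once per batch, and only the documents whose IDs come back from notYetIndexed would be put into the array passed to SolrJ.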
> >> >>
> >> >>
> >> >> > I am starting to think that I must query the index to check whether a
> >> >> > doc to be added is already in the index, and not add it to the array
> >> >> > if so, but I have so many docs that I am afraid it's not a good
> >> >> > solution.
> >> >> >
> >> >> > Best Regards
> >> >> > Alexander Aristov
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Sincerely yours
> >> >> Mikhail Khludnev
> >> >> Lucid Certified
> >> >> Apache Lucene/Solr Developer
> >> >> Grid Dynamics
> >> >>
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
