lucene-solr-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: solr keep old docs
Date Wed, 28 Dec 2011 14:45:19 GMT
Thanks Erick,

that gives me a direction. I will write a new plugin and get back to the
dev list with the results, and then we can decide on next steps.

Best Regards
Alexander Aristov


On 28 December 2011 18:08, Erick Erickson <erickerickson@gmail.com> wrote:

> Well, the short answer is that nobody else has
> 1> had a similar requirement
> AND
> 2> not found a suitable work around
> AND
> 3> implemented the change and contributed it back.
>
> So, if you'd like to volunteer <G>.....
>
> Seriously. If you think this would be valuable and are
> willing to work on it, hop on over to the dev list and
> discuss it, open a JIRA and make it work. I'd start
> by opening a discussion on the dev list before
> opening a JIRA, just to get a sense of where the
> snags would be to changing the Solr code, but that's
> optional.
>
> That said, writing your own update request processor
> that detects this case isn't very difficult:
> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> and use it as a plugin.
>
> Best
> Erick
>
> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> <alexander.aristov@gmail.com> wrote:
> > the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES
> > old docs. I have tried it already.
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 28 December 2011 13:04, Lance Norskog <goksron@gmail.com> wrote:
> >
> >> The SignatureUpdateProcessor is for exactly this problem:
> >>
> >>
> >>
> >> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >>
> >> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >> <alexander.aristov@gmail.com> wrote:
> >> > I get docs from external sources and the only place I keep them is the
> >> > Solr index. I have no database or other means to track indexed docs (in
> >> > my personal opinion that might be a huge headache).
> >> >
> >> > Some docs might change slightly in their original sources but I don't
> >> > need those changes. In fact I need the original data only.
> >> >
> >> > So I have no other option but to either check whether a document is
> >> > already in the index before I put it into the SolrJ array (read: query
> >> > Solr), or develop my own update chain processor, implement the ID check
> >> > there and skip such docs.
> >> >
> >> > Maybe it's the wrong place to argue, and it has probably been discussed
> >> > before, but I wonder why the simple overwrite parameter doesn't work
> >> > here.
> >> >
> >> > In my opinion it suits perfectly here. In combination with the unique ID
> >> > it can cover all possible variants.
> >> >
> >> > cases:
> >> >
> >> > 1. overwrite=true and uniqueID exists: the newer doc should overwrite
> >> > the old one.
> >> >
> >> > 2. overwrite=false and uniqueID exists: the newer doc must be skipped
> >> > since the old one exists.
> >> >
> >> > 3. uniqueID doesn't exist: the newer doc just gets added, regardless of
> >> > whether an old one exists.
> >> >
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> > On 27 December 2011 22:53, Erick Erickson <erickerickson@gmail.com> wrote:
> >> >
> >> >> Mikhail is right as far as I know: the assumption built into Solr is
> >> >> that duplicate IDs (when <uniqueKey> is defined) should trigger the old
> >> >> document to be replaced.
> >> >>
> >> >> What is your system-of-record? By that I mean what does your SolrJ
> >> >> program do to send data to Solr? Is there any way you could just
> >> >> *not* send documents that are already in the Solr index based on,
> >> >> for instance, any timestamp associated with your system-of-record
> >> >> and the last time you did an incremental index?
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> >> <alexander.aristov@gmail.com> wrote:
> >> >> > Hi
> >> >> >
> >> >> > I am not using a database. All needed data is in the Solr index; that's
> >> >> > why I want to skip excessive checks.
> >> >> >
> >> >> > I will check DIH but am not sure it helps.
> >> >> >
> >> >> > I am fluent in Java and it's not a problem for me to write a class or
> >> >> > so, but I want to check first whether there are any ways (workarounds)
> >> >> > to make it work without coding, just by playing around with
> >> >> > configuration and params. I don't want to move away from the default
> >> >> > Solr implementation.
> >> >> >
> >> >> > Best Regards
> >> >> > Alexander Aristov
> >> >> >
> >> >> >
> >> >> > On 27 December 2011 09:33, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:
> >> >> >
> >> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >> >> alexander.aristov@gmail.com> wrote:
> >> >> >>
> >> >> >> > Hi people,
> >> >> >> >
> >> >> >> > I urgently need your help!
> >> >> >> >
> >> >> >> > I have Solr 3.3 configured and running. I do incremental indexing 4
> >> >> >> > times a day using bulk updates. Some documents are identical to some
> >> >> >> > extent and I wish to skip them rather than index them.
> >> >> >> > But here is the problem: I could not find a way to tell Solr to
> >> >> >> > ignore new duplicate docs and keep the old indexed docs. I don't care
> >> >> >> > that it's new. Just determine by ID that such a document is in the
> >> >> >> > index already and that's it.
> >> >> >> >
> >> >> >> > I use SolrJ for indexing. I have tried setting overwrite=false and
> >> >> >> > the dedupe approach but nothing helped. Either a newer doc overwrites
> >> >> >> > the old one or I get a duplicate.
> >> >> >> >
> >> >> >> > I think it's a very simple and basic feature and it must exist. What
> >> >> >> > did I do wrong or fail to do?
> >> >> >> >
> >> >> >>
> >> >> >> I guess it's because the mainstream approach is delta-import, where
> >> >> >> you have "updated" timestamps in your DB and a "last-import" timestamp
> >> >> >> stored somewhere. You can check how it works in DIH.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> > I tried Google but couldn't find a solution there, although many
> >> >> >> > people have encountered this problem.
> >> >> >> >
> >> >> >> >
> >> >> >> It definitely can be done by overriding
> >> >> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
> >> >> >> suggest starting by implementing your own
> >> >> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK
> >> >> >> and bypass the chain call if it's found. Then, if you hit performance
> >> >> >> issues querying your PKs one by one (but only after that), you can
> >> >> >> batch your searches; there are a couple of optimization techniques for
> >> >> >> huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).
> >> >> >>
> >> >> >>
> >> >> >> > I am starting to think that I must query the index to check whether
> >> >> >> > a doc to be added is already in the index and, if so, not add it to
> >> >> >> > the array, but I have so many docs that I am afraid it's not a good
> >> >> >> > solution.
> >> >> >> >
> >> >> >> > Best Regards
> >> >> >> > Alexander Aristov
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Sincerely yours
> >> >> >> Mikhail Khludnev
> >> >> >> Lucid Certified
> >> >> >> Apache Lucene/Solr Developer
> >> >> >> Grid Dynamics
> >> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
>
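
The three overwrite cases listed in the thread, and the batched lookup
suggested for checking many IDs at once, can be sketched in plain Java. This
is only an in-memory model under stated assumptions, not Solr code: the class
KeepOldDocs and both method names are hypothetical, a Map stands in for the
index, case 3 is read as "the ID is not yet in the index", and the IDs are
assumed to contain no query-syntax characters that would need escaping.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical in-memory model of the behaviour discussed in the thread.
// Nothing here is a Solr API; the Map stands in for the index (ID -> doc).
final class KeepOldDocs {

    // Builds one batched lookup such as PK:(2 OR 4 OR 5 OR 6).
    // Assumption: the IDs need no escaping for the query parser.
    static String buildIdQuery(String idField, Collection<String> ids) {
        return idField + ":(" + String.join(" OR ", ids) + ")";
    }

    // Applies the three cases from the thread.
    // Returns true if the document was written, false if it was skipped.
    static boolean add(Map<String, String> index, String id, String doc,
                       boolean overwrite) {
        if (index.containsKey(id) && !overwrite) {
            return false;       // case 2: the old doc exists, keep it
        }
        index.put(id, doc);     // case 1: replace; case 3: plain add
        return true;
    }
}
```

Calling add(index, "2", "changed", false) when ID 2 is already present leaves
the stored document untouched and returns false, which is the "keep old docs"
behaviour asked for. In a real update processor the containsKey check would be
a lookup against the index, and buildIdQuery shows how several such lookups
could be combined into one query.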
