lucene-solr-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: solr keep old docs
Date Tue, 27 Dec 2011 11:38:43 GMT
Hi

I am not using a database. All the data I need is in the Solr index, which
is why I want to skip the extra checks.

I will look at DIH, but I'm not sure it will help.

I am fluent in Java, so writing a class is not a problem for me, but first
I want to check whether there is a way (a workaround) to make this work
without coding, just by adjusting configuration and params. I don't want to
move away from the default Solr implementation.

Best Regards
Alexander Aristov


On 27 December 2011 09:33, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:

> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> alexander.aristov@gmail.com> wrote:
>
> > Hi people,
> >
> > I urgently need your help!
> >
> > I have solr 3.3 configured and running. I do incremental indexing 4
> > times a day using bulk updates. Some documents are identical to some
> > extent and I wish to skip them rather than index them.
> > The problem is that I could not find a way to tell solr to ignore new
> > duplicate docs and keep the old indexed docs. I don't care that a doc
> > is new. Just determine by ID that such a document is already in the
> > index, and that's it.
> >
> > I use solrj for indexing. I have tried setting overwrite=false and the
> > dedupe approach, but nothing helped. Either the newer doc overwrites
> > the old one, or I get a duplicate.
> >
> > I think it's a very simple and basic feature and it must exist. What
> > did I do wrong, or what did I miss?
> >
>
> I guess that's because the mainstream approach is delta-import, where you
> have "updated" timestamps in your DB and a "last-import" timestamp stored
> somewhere. You can check how it works in DIH.
>
>
> >
> > I tried Google but couldn't find a solution there, although many people
> > have encountered this problem.
> >
> >
> It can definitely be done by overriding
> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
> starting by implementing your own
> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and
> bypass the chain call if it's found. Then, if you hit performance issues
> querying your PKs one by one (but only after that), you can batch your
> searches; there are a couple of optimization techniques for huge
> disjunction queries like PK:(2 OR 4 OR 5 OR 6).
>
>
> > I am starting to think that I must query the index to check whether a
> > doc to be added is already there, and leave it out of the array if so,
> > but I have so many docs that I am afraid it's not a good solution.
> >
> > Best Regards
> > Alexander Aristov
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
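
The "search for PK, skip if found, batch the lookups" idea from the quoted reply can be sketched in plain Java as below. This is only an illustration under stated assumptions, not Solr or SolrJ code: the class SkipExistingDocs, its methods, and the findExisting callback are hypothetical names, and the callback stands in for whatever actually queries the index (for example a SolrJ query built from buildIdQuery). IDs are quoted in the generated query, so it looks like PK:("2" OR "4") rather than PK:(2 OR 4).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Solr or SolrJ.
class SkipExistingDocs {

    /** Builds one disjunction query for a batch of IDs, e.g. PK:("2" OR "4"). */
    static String buildIdQuery(String field, List<String> ids) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            // Quote each ID and escape backslashes and quotes inside it.
            String escaped = ids.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(')').toString();
    }

    /**
     * Returns the docs whose ID is not yet known to the index.
     * findExisting is an assumed callback: given a batch of IDs, it returns
     * the subset that is already indexed.
     */
    static <D> List<D> filterNew(List<D> docs, Function<D, String> idOf, int batchSize,
                                 Function<List<String>, Set<String>> findExisting) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        // Drop duplicates inside the incoming list too; the first occurrence wins.
        Map<String, D> byId = new LinkedHashMap<>();
        for (D doc : docs) {
            byId.putIfAbsent(idOf.apply(doc), doc);
        }
        // Look up IDs in batches instead of one query per document.
        List<String> ids = new ArrayList<>(byId.keySet());
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<String> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
            byId.keySet().removeAll(findExisting.apply(batch));
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        // Pretend IDs 2 and 5 are already in the index.
        Set<String> indexed = new HashSet<>(Arrays.asList("2", "5"));
        List<String> incoming = Arrays.asList("2", "4", "5", "6", "4");
        List<String> fresh = filterNew(incoming, d -> d, 2, batch -> {
            Set<String> hit = new HashSet<>(batch);
            hit.retainAll(indexed);
            return hit;
        });
        System.out.println(buildIdQuery("PK", incoming.subList(0, 2)));
        System.out.println(fresh);
    }
}
```

Only the documents returned by filterNew would then be sent to Solr. The same check could live server-side in a custom UpdateRequestProcessor, as suggested in the quoted reply; this sketch keeps it on the client only so that it has no Solr dependencies.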
