lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: solr keep old docs
Date Tue, 27 Dec 2011 18:53:26 GMT
Mikhail is right as far as I know: the assumption built into Solr is that
a duplicate ID (when <uniqueKey> is defined) should cause the old
document to be replaced.

What is your system-of-record? By that I mean what does your SolrJ
program do to send data to Solr? Is there any way you could just
*not* send documents that are already in the Solr index based on,
for instance, any timestamp associated with your system-of-record
and the last time you did an incremental index?
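
A minimal sketch of that idea, assuming each record in the system-of-record
carries a last-modified timestamp and that the time of the previous
incremental run is stored somewhere. The names IncrementalFilter, SourceDoc
and changedSince are hypothetical and are not part of SolrJ:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: filters candidates before anything is sent to Solr.
final class IncrementalFilter {

    /** Minimal stand-in for one record in the system-of-record (assumed shape). */
    static final class SourceDoc {
        final String id;
        final Instant lastModified;

        SourceDoc(String id, Instant lastModified) {
            this.id = id;
            this.lastModified = lastModified;
        }
    }

    /** Returns only the docs modified strictly after the previous indexing run. */
    static List<SourceDoc> changedSince(List<SourceDoc> candidates, Instant lastRun) {
        List<SourceDoc> toSend = new ArrayList<>();
        for (SourceDoc doc : candidates) {
            if (doc.lastModified.isAfter(lastRun)) {
                toSend.add(doc);
            }
        }
        return toSend;
    }
}
```

Only the returned list would then be converted to Solr documents and added,
so documents already indexed in an earlier run are never sent again.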

Best
Erick

On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
<alexander.aristov@gmail.com> wrote:
> Hi
>
> I am not using a database. All the needed data is in the Solr index, which is why
> I want to skip excessive checks.
>
> I will check DIH, but I am not sure it will help.
>
> I am fluent in Java and it's not a problem for me to write a class or so,
> but I want to check first whether there are any ways (workarounds) to make
> it work without coding, just by playing around with configuration and
> params. I don't want to move away from the default Solr implementation.
>
> Best Regards
> Alexander Aristov
>
>
> On 27 December 2011 09:33, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:
>
>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> alexander.aristov@gmail.com> wrote:
>>
>> > Hi people,
>> >
>> > I urgently need your help!
>> >
>> > I have Solr 3.3 configured and running. I do incremental indexing 4 times a
>> > day using bulk updates. Some documents are identical to some extent and I
>> > wish to skip them rather than index them.
>> > But here is the problem: I could not find a way to tell Solr to ignore new
>> > duplicate docs and keep the old indexed docs. I don't care that a doc is new.
>> > Just determine by ID that such a document is already in the index, and that's it.
>> >
>> > I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe
>> > approach, but nothing helped. Either the newer doc overwrites the
>> > old one or I get a duplicate.
>> >
>> > I think it's a very simple and basic feature and it must exist. What did I
>> > do wrong or fail to do?
>> >
>>
>> I guess that is because the mainstream approach is delta-import, where you have
>> "updated" timestamps in your DB and a "last-import" timestamp stored
>> somewhere. You can check how it works in DIH.
>>
>>
>> >
>> > I tried Google but couldn't find a solution there, although many people
>> > have encountered this problem.
>> >
>> >
>> It can definitely be done by overriding
>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
>> starting by implementing your own
>> http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and bypass
>> the chain call if it's found. Then, if you run into performance issues querying
>> your PKs one by one (but only after that), you can batch your searches;
>> there are a couple of optimization techniques for huge disjunction queries
>> like PK:(2 OR 4 OR 5 OR 6).
>>
>>
>> > I am starting to think that I must query the index to check whether a doc to be
>> > added is already in the index and, if so, not add it to the array, but I have so
>> > many docs that I am afraid it's not a good solution.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
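
The batched lookup mentioned in the quoted message, one disjunction query per
group of PKs instead of one query per PK, could be sketched roughly as below.
This only builds the query strings; sending them to Solr and skipping the IDs
that come back is left out. The names PkBatchQueries, quote and build are
hypothetical, and quoting each ID as a phrase is an assumption made here so
that query-syntax characters inside an ID are not parsed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds queries such as PK:("2" OR "4" OR "5" OR "6").
final class PkBatchQueries {

    /** Wraps one ID in double quotes, escaping backslashes and embedded quotes. */
    static String quote(String id) {
        return "\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Splits ids into groups of at most batchSize and builds one query per group. */
    static List<String> build(String field, List<String> ids, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            StringBuilder q = new StringBuilder(field).append(":(");
            for (int i = start; i < end; i++) {
                if (i > start) {
                    q.append(" OR ");
                }
                q.append(quote(ids.get(i)));
            }
            queries.add(q.append(')').toString());
        }
        return queries;
    }
}
```

The batch size probably needs to stay below the maxBooleanClauses limit
configured in solrconfig.xml, otherwise a large disjunction is likely to be
rejected.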
