lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: problems with bulk indexing with concurrent DIH
Date Mon, 08 Aug 2016 14:43:48 GMT
On 8/2/2016 7:50 AM, Bernd Fehling wrote:
> Only assumption so far, DIH is sending the records as "update" (and
> not pure "add") to the indexer which will generate delete files during
> merge. If the number of segments is high it will take quite long to
> merge and check all records of all segments.

It's not DIH that's handling the requests as "update", it's Solr.  If
you index a document with the same value in the uniqueKey field as a
document that already exists in the index, Solr will delete the old one
before it adds the new one.  This applies to ANY indexing, not just
DIH.  This is how Solr is designed to work -- that's the entire point of
having a uniqueKey.
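
To make the delete-then-add behavior concrete, here is a minimal toy model in Python. It is only a sketch of the semantics described above, not Solr's or Lucene's actual code; the ToyIndex class and its fields are hypothetical names made up for illustration.

```python
# Toy model of uniqueKey overwrite semantics. Illustration only; this is
# NOT how Solr or Lucene are implemented internally.
class ToyIndex:
    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.docs = []        # append-only storage, like docs written to segments
        self.deleted = set()  # positions of docs that have been marked deleted

    def add(self, doc):
        key = doc[self.unique_key]
        # Mark any live doc with the same uniqueKey value as deleted ...
        for pos, old in enumerate(self.docs):
            if pos not in self.deleted and old[self.unique_key] == key:
                self.deleted.add(pos)
        # ... then append the new version.
        self.docs.append(doc)

    def live_docs(self):
        return [d for pos, d in enumerate(self.docs) if pos not in self.deleted]


index = ToyIndex()
index.add({"id": "1", "title": "first version"})
index.add({"id": "1", "title": "second version"})  # same uniqueKey: old doc is marked deleted
index.add({"id": "2", "title": "other doc"})

print(len(index.docs), "stored,", len(index.live_docs()), "live")
```

In this model the old version still occupies space until something like a merge rewrites the storage, which is why re-indexing existing keys produces deletes no matter which client sent the documents.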

I'm not familiar with how a large number of deletes affects merging.  I
would not expect it to have much of a performance impact, and it might
in fact make merging faster, because I'd think that deleted docs would
be skipped.

Turning overwrite off when you are indexing would mean that Solr's
uniqueKey guarantee is lost.  You can end up with duplicate documents in
the Lucene index, and because merging can completely change internal
identifiers, there may be no built-in way for Solr or Lucene to
automatically determine which ones are old or new.
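
A small Python sketch of why that is a problem. Again this is a toy model under stated assumptions, not Lucene's merge logic: segments are plain lists, a deleted slot is None, and the add() and merge() helpers are hypothetical. The order in which merge() receives the segments is arbitrary here, chosen only to show that a document's position after a merge need not reflect when it was indexed.

```python
# Toy model: what happens when the overwrite check is skipped.
# Illustration only; segments are plain lists and None marks a deleted slot.
def add(segments, target, doc, overwrite=True):
    """Append doc to the target segment, optionally deleting older copies."""
    if overwrite:
        for seg in segments:
            for pos, old in enumerate(seg):
                if old is not None and old["id"] == doc["id"]:
                    seg[pos] = None  # mark the older copy as deleted
    target.append(doc)


def merge(*segments):
    """Copy live docs into one new segment; positions are renumbered."""
    return [doc for seg in segments for doc in seg if doc is not None]


# With overwrite on, only the newest copy survives the merge.
a, b = [], []
add([a, b], a, {"id": "1", "title": "old"})
add([a, b], b, {"id": "1", "title": "new"})
with_overwrite = merge(b, a)

# With overwrite off, both copies survive, and their positions in the
# merged segment say nothing about which one was indexed first.
c, d = [], []
add([c, d], c, {"id": "1", "title": "old"}, overwrite=False)
add([c, d], d, {"id": "1", "title": "new"}, overwrite=False)
without_overwrite = merge(d, c)

print(with_overwrite)
print(without_overwrite)
```

If you do index with overwrite off, the only safe assumption is that you are responsible for guaranteeing that no uniqueKey value is ever sent twice.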

I didn't know about LUCENE-6161.  That looks like a nasty bug.

Thanks,
Shawn

