lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Unexpected returning false from IndexWriter.tryDeleteDocument
Date Sat, 21 Dec 2013 13:13:17 GMT
OK I see; so deleting by Term or Query is a no-go.  I suppose the
"retry" approach is actually fine: deleting by docID should be so fast
that having to retry when any single docID fails is probably still
plenty fast.  Out of curiosity, if you have the numbers handy, how
much time does it take to do all of your deletions (when it succeeds)?
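
For reference, here is a minimal sketch of that retry approach (assuming
the Lucene 4.4 APIs discussed in this thread; the "serialId" field name
comes from your description, and the dedupePass/dedupe method names are
just illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.Bits;

public class DedupeByDocId {

  // Returns true if every duplicate was deleted; returns false if any
  // tryDeleteDocument call failed (e.g. because the segment holding that
  // docID was merged away), in which case the caller reopens and retries.
  static boolean dedupePass(IndexWriter writer, DirectoryReader reader)
      throws IOException {
    Set<String> seen = new HashSet<String>();
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    boolean allDeleted = true;
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
      if (liveDocs != null && !liveDocs.get(docId)) {
        continue;  // already deleted in an earlier pass
      }
      String serialId = reader.document(docId).get("serialId");
      if (!seen.add(serialId)) {
        // Duplicate serialId: delete by docID, keeping the first occurrence.
        if (!writer.tryDeleteDocument(reader, docId)) {
          allDeleted = false;  // reader was stale for this segment
        }
      }
    }
    return allDeleted;
  }

  static void dedupe(IndexWriter writer) throws IOException {
    DirectoryReader reader = DirectoryReader.open(writer, true);
    try {
      while (!dedupePass(writer, reader)) {
        // Reopen against the writer (applying buffered deletes) and retry.
        DirectoryReader newReader =
            DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
          reader.close();
          reader = newReader;
        }
      }
    } finally {
      reader.close();
    }
  }
}

Each retry pass rescans from docID 0, which should still be cheap relative
to the deletes themselves.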

Maybe try to prevent indexing so many duplicate documents in the
first place?  But I assume that's hard for some reason.

You could also make a FilterAtomicReader subclass that filters out the
duplicates, and pass that to addIndexes(IndexReader[]) to build the new
de-duped index.  There is also DuplicateFilter in the sandbox module...
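
A rough sketch of that alternative (this assumes Lucene 4.x, where the
filter class is FilterAtomicReader and the writer method is
addIndexes(IndexReader...); building the per-segment "keep" bits from a
serialId pass is left out, and the sketch assumes the input segments have
no prior deletions):

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Exposes only the documents marked in "keep" as live, so that
// IndexWriter.addIndexes(IndexReader...) copies just those documents
// into the new, de-duped index.
class DedupedReader extends FilterAtomicReader {
  private final FixedBitSet keep;  // one bit set per docID to keep

  DedupedReader(AtomicReader in, FixedBitSet keep) {
    super(in);
    this.keep = keep;
  }

  @Override
  public Bits getLiveDocs() {
    return keep;
  }

  @Override
  public int numDocs() {
    return keep.cardinality();
  }
}

// Usage sketch: wrap each leaf of the source reader in a DedupedReader and
// add them all at once to a fresh writer:
//   newWriter.addIndexes(dedupedReadersForEachLeaf);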

Indeed, I don't think tryDeleteDocument will ever trigger a new merge.
But are you certain merges were not already running when you started?
Maybe call IW.waitForMerges first?  And turn on the infoStream ...
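
For what it's worth, a small sketch of that setup (assuming Lucene 4.4; the
KeywordAnalyzer and the openWriter method name are just placeholders for
whatever your config actually uses):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OpenWriterForDedupe {
  static IndexWriter openWriter(File indexdir) throws IOException {
    // Log merges, flushes and deletes to stdout so you can see exactly
    // which merge (if any) pulls a segment out from under tryDeleteDocument.
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_44, new KeywordAnalyzer());
    config.setInfoStream(System.out);

    IndexWriter writer = new IndexWriter(FSDirectory.open(indexdir), config);

    // Block until any merges already running in this writer have finished
    // before starting the tryDeleteDocument pass.
    writer.waitForMerges();
    return writer;
  }
}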



Mike McCandless

http://blog.mikemccandless.com


On Fri, Dec 20, 2013 at 1:50 PM, Derek Lewis <derek@lewisd.com> wrote:
> I'll see if I can explain the scenario a bit more simply in a moment, but
> there's one other thing I thought worth mentioning.
>
> I'm not sure it's possible for me to fall back to Term/Query deleting.
> Basically, if there are two documents in the index that have the same
> serialId, it's the result of the same thing being indexed twice, so all
> the terms are going to be the same.  If I understand right, the fallback
> method of deletion would then delete all the identical documents.  I need to
> leave one (and only one) document in the index for each serialId, so I
> think deleting by docId is my only option.
>
> A simpler (though incomplete) description of the scenario:
>
> I have an index containing a bunch of segments, with millions of documents,
> each with a unique ID in a docField.  However, due to some other
> conditions, I've ended up with some input documents indexed multiple
> (hundreds or more) times, with the same serialId.  I need to remove all
> those duplicates when I merge the indexes.
>
> The code I have that does this (samples in the original email) never
> explicitly adds any documents to the index; it just opens the reader from
> the writer, calls tryDeleteDocument probably millions of times, and then
> force-merges everything.  Somewhere along this process, while I'm still
> doing the deletes, it appears a segment is being merged away.  I've walked
> through the code for tryDeleteDocument, and the things it calls, fairly
> deeply, and I can't figure out why it would be merging away segments.  I've
> tried creating some test scenarios, but I never see it happen.
>
>
> On Fri, Dec 20, 2013 at 10:28 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> I couldn't quite follow the scenario ... but if there's any chance a
>> merge could run in that IndexWriter it can lead to this.  Could it
>> just be a merge that was running already at the start of your deletion
>> process?
>>
>> Maybe turn on IndexWriter's infoStream to see what merges are kicking off?
>>
>> Really, your app should not consider this an "error" (it sounds like
>> it throws an exception and retries later until it succeeds)...
>> it's better to delete those documents "the old-fashioned way".
>> Relying on when IW starts/finishes merges is fragile (it's an
>> implementation detail...).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Dec 20, 2013 at 1:06 PM, Derek Lewis <derek@lewisd.com> wrote:
>> > Hi Mike,
>> >
>> > Thanks for the response.  I realize that merging could cause segments to
>> > be deleted, resulting in tryDeleteDocument returning false.  However, I've
>> > been unable to figure out why the scenario I've described would cause
>> > segments to be merged.  I've tried duplicating it by writing indexes with
>> > many segments and deleting all the documents in them, but I haven't had
>> > any luck.
>> >
>> > Can you suggest any ways the scenario I've outlined would cause merges?
>> >
>> > Cheers,
>> > Derek
>> >
>> >
>> > On Fri, Dec 20, 2013 at 9:50 AM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> tryDeleteDocument will return false if the IndexReader is "stale",
>> >> i.e. the segment that contains the docID you are trying to delete has
>> >> been merged by IndexWriter.
>> >>
>> >> In this case you need to fallback to deleting by Term/Query.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Fri, Dec 20, 2013 at 12:12 PM, Derek Lewis <derek@lewisd.com> wrote:
>> >> > Hello,
>> >> >
>> >> > I have a problem where IndexWriter.tryDeleteDocument is returning
>> >> > false unexpectedly.  Unfortunately, it's in production, on indexes that
>> >> > have since been merged and shunted around all over, and I've been unable
>> >> > to create a scenario that duplicates the problem in any development
>> >> > environments.  It also means I haven't been able to find out exact
>> >> > details about the scenario, so some of this is extrapolation.
>> >> >
>> >> > The basic scenario is, I think, this:
>> >> > There is a Lucene index with millions of documents, and a bunch of
>> >> > segments.  Each of the documents has an associated "serialId" stored.
>> >> > There are many many duplicates, due to a transient error that occurred.
>> >> > Our system attempts to perform a process whereby it merges the index
>> >> > segments, and deletes the documents with duplicate serialIds, so that
>> >> > at the end of the process, we have only one segment, and for each
>> >> > serialId there is only one document.
>> >> >
>> >> > We have an IndexWriter we created with:
>> >> > writer = new IndexWriter(
>> >> >                     FSDirectory.open(indexdir),
>> >> >                     config);
>> >> >
>> >> > We create a DirectoryReader:
>> >> > final DirectoryReader nearRealtimeReader =
>> >> >     DirectoryReader.open(writer, false);
>> >> >
>> >> > which we use to iterate over the documents with:
>> >> > for (int docId = 0; docId < nearRealtimeReader.maxDoc(); ++docId) {
>> >> >
>> >> > For any document whose serialId indicates it's a duplicate (i.e. we've
>> >> > already seen that serialId), we delete it:
>> >> > final boolean deletionSuccessful =
>> >> >     writer.tryDeleteDocument(nearRealtimeReader, docId);
>> >> >
>> >> > This works the vast majority of the time; however, in this case, which I
>> >> > haven't been able to reproduce, it returns false, which we check, and
>> >> > throw an exception.
>> >> >
>> >> > What I found particularly interesting is that when our system
>> >> > re-schedules this process and tries again, it eventually succeeds,
>> >> > despite nothing else in our system writing to this index in the
>> >> > meantime.  (Before indexes are shunted off to this merging process,
>> >> > they're "closed" to the rest of the system.)  This seems to hint to me
>> >> > that maybe something is merging the segments of this index, even though
>> >> > we throw an exception before we get to the part of our code that calls:
>> >> > writer.forceMerge(1, true);
>> >> > writer.commit();
>> >> >
>> >> > Any ideas as to why this might be happening?
>> >> >
>> >> > We're using Lucene 4.4.0, on Java 7 64-bit, on Solaris.
>> >>

