lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: deleteDocuments(Term... terms) takes a long time to do nothing.
Date Tue, 17 Dec 2013 15:34:50 GMT
OK I'm glad it's resolved.

Another way to handle the "expire old documents" would be to index
into separate indices by time, and use MultiReader to search all of
them.

E.g. maybe one index per day.  This way, to delete a day just means
you don't pass that index to MultiReader.
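
For illustration, a rough sketch of that layout, assuming one FSDirectory per
daily index (the directory names here are made up):

    // Open one reader per daily index and search them as a single logical index.
    // "Deleting" a day then just means leaving its reader out of this list.
    List<DirectoryReader> readers = new ArrayList<DirectoryReader>();
    for (String day : Arrays.asList("idx-2013-12-15", "idx-2013-12-16", "idx-2013-12-17")) {
      readers.add(DirectoryReader.open(FSDirectory.open(new File(day))));
    }
    MultiReader multi = new MultiReader(readers.toArray(new IndexReader[readers.size()]));
    IndexSearcher searcher = new IndexSearcher(multi);
    // ... run searches ...
    multi.close();  // also closes the sub-readers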

Mike McCandless

http://blog.mikemccandless.com


On Mon, Dec 16, 2013 at 10:42 PM, Jason Corekin <jason.corekin@gmail.com> wrote:
> Mike,
>
>
>
> Thank you for your help.  Below are a few comments to directly reply to
> your questions, but in general your suggestions helped to get me on the
> right track, and I believe that I have been able to solve the Lucene component
> of my problems.  The short answer was that when I had previously tried to
> delete by query, I used the filenames stored in each document as the query,
> which was essentially equivalent to deleting by term.  Your email helped me
> to realize this and in turn change my query to be time range based, which
> now takes seconds to run.
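
For reference, a time-range delete along those lines might look roughly like
the sketch below.  It assumes the documents were indexed with a numeric
LongField named "timestamp"; that field name and the variables are
illustrative, not taken from this thread.

    // Delete every document whose timestamp falls inside [startMillis, endMillis].
    // NumericRangeQuery needs the field to have been indexed as a numeric (trie) field.
    Query expired = NumericRangeQuery.newLongRange("timestamp", startMillis, endMillis, true, true);
    writer.deleteDocuments(expired);
    writer.commit();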
>
>
>
> Thank You
>
>
>
> Jason Corekin
>
>
>
>>It sounds like there are at least two issues.
>>
>>First, that it takes so long to do the delete.
>>
>>Unfortunately, deleting by Term is at heart a costly operation.  It
>>entails up to one disk seek per segment in your index; a custom
>>Directory impl that makes seeking costly would slow things down, or if
>>the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
>>impl is using the OS).  Is seeking somehow costly in your custom Dir
>>impl?
>
>
>
> No, seeks are not slow at all.
>
>>
>>If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
>>per Term, which may actually be expected.
>>
>>How many terms in your index?  Can you run CheckIndex and post the output?
>
> In the main test case that was causing problems I believe that there are
> around 3.7 million terms, and this is tiny in comparison to what will need to
> be held.  Unfortunately I forgot to save the CheckIndex output that I
> created from this test set while the problem was occurring, and now that the
> problem is solved I do not think it is worth going back to recreate it.
>
>
>
>>
>>You could index your ID field using MemoryPostingsFormat, which should
>>be a good speedup, but will consume more RAM.
>>
>>Is it possible to delete by query instead?  Ie, create a query that
>>matches the 460K docs and pass that to
>>IndexWriter.deleteDocuments(Query).
>>
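
A rough sketch of wiring MemoryPostingsFormat onto just that one field via a
per-field codec, assuming Lucene 4.6 (the analyzer and dir variables are
placeholders):

    // Keep the terms of the "FileName" field entirely in RAM (as an FST), so
    // term lookups for deletes avoid disk seeks; other fields keep the default.
    Codec codec = new Lucene46Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("FileName".equals(field)) {
          return new MemoryPostingsFormat();
        }
        return super.getPostingsFormatForField(field);
      }
    };
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    iwc.setCodec(codec);
    IndexWriter writer = new IndexWriter(dir, iwc);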
>
> Thanks so much for this suggestion; I had thought of it on my own.
>
>
>
>>Also, try passing fewer ids at once to Lucene, e.g. break the 460K
>>into smaller chunks.  Lucene buffers up all deleted terms from one
>>call, and then applies them, so my guess is you're using way too much
>>intermediate memory by passing 460K in a single call.
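
If the Term route is kept, the batching could be as simple as the sketch below
(the chunk size is arbitrary):

    // Apply deletes in smaller batches so Lucene never buffers all 460K terms from one call.
    int chunkSize = 10000;
    for (int i = 0; i < terms.length; i += chunkSize) {
      Term[] chunk = Arrays.copyOfRange(terms, i, Math.min(i + chunkSize, terms.length));
      writer.deleteDocuments(chunk);
    }
    writer.commit();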
>
>
>
> This does not seem to be the issue now, but I will keep it in mind.
>
>>
>>Instead of indexing everything into one index, and then deleting tons
>>of docs to "clone" to a new index, why not just index to two separate
>>indices to begin with?
>>
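
In its simplest form that would just be two IndexWriter instances and a
routing check at add time; the isArchived(doc) predicate below is purely
illustrative.

    // Route each document to one of two independent indices at index time,
    // instead of bulk-deleting out of a single combined index later.
    IndexWriter mainWriter = new IndexWriter(mainDir, new IndexWriterConfig(Version.LUCENE_46, analyzer));
    IndexWriter archiveWriter = new IndexWriter(archiveDir, new IndexWriterConfig(Version.LUCENE_46, analyzer));

    IndexWriter target = isArchived(doc) ? archiveWriter : mainWriter;
    target.addDocument(doc);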
>
> The clone idea is only a test; the final design is to be able to copy date
> ranges of data out of the main index and into secondary indexes that will
> be backed up and removed from the main system on a regular interval.  The
> copy component of this idea seems to work just fine; it's getting the
> deletion from the main index to work that is giving me all the trouble.
>
>
>
>>The second issue is that after all that work, nothing in fact changed.
>> For that, I think you should make a small test case that just tries
>>to delete one document, and iterate/debug until that works.  Your
>>StringField indexing line looks correct; make sure you're passing
>>precisely the same field name and value?  Make sure you're not
>>deleting already-deleted documents?  (Your for loop seems to ignore
>>already deleted documents).
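
A tiny self-contained check along those lines, using RAMDirectory (everything
here is just for the test):

    // Index one doc with a StringField, delete it by the exact same Term, and verify.
    Directory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer()));

    Document doc = new Document();
    doc.add(new StringField("FileName", "some/file/name.txt", Field.Store.YES));
    w.addDocument(doc);
    w.commit();

    w.deleteDocuments(new Term("FileName", "some/file/name.txt"));  // field name and value must match exactly
    w.commit();

    DirectoryReader r = DirectoryReader.open(dir);
    System.out.println("numDocs after delete: " + r.numDocs());  // expect 0
    r.close();
    w.close();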
>
>
>
> This was caused by an incorrect use of the underlying data structure.  This
> is partially fixed now and is what I am currently working on.  I have this
> fixed enough to identify that it should no longer be related to Lucene.
>
>
>
>>
>>Mike McCandless
>
>
> On Sat, Dec 14, 2013 at 5:58 PM, Jason Corekin <jason.corekin@gmail.com> wrote:
>
>> Mike,
>>
>> Thanks for the input; it will take me some time to digest and try
>> everything you wrote about.  I will post back the answers to your questions
>> and the results from the suggestions you made once I have gone over
>> everything.  Thanks for the quick reply,
>>
>> Jason
>>
>>
>> On Sat, Dec 14, 2013 at 5:13 PM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>> It sounds like there are at least two issues.
>>>
>>> First, that it takes so long to do the delete.
>>>
>>> Unfortunately, deleting by Term is at heart a costly operation.  It
>>> entails up to one disk seek per segment in your index; a custom
>>> Directory impl that makes seeking costly would slow things down, or if
>>> the OS doesn't have enough RAM to cache the "hot" pages (if your Dir
>>> impl is using the OS).  Is seeking somehow costly in your custom Dir
>>> impl?
>>>
>>> If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
>>> per Term, which may actually be expected.
>>>
>>> How many terms in your index?  Can you run CheckIndex and post the output?
>>>
>>> You could index your ID field using MemoryPostingsFormat, which should
>>> be a good speedup, but will consume more RAM.
>>>
>>> Is it possible to delete by query instead?  Ie, create a query that
>>> matches the 460K docs and pass that to
>>> IndexWriter.deleteDocuments(Query).
>>>
>>> Also, try passing fewer ids at once to Lucene, e.g. break the 460K
>>> into smaller chunks.  Lucene buffers up all deleted terms from one
>>> call, and then applies them, so my guess is you're using way too much
>>> intermediate memory by passing 460K in a single call.
>>>
>>> Instead of indexing everything into one index, and then deleting tons
>>> of docs to "clone" to a new index, why not just index to two separate
>>> indices to begin with?
>>>
>>> The second issue is that after all that work, nothing in fact changed.
>>>  For that, I think you should make a small test case that just tries
>>> to delete one document, and iterate/debug until that works.  Your
>>> StringField indexing line looks correct; make sure you're passing
>>> precisely the same field name and value?  Make sure you're not
>>> deleting already-deleted documents?  (Your for loop seems to ignore
>>> already deleted documents).
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sat, Dec 14, 2013 at 11:38 AM, Jason Corekin <jason.corekin@gmail.com>
>>> wrote:
>>> > I knew that I had forgotten something.  Below is the line that I use to
>>> > create the field that I am trying to use to delete the entries with.  I
>>> > hope this avoids some confusion.  Thank you very much to anyone that takes
>>> > the time to read these messages.
>>> >
>>> > doc.add(new StringField("FileName",filename, Field.Store.YES));
>>> >
>>> >
>>> > On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin <jason.corekin@gmail.com> wrote:
>>> >
>>> >> Let me start by stating that I am almost certain that I am doing something
>>> >> wrong, and that I hope that I am, because if not there is a VERY large bug
>>> >> in Lucene.   What I am trying to do is use the method
>>> >>
>>> >> deleteDocuments(Term... terms)
>>> >>
>>> >> out of the IndexWriter class to delete several Term object arrays, each
>>> >> fed to it via a separate Thread.  Each array has around 460k+ Term objects
>>> >> in it.  The issue is that after running for around 30 minutes or more the
>>> >> method finishes, I then have a commit run, and nothing changes with my files.
>>> >> To be fair, I am running a custom Directory implementation that might be
>>> >> causing problems, but I do not think that this is the case, as I do not even
>>> >> see any of my Directory methods in the stack trace.  In fact, when I
>>> >> set break points inside the delete methods of my Directory implementation
>>> >> they never even get hit. To be clear, replacing the custom Directory
>>> >> implementation with a standard one is not an option due to the nature of
>>> >> the data, which is made up of terabytes of small (1k and less) files.  So,
>>> >> if the issue is in the Directory implementation I have to figure out how to
>>> >> fix it.
>>> >>
>>> >>
>>> >> Below are the pieces of code that I think are relevant to this issue, as
>>> >> well as a copy of the stack trace thread that was doing work when I paused
>>> >> the debug session.  As you are likely to notice, the thread is called a
>>> >> DBCloner because it is being used to clone the underlying index-based
>>> >> database (needed to avoid storing trillions of files directly on disk).  The
>>> >> idea is to duplicate the selected group of terms into a new database and
>>> >> then delete the original terms from the original database.  The duplication
>>> >> works wonderfully, but no matter what I do, including cutting the program
>>> >> down to one thread, I cannot shrink the database, and the time to do the
>>> >> deletes is drastically too long.
>>> >>
>>> >>
>>> >> In an attempt to be as helpful as possible, I will say this.  I have been
>>> >> tracing this problem for a few days and have seen that
>>> >>
>>> >> BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
>>> >>
>>> >> is where the majority of the execution time is spent.  I have also
>>> >> noticed that this method returns false MUCH more often than it returns true.
>>> >> I have been trying to figure out how the mechanics of this process work,
>>> >> just in case the issue was not in my code and I might have been able to
>>> >> find the problem.  But I have yet to find the problem, either in Lucene
>>> >> 4.5.1 or Lucene 4.6.  If anyone has any ideas as to what I might be doing
>>> >> wrong, I would really appreciate reading what you have to say.  Thanks in
>>> >> advance.
>>> >>
>>> >>
>>> >>
>>> >> Jason
>>> >>
>>> >>
>>> >>
>>> >>                 private void cloneDB() throws QueryNodeException {
>>> >>
>>> >>                     Document doc;
>>> >>                     ArrayList<String> fileNames;
>>> >>                     int start = docRanges[(threadNumber * 2)];
>>> >>                     int stop = docRanges[(threadNumber * 2) + 1];
>>> >>
>>> >>                     try {
>>> >>                         fileNames = new ArrayList<String>(docsPerThread);
>>> >>                         for (int i = start; i < stop; i++) {
>>> >>                             doc = searcher.doc(i);
>>> >>                             try {
>>> >>                                 adder.addDoc(doc);
>>> >>                                 fileNames.add(doc.get("FileName"));
>>> >>                             } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
>>> >>                                 adder.txnAbort();
>>> >>                                 System.err.println(Thread.currentThread().getName() + ": Adding a message failed, retrying.");
>>> >>                             }
>>> >>                         }
>>> >>                         deleters[threadNumber].deleteTerms("FileName", fileNames);
>>> >>                         deleters[threadNumber].commit();
>>> >>                     } catch (IOException | ParseException ex) {
>>> >>                         Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
>>> >>                     }
>>> >>                 }
>>> >>
>>> >>                 public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {
>>> >>                     Term[] terms = new Term[fieldTexts.size()];
>>> >>                     for (int i = 0; i < fieldTexts.size(); i++) {
>>> >>                         terms[i] = new Term(dbField, fieldTexts.get(i));
>>> >>                     }
>>> >>                     writer.deleteDocuments(terms);
>>> >>                 }
>>> >>
>>> >>                 public void deleteDocuments(Term... terms) throws IOException
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>                 Thread [DB Cloner 2] (Suspended)
>>> >>                     owns: BufferedUpdatesStream  (id=54)
>>> >>                     owns: IndexWriter  (id=49)
>>> >>                     FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
>>> >>                     FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
>>> >>                     BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
>>> >>                     BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
>>> >>                     BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
>>> >>                     IndexWriter.applyAllDeletesAndUpdates() line: 3112
>>> >>                     IndexWriter.applyDeletesAndPurge(boolean) line: 4641
>>> >>                     DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
>>> >>                     IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
>>> >>                     IndexWriter.processEvents(boolean, boolean) line: 4657
>>> >>                     IndexWriter.deleteDocuments(Term...) line: 1421
>>> >>                     DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
>>> >>                     DBCloner.cloneDB() line: 233
>>> >>                     DBCloner.run() line: 133
>>> >>                     Thread.run() line: 744
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>>
>>>
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

