lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: deleteDocuments(Term... terms) takes a long time to do nothing.
Date Sat, 14 Dec 2013 22:13:56 GMT
It sounds like there are at least two issues.

First, that it takes so long to do the delete.

Unfortunately, deleting by Term is at heart a costly operation.  It
entails up to one disk seek per segment in your index; a custom
Directory impl that makes seeking costly would slow things down, as
would an OS that doesn't have enough RAM to cache the "hot" pages (if
your Dir impl delegates to the OS).  Is seeking somehow costly in your
custom Dir impl?

If you are deleting ~1M terms in ~30 minutes, that works out to ~2 msec
per Term (30 minutes is 1,800,000 msec, spread over ~1,000,000 terms),
which may actually be expected.

How many terms are in your index?  Can you run CheckIndex and post the output?
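For reference, a minimal sketch of running CheckIndex programmatically
(assuming "dir" is the Directory your index lives in; the command-line
tool org.apache.lucene.index.CheckIndex prints the same report):

    CheckIndex checker = new CheckIndex(dir);
    checker.setInfoStream(System.out);                  // print the per-segment report
    CheckIndex.Status status = checker.checkIndex();
    System.out.println("clean=" + status.clean + ", segments=" + status.numSegments);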

You could index your ID field using MemoryPostingsFormat, which should
be a good speedup, but will consume more RAM.
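A minimal sketch of wiring that up, assuming Lucene 4.6 and that
"FileName" is the field you delete by (analyzer and dir are whatever
you already use):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    iwc.setCodec(new Lucene46Codec() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
            if ("FileName".equals(field)) {
                return new MemoryPostingsFormat();   // keep this field's terms fully in RAM
            }
            return super.getPostingsFormatForField(field);
        }
    });
    IndexWriter writer = new IndexWriter(dir, iwc);

Note that already-written segments only pick up the new format as they
are rewritten (re-indexed or merged).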

Is it possible to delete by query instead?  I.e., create a query that
matches the 460K docs and pass that to
IndexWriter.deleteDocuments(Query).
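For example, if the 460K docs you are cloning all share some marker
value (purely hypothetical here; "sourceDB" and "toClone" are made-up
names), a single TermQuery covers them all:

    writer.deleteDocuments(new TermQuery(new Term("sourceDB", "toClone")));
    writer.commit();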

Also, try passing fewer IDs at once to Lucene, e.g. break the 460K
into smaller chunks.  Lucene buffers up all deleted terms from one
call, and then applies them, so my guess is you're using way too much
intermediate memory by passing 460K in a single call.
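Something along these lines, where the batch size of 10,000 is only a
starting point to experiment with:

    private static final int BATCH = 10000;

    void deleteInBatches(IndexWriter writer, List<Term> terms) throws IOException {
        for (int i = 0; i < terms.size(); i += BATCH) {
            List<Term> chunk = terms.subList(i, Math.min(i + BATCH, terms.size()));
            writer.deleteDocuments(chunk.toArray(new Term[chunk.size()]));
        }
    }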

Instead of indexing everything into one index, and then deleting tons
of docs to "clone" to a new index, why not just index to two separate
indices to begin with?
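I.e., something like this at indexing time, assuming you can tell up
front which index a document belongs in (keepDoc is a hypothetical
predicate of yours):

    IndexWriter mainWriter  = new IndexWriter(mainDir,  new IndexWriterConfig(Version.LUCENE_46, analyzer));
    IndexWriter cloneWriter = new IndexWriter(cloneDir, new IndexWriterConfig(Version.LUCENE_46, analyzer));

    // route each document once, instead of indexing everything and deleting afterwards
    (keepDoc(doc) ? mainWriter : cloneWriter).addDocument(doc);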

The second issue is that after all that work, nothing in fact changed.
 For that, I think you should make a small test case that just tries
to delete one document, and iterate/debug until that works.  Your
StringField indexing line looks correct; are you sure you're passing
precisely the same field name and value?  And are you sure you're not
deleting already-deleted documents?  (Your for loop doesn't appear to
check whether a document has already been deleted.)
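Roughly like this, assuming Lucene 4.6; a throwaway RAMDirectory keeps
your custom Directory out of the picture so you can first confirm the
delete logic itself:

    Directory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer()));

    Document doc = new Document();
    doc.add(new StringField("FileName", "some-file.txt", Field.Store.YES));
    w.addDocument(doc);
    w.commit();

    w.deleteDocuments(new Term("FileName", "some-file.txt"));
    w.commit();

    IndexReader r = DirectoryReader.open(dir);
    System.out.println("numDocs after delete = " + r.numDocs());   // should print 0
    r.close();
    w.close();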

Mike McCandless

http://blog.mikemccandless.com


On Sat, Dec 14, 2013 at 11:38 AM, Jason Corekin <jason.corekin@gmail.com> wrote:
> I knew that I had forgotten something.  Below is the line that I use to
> create the field that I am trying to use to delete the entries.  I hope
> this avoids some confusion.  Thank you very much to anyone who takes
> the time to read these messages.
>
> doc.add(new StringField("FileName",filename, Field.Store.YES));
>
>
> On Sat, Dec 14, 2013 at 1:28 AM, Jason Corekin <jason.corekin@gmail.com> wrote:
>
>> Let me start by stating that I am almost certain that I am doing
>> something wrong, and I hope that I am, because if not there is a VERY
>> large bug in Lucene.  What I am trying to do is use the method
>>
>>
>> deleteDocuments(Term... terms)
>>
>>
>> out of the IndexWriter class to delete several arrays of Term objects,
>> each fed to it via a separate thread.  Each array has around 460k+ Term
>> objects in it.  The issue is that after running for around 30 minutes
>> or more the method finishes, I then run a commit, and nothing changes
>> in my files.  To be fair, I am running a custom Directory
>> implementation that might be causing problems, but I do not think that
>> this is the case, as I do not even see any of my Directory methods in
>> the stack trace.  In fact, when I set break points inside the delete
>> methods of my Directory implementation, they never even get hit.  To be
>> clear, replacing the custom Directory implementation with a standard
>> one is not an option due to the nature of the data, which is made up of
>> terabytes of small (1k and less) files.  So, if the issue is in the
>> Directory implementation, I have to figure out how to fix it.
>>
>>
>> Below are the pieces of code that I think are relevant to this issue,
>> as well as a copy of the stack trace of the thread that was doing work
>> when I paused the debug session.  As you are likely to notice, the
>> thread is called a DBCloner because it is being used to clone the
>> underlying index-based database (needed to avoid storing trillions of
>> files directly on disk).  The idea is to duplicate the selected group
>> of terms into a new database and then delete the original terms from
>> the original database.  The duplication works wonderfully, but no
>> matter what I do, including cutting the program down to one thread, I
>> cannot shrink the database, and the deletes take drastically too long.
>>
>>
>> In an attempt to be as helpful as possible, I will say this.  I have
>> been tracing this problem for a few days and have seen that
>>
>> BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)
>>
>> is where the majority of the execution time is spent.  I have also
>> noticed that this method returns false MUCH more often than it returns
>> true.  I have been trying to figure out how the mechanics of this
>> process work, just in case the issue was not in my code and I might
>> have been able to find the problem.  But I have yet to find the
>> problem, either in Lucene 4.5.1 or Lucene 4.6.  If anyone has any ideas
>> as to what I might be doing wrong, I would really appreciate reading
>> what you have to say.  Thanks in advance.
>>
>>
>>
>> Jason
>>
>>
>>
>> private void cloneDB() throws QueryNodeException {
>>
>>     Document doc;
>>     ArrayList<String> fileNames;
>>     int start = docRanges[(threadNumber * 2)];
>>     int stop = docRanges[(threadNumber * 2) + 1];
>>
>>     try {
>>         fileNames = new ArrayList<String>(docsPerThread);
>>         for (int i = start; i < stop; i++) {
>>             doc = searcher.doc(i);
>>             try {
>>                 adder.addDoc(doc);
>>                 fileNames.add(doc.get("FileName"));
>>             } catch (TransactionExceptionRE | TransactionException
>>                     | LockConflictException te) {
>>                 adder.txnAbort();
>>                 System.err.println(Thread.currentThread().getName()
>>                         + ": Adding a message failed, retrying.");
>>             }
>>         }
>>         deleters[threadNumber].deleteTerms("FileName", fileNames);
>>         deleters[threadNumber].commit();
>>     } catch (IOException | ParseException ex) {
>>         Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
>>     }
>> }
>>
>>
>> public void deleteTerms(String dbField, ArrayList<String> fieldTexts)
>>         throws IOException {
>>     Term[] terms = new Term[fieldTexts.size()];
>>     for (int i = 0; i < fieldTexts.size(); i++) {
>>         terms[i] = new Term(dbField, fieldTexts.get(i));
>>     }
>>     writer.deleteDocuments(terms);
>> }
>>
>>
>>
>> public void deleteDocuments(Term... terms) throws IOException
>>
>>
>>
>>
>>
>> Thread [DB Cloner 2] (Suspended)
>>     owns: BufferedUpdatesStream  (id=54)
>>     owns: IndexWriter  (id=49)
>>     FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
>>     FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
>>     BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
>>     BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
>>     BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
>>     IndexWriter.applyAllDeletesAndUpdates() line: 3112
>>     IndexWriter.applyDeletesAndPurge(boolean) line: 4641
>>     DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
>>     IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
>>     IndexWriter.processEvents(boolean, boolean) line: 4657
>>     IndexWriter.deleteDocuments(Term...) line: 1421
>>     DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
>>     DBCloner.cloneDB() line: 233
>>     DBCloner.run() line: 133
>>     Thread.run() line: 744
>>
>>
>>
>>
>>
>>
>>


