lucene-java-user mailing list archives

From Jason Corekin <jason.core...@gmail.com>
Subject deleteDocuments(Term... terms) takes a long time to do nothing.
Date Sat, 14 Dec 2013 06:28:59 GMT
Let me start by stating that I am almost certain that I am doing something
wrong, and I hope that I am, because if not there is a VERY large bug
in Lucene.  What I am trying to do is use the method


deleteDocuments(Term... terms)


from the IndexWriter class to delete several arrays of Term objects, each
fed to it via a separate thread.  Each array has 460k+ Term objects in it.
The issue is that the method finishes only after running for 30 minutes or
more; I then run a commit, and nothing changes in my files.
To be fair, I am running a custom Directory implementation that might be
causing problems, but I do not think that this is the case, as I do not even
see any of my Directory methods in the stack trace.  In fact, when I set
breakpoints inside the delete methods of my Directory implementation, they
never get hit.  To be clear, replacing the custom Directory
implementation with a standard one is not an option due to the nature of
the data, which is made up of terabytes of small (1k and less) files.  So,
if the issue is in the Directory implementation, I have to figure out how to
fix it.
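
One possible reason the files look unchanged and the Directory delete methods are never called: Lucene does not rewrite a segment's data files when documents are deleted. It records the deletions in a per-segment live-docs bitset and reclaims the space only when segments are later merged. The sketch below is a toy model of that behavior in plain Java, not Lucene code; the class and method names (LiveDocsModel, storedCount, liveCount, merge) are made up for illustration.

```java
import java.util.BitSet;

public class LiveDocsModel {
    private final String[] storedDocs; // stands in for a segment's immutable data files
    private final BitSet liveDocs;     // one bit per document, set while the document is live

    public LiveDocsModel(String[] docs) {
        this.storedDocs = docs.clone();
        this.liveDocs = new BitSet(docs.length);
        this.liveDocs.set(0, docs.length);
    }

    /** Marks every live document with the given name as deleted; stored data is untouched. */
    public int delete(String name) {
        int deleted = 0;
        for (int i = 0; i < storedDocs.length; i++) {
            if (liveDocs.get(i) && storedDocs[i].equals(name)) {
                liveDocs.clear(i);
                deleted++;
            }
        }
        return deleted;
    }

    /** Number of documents still physically stored in this toy segment. */
    public int storedCount() {
        return storedDocs.length;
    }

    /** Number of documents not marked as deleted. */
    public int liveCount() {
        return liveDocs.cardinality();
    }

    /** Copies only the live documents into a new toy segment; this is where space is reclaimed. */
    public LiveDocsModel merge() {
        String[] kept = new String[liveCount()];
        int k = 0;
        for (int i = liveDocs.nextSetBit(0); i >= 0; i = liveDocs.nextSetBit(i + 1)) {
            kept[k++] = storedDocs[i];
        }
        return new LiveDocsModel(kept);
    }

    public static void main(String[] args) {
        LiveDocsModel segment = new LiveDocsModel(new String[] {"a.txt", "b.txt", "c.txt"});
        segment.delete("b.txt");
        System.out.println("stored=" + segment.storedCount() + " live=" + segment.liveCount());
        System.out.println("stored after merge=" + segment.merge().storedCount());
    }
}
```

In Lucene itself, IndexWriter.forceMergeDeletes() asks the writer to merge away segments that contain deletions. Whether that is practical for an index of this size is a separate question.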


Below are the pieces of code that I think are relevant to this issue, as
well as a copy of the stack trace of the thread that was doing work when I
paused the debug session.  As you are likely to notice, the thread is called
a DBCloner because it is being used to clone the underlying index-based
database (needed to avoid storing trillions of files directly on disk).  The
idea is to duplicate the selected group of terms into a new database and
then delete the original terms from the original database.  The duplication
works wonderfully, but no matter what I do, including cutting the program
down to one thread, I cannot shrink the database, and the deletes take
drastically too long.


In an attempt to be as helpful as possible, I will say this.  I have been
tracing this problem for a few days and have seen that

BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)

is where the majority of the execution time is spent.  I have also noticed
that this method returns false MUCH more often than it returns true.  I have
been trying to figure out how the mechanics of this process work, in case
the issue is not in my code and I am able to find the problem myself.  But
I have yet to find the problem in either Lucene 4.5.1 or
Lucene 4.6.  If anyone has any ideas as to what I might be doing wrong, I
would really appreciate reading what you have to say.  Thanks in advance.
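
A possible reading of the many false returns from seekExact: the stack trace shows applyTermDeletes being called with a single SegmentReader, so each buffered delete term is looked up once per segment. If a FileName value is unique to one document, it can match in at most one segment and must miss in all the others. The sketch below is a toy model in plain Java under exactly those assumptions (unique names, documents spread evenly over segments). It uses sorted sets in place of Lucene's terms dictionary, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SegmentLookupModel {
    /** Counters for exact-term lookups across all toy segments. */
    public static final class Counts {
        public long hits;
        public long misses;
    }

    /** Distributes docCount unique names round-robin over segmentCount toy segments. */
    public static List<TreeSet<String>> buildSegments(int segmentCount, int docCount) {
        List<TreeSet<String>> segments = new ArrayList<>();
        for (int s = 0; s < segmentCount; s++) {
            segments.add(new TreeSet<>());
        }
        for (int d = 0; d < docCount; d++) {
            segments.get(d % segmentCount).add("file-" + d);
        }
        return segments;
    }

    /** Looks up every delete term in every segment, as a per-segment delete pass would. */
    public static Counts lookupAll(List<TreeSet<String>> segments, List<String> deleteTerms) {
        Counts counts = new Counts();
        for (TreeSet<String> segment : segments) {
            for (String term : deleteTerms) {
                if (segment.contains(term)) {
                    counts.hits++;
                } else {
                    counts.misses++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> segments = buildSegments(10, 1000);
        List<String> terms = new ArrayList<>();
        for (int d = 0; d < 1000; d++) {
            terms.add("file-" + d);
        }
        Counts c = lookupAll(segments, terms);
        // With 10 segments, each unique term hits once and misses nine times.
        System.out.println("hits=" + c.hits + " misses=" + c.misses);
    }
}
```

Under this model the number of lookups grows with (terms x segments), so both the miss ratio and the total time rise with the segment count. That would not by itself rule out a second problem, such as the delete terms not matching the indexed form of the field.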



Jason



    private void cloneDB() throws QueryNodeException {
        Document doc;
        ArrayList<String> fileNames;
        int start = docRanges[(threadNumber * 2)];
        int stop = docRanges[(threadNumber * 2) + 1];

        try {
            fileNames = new ArrayList<String>(docsPerThread);
            for (int i = start; i < stop; i++) {
                doc = searcher.doc(i);
                try {
                    adder.addDoc(doc);
                    fileNames.add(doc.get("FileName"));
                } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
                    adder.txnAbort();
                    System.err.println(Thread.currentThread().getName() + ": Adding a message failed, retrying.");
                }
            }
            deleters[threadNumber].deleteTerms("FileName", fileNames);
            deleters[threadNumber].commit();
        } catch (IOException | ParseException ex) {
            Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
        }
    }





    public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {
        Term[] terms = new Term[fieldTexts.size()];
        for (int i = 0; i < fieldTexts.size(); i++) {
            terms[i] = new Term(dbField, fieldTexts.get(i));
        }
        writer.deleteDocuments(terms);
    }
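
The deleteTerms helper above turns every collected name into one large Term array. Document.get returns null when a document has no stored value for the requested field, so the list may contain nulls, and it may also contain duplicates. A minimal sketch, assuming it is acceptable to drop both, of cleaning the list and splitting it into smaller batches so that each deleteDocuments call can be timed and logged on its own. The class and method names here (DeleteBatches, clean, partition) are hypothetical, and batching is offered only as a diagnostic aid, not as a known fix for the slowness.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DeleteBatches {
    /** Drops nulls and duplicate names, keeping first-seen order. */
    public static List<String> clean(List<String> names) {
        LinkedHashSet<String> unique = new LinkedHashSet<>();
        for (String name : names) {
            if (name != null) {
                unique.add(name);
            }
        }
        return new ArrayList<>(unique);
    }

    /** Splits the list into consecutive batches of at most batchSize elements. */
    public static List<List<String>> partition(List<String> names, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int from = 0; from < names.size(); from += batchSize) {
            int to = Math.min(from + batchSize, names.size());
            // Copy the sub-range so each batch is independent of the source list.
            batches.add(new ArrayList<>(names.subList(from, to)));
        }
        return batches;
    }
}
```

Each batch could then be passed to deleteTerms in turn, with a timestamp logged before and after the call, to see whether the time per batch stays constant or grows.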



    public void deleteDocuments(Term... terms) throws IOException





    Thread [DB Cloner 2] (Suspended)
        owns: BufferedUpdatesStream  (id=54)
        owns: IndexWriter  (id=49)
        FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
        FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
        BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
        BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
        BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
        IndexWriter.applyAllDeletesAndUpdates() line: 3112
        IndexWriter.applyDeletesAndPurge(boolean) line: 4641
        DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
        IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
        IndexWriter.processEvents(boolean, boolean) line: 4657
        IndexWriter.deleteDocuments(Term...) line: 1421
        DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
        DBCloner.cloneDB() line: 233
        DBCloner.run() line: 133
        Thread.run() line: 744
