lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: BufferedUpdateStreams breaks high performance indexing
Date Thu, 04 Aug 2016 07:14:21 GMT
After updating to version 5.5.3 it looks good now.
Thanks a lot for your help and advice.

Best regards
Bernd

On 29.07.2016 at 15:04, Michael McCandless wrote:
> The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> when you do deleteDocuments(Term).
> 
> Deleted queries are when you delete by query, but I don't think DIH would
> be doing that unless you asked it to ... maybe a Solr user/dev knows better?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> bernd.fehling@uni-bielefeld.de> wrote:
> 
>> Yes, with default of 10 it performs very much better.
>> I didn't take into account that DIH uses updateDocument for adding new
>> documents but after thinking about the "why" I assume that
>> this might be because you don't know if a document already exists in the
>> index.
>> Conclusion: using DIH and setting segmentsPerTier to a high value is a
>> killer.
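The two factors named above (a delete term buffered per updateDocument call, and many segments to check) can be sketched with a toy model. This is not Lucene code: the class ToyBufferedDeletes and its methods are hypothetical, and the cost rule "one lookup per buffered term per segment" is a first-order assumption, not a measurement. The count 24345 is borrowed from the "push deletes" line quoted in this thread; the segment counts 10 and 100 are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene code, and it does not reproduce IndexWriter internals.
final class ToyBufferedDeletes {
    private final List<String> bufferedDeleteTerms = new ArrayList<>();
    private int bufferedDocs = 0;

    // Pure add: nothing has to be deleted later, so no term is buffered.
    void addDocument(String id) {
        bufferedDocs++;
    }

    // Update means "delete whatever matches id, then add": the id term is
    // buffered even if no document with that id exists yet.
    void updateDocument(String id) {
        bufferedDeleteTerms.add(id);
        bufferedDocs++;
    }

    int bufferedDocCount() {
        return bufferedDocs;
    }

    int bufferedTermCount() {
        return bufferedDeleteTerms.size();
    }

    // Assumed cost rule: every buffered term is looked up once in every segment.
    long lookupsToApply(int segmentCount) {
        return (long) bufferedDeleteTerms.size() * segmentCount;
    }

    public static void main(String[] args) {
        ToyBufferedDeletes adds = new ToyBufferedDeletes();
        ToyBufferedDeletes updates = new ToyBufferedDeletes();
        for (int i = 0; i < 24345; i++) {
            adds.addDocument("id-" + i);
            updates.updateDocument("id-" + i);
        }
        System.out.println("add-only, 100 segments:    " + adds.lookupsToApply(100));
        System.out.println("update-only, 10 segments:  " + updates.lookupsToApply(10));
        System.out.println("update-only, 100 segments: " + updates.lookupsToApply(100));
    }
}
```

In this model the add-only writer never has anything to apply, while the update-only writer's work grows in proportion to the number of segments, which matches the direction of the effect reported in this thread.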
>>
>> One question still remains about messages in INFOSTREAM. I have lines
>> saying:
>> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
>> ...
>> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0
>>
>> Do you know what these deleted terms and deleted queries are?
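To watch how these counters develop over a long indexing run, the "push deletes" lines can be parsed mechanically. The following is a minimal sketch that assumes the line layout is exactly as quoted in this thread; PushDeletesLine is a hypothetical helper, not part of Lucene, and the regular expressions are derived only from the sample line above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: extracts the counters from a
// "push deletes" infoStream line in the layout quoted in this thread.
final class PushDeletesLine {
    private static final Pattern COUNTS = Pattern.compile(
        "push deletes (\\d+) deleted terms \\(unique count=(\\d+)\\)\\s+(\\d+) deleted queries");
    private static final Pattern TOTAL_BYTES = Pattern.compile("totBytesUsed=(\\d+)");

    final long deletedTerms;
    final long uniqueTerms;
    final long deletedQueries;
    final long totBytesUsed;

    private PushDeletesLine(long deletedTerms, long uniqueTerms,
                            long deletedQueries, long totBytesUsed) {
        this.deletedTerms = deletedTerms;
        this.uniqueTerms = uniqueTerms;
        this.deletedQueries = deletedQueries;
        this.totBytesUsed = totBytesUsed;
    }

    static PushDeletesLine parse(String line) {
        // Mail clients wrap long log entries, so collapse all whitespace first.
        String joined = line.replaceAll("\\s+", " ");
        Matcher counts = COUNTS.matcher(joined);
        Matcher total = TOTAL_BYTES.matcher(joined);
        if (!counts.find() || !total.find()) {
            throw new IllegalArgumentException("not a push deletes line: " + joined);
        }
        return new PushDeletesLine(
            Long.parseLong(counts.group(1)),
            Long.parseLong(counts.group(2)),
            Long.parseLong(counts.group(3)),
            Long.parseLong(total.group(1)));
    }

    // Whole mebibytes held by all buffered delete packets (integer division).
    long totalMiB() {
        return totBytesUsed / (1024L * 1024L);
    }

    public static void main(String[] args) {
        String line = "BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345\n"
            + "deleted queries bytesUsed=2313024 delGen=2265 packetCount=69\n"
            + "totBytesUsed=262526720";
        PushDeletesLine p = parse(line);
        System.out.println("terms=" + p.deletedTerms + " queries=" + p.deletedQueries
            + " totalMiB=" + p.totalMiB());
    }
}
```

For the sample line, totBytesUsed=262526720 comes to 250 whole MiB, so roughly a quarter of the 1024 MB RAM buffer is held by buffered deletes rather than by added documents.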
>>
>> Best regards,
>> Bernd
>>
>>
>> On 28.07.2016 at 17:34, Michael McCandless wrote:
>>> Hmm, your merge policy changes are dangerous: they will cause too many
>>> segments in the index, which makes applying deletes take longer.
>>>
>>> Can you revert that and re-test?
>>>
>>> I'm not sure why DIH is using updateDocument instead of addDocument ...
>>> maybe ask on the solr-user list?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> Currently I use concurrent DIH but will write some SolrJ for testing
>>>> or even as a replacement for DIH.
>>>> I don't know what's behind DIH if only documents are added.
>>>>
>>>> I have not tried any newer release yet, but after reading LUCENE-6161 I
>>>> really should, at least a version > 5.1,
>>>> maybe before writing some SolrJ.
>>>>
>>>>
>>>> Yes, IndexWriterConfig is changed from the defaults:
>>>> <indexConfig>
>>>>     <maxIndexingThreads>8</maxIndexingThreads>
>>>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>       <int name="maxMergeAtOnce">8</int>
>>>>       <int name="segmentsPerTier">100</int>
>>>>       <int name="maxMergedSegmentMB">512</int>
>>>>     </mergePolicy>
>>>>     <mergeFactor>8</mergeFactor>
>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>     <lockType>${solr.lock.type:native}</lockType>
>>>>     ...
>>>> </indexConfig>
>>>>
>>>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>>>> Somewhere between 20 and 50 characters in length.
>>>>
>>>> Thanks for your help,
>>>> Bernd
>>>>
>>>>
>>>> On 28.07.2016 at 15:35, Michael McCandless wrote:
>>>>> Hmm not good.
>>>>>
>>>>> If you are really only adding documents, you should be using
>>>>> IndexWriter.addDocument, which won't buffer any deleted terms, and that
>>>>> method call (applyDeletesAndUpdates) should be a no-op.  It also makes
>>>>> flushes more efficient since all of your indexing buffer goes to the
>>>>> added documents, not buffered delete terms.  Are you using updateDocument?
>>>>>
>>>>> Can you reproduce this slowness on a newer release?  There have been
>>>>> performance issues fixed in newer releases in this method, e.g.
>>>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>>>
>>>>> Have you changed any IndexWriterConfig settings from defaults?
>>>>>
>>>>> What are your unique id fields like?  How many bytes in length?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>>>> bernd.fehling@uni-bielefeld.de> wrote:
>>>>>
>>>>>> While trying to get higher performance for indexing, it turned out that
>>>>>> BufferedUpdateStreams is breaking indexing performance:
>>>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>>>
>>>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene 4.10.4
>>>>>> API states:
>>>>>> "Determines the amount of RAM that may be used for buffering added
>>>>>> documents and deletions before they are flushed to the Directory.
>>>>>> Generally for faster indexing performance it's best to flush by RAM
>>>>>> usage instead of document count and use as large a RAM buffer as you can."
>>>>>>
>>>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>>>
>>>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>>>>
>>>>>> That is about 56 minutes with no indexing, only applying deletes.
>>>>>> What is it deleting?
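The "about 56 minutes" figure can be checked in two ways: from the reported 3411845 msec and from the two infoStream timestamps (13:42:03 and 14:38:55). A small sketch follows; ApplyDeletesDuration is a hypothetical helper name, and only java.time from the JDK is used.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper: converts the "applyDeletes took N msec" value and the
// two log timestamps from this thread into human-readable durations.
final class ApplyDeletesDuration {
    // Formats a millisecond count as whole minutes plus remaining whole seconds.
    static String format(long millis) {
        Duration d = Duration.ofMillis(millis);
        long minutes = d.toMinutes();
        long seconds = d.getSeconds() % 60;
        return minutes + " min " + seconds + " s";
    }

    // Wall-clock seconds between two timestamps on the same day.
    static long secondsBetween(LocalTime start, LocalTime end) {
        return Duration.between(start, end).getSeconds();
    }

    public static void main(String[] args) {
        // Value reported by "applyDeletes took 3411845 msec".
        System.out.println(format(3411845L)); // prints "56 min 51 s"
        // Timestamps of the two BD log lines.
        System.out.println(secondsBetween(LocalTime.of(13, 42, 3), LocalTime.of(14, 38, 55))); // prints 3412
    }
}
```

Both routes agree to within a second (3411.845 s reported versus 3412 s between the log lines), so the whole gap between the two log entries was spent inside applyDeletes.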
>>>>>>
>>>>>> As the index gets bigger the time gets longer; currently it is 2.5 hours
>>>>>> of waiting.
>>>>>> I'm adding 96 million docs with unique ids: no duplicates, only adds, no
>>>>>> deletes.
>>>>>>
>>>>>> Any suggestions on which config is _really_ best for high performance
>>>>>> indexing?
>>>>>>
>>>>>> Best regards,
>>>>>> Bernd
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> --
>> *************************************************************
>> Bernd Fehling                    Bielefeld University Library
>> Dipl.-Inform. (FH)                LibTec - Library Technology
>> Universitätsstr. 25                  and Knowledge Management
>> 33615 Bielefeld
>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>>
> 

