lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: mergeFactor / indexing speed
Date Mon, 03 Aug 2009 19:05:08 GMT
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?
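
As a rough sanity check on the figures quoted below (70k documents in half an hour, 200k in total), here is a small back-of-the-envelope sketch. The function names are made up for illustration; only the numbers come from the thread.

```python
def throughput(docs, minutes):
    """Return (documents per second, milliseconds per document)."""
    seconds = minutes * 60.0
    return docs / seconds, 1000.0 * seconds / docs


def projected_minutes(total_docs, docs, minutes):
    """Linearly extrapolate the observed rate to the full data set."""
    return total_docs * minutes / docs


# Figures reported in the thread: 70k documents in 30 minutes, 200k overall.
docs_per_sec, ms_per_doc = throughput(70_000, 30)
full_run = projected_minutes(200_000, 70_000, 30)

print(f"{docs_per_sec:.1f} docs/s, {ms_per_doc:.1f} ms/doc")  # 38.9 docs/s, 25.7 ms/doc
print(f"{full_run:.1f} min for 200k docs")                    # 85.7 min for 200k docs
```

Roughly 26 ms per document is in the same range as the reported root entity query (about 20 ms) plus the child entity query (under 10 ms). One possible reading, and it is only a guess until profiling confirms it, is that the run is bound by per-document database round trips rather than by Lucene, which would also fit the nearly idle CPU.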

On Aug 3, 2009, at 12:32 PM, Chantal Ackermann wrote:

> Hi all,
>
> I'm still struggling with the index performance. I've now moved the
> indexer to a different machine, which is faster and less busy.
>
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so
> far. Which means 1.5 hours at least for 200k - which is as fast/slow
> as before (on the less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
> iostat
> Linux 2.6.9-67.ELsmp      08/03/2009
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that  
> from my own machine, and did only a ping from the linux box to the  
> db server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
>
> Thanks!
> Chantal
>
>
> Chantal Ackermann schrieb:
>> Hi again!
>>
>> Thanks for the answer, Grant.
>>
>> > It could very well be the case that you aren't seeing any merges  
>> with
>> > only 20K docs.  Ultimately, if you really want to, you can look in
>> > your data.dir and count the files.  If you have indexed a lot and  
>> have
>> > an MF of 100 and haven't done an optimize, you will see a lot more
>> > index files.
>>
>> Do you mean that 20k is not representative enough to test those
>> settings?
>> I've chosen the smaller data set so that the index can run completely
>> but doesn't take too long at the same time.
>> If it were faster to begin with, I could use a larger data set, of
>> course. I still can't believe that 11 minutes is normal (I haven't
>> managed to make it run faster or slower than that; that duration is
>> very stable).
>>
>> It "feels kinda" slow to me...
>> From your experience - what would you expect as duration for an
>> index with:
>> - 21 fields, some using a text type with 6 filters
>> - database access using DataImportHandler with a query of (far) less
>> than 20ms
>> - 2 transformers
>>
>> If I knew that indexing time should be shorter than that, at least, I
>> would know that something is definitely wrong with what I am doing or
>> with the environment I am using.
>>
>> > Likely, but not guaranteed.  Typically, larger merge factors are  
>> good
>> > for batch indexing, but a lot of that has changed with Lucene's new
>> > background merger, such that I don't know if it matters as much  
>> anymore.
>>
>> Ok. I also read some posting where it basically said that the default
>> parameters are ok. And one shouldn't mess around with them.
>>
>> The thing is that our current search setup uses Lucene directly,  
>> and the
>> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
>> fields are different, the complete setup is different. But it will be
>> hard to advertise a new implementation/setup where indexing is three
>> times slower - unless I can give some reasons why that is.
>>
>> The full index should be fairly fast because the backing data is
>> updated every few hours. I want to put in place an incremental/partial
>> update as main process, but full indexing might have to be done at
>> certain times if data has changed completely, or the schema has to be
>> changed/extended.
>>
>> > No, those are separate things.  The ramBufferSizeMB (although, I  
>> like
>> > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many  
>> docs
>> > Lucene holds in memory before it has to flush.  MF controls how  
>> many
>> > segments are on disk
>>
>> alas! the rum. I had that typo on the commandline before. that's my
>> subconscious telling me what I should do when I get home, tonight...
>>
>> So, increasing ramBufferSize should lead to higher memory usage,
>> shouldn't it? I'm not seeing that. :-(
>>
>> I'll try once more with MF 10 and a higher rum... well, you  
>> know... ;-)
>>
>> Cheers,
>> Chantal
>>
>> Grant Ingersoll schrieb:
>>> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to find out which settings give the best full index
>>>> performance for my setup.
>>>> Therefore, I have been running a small index (less than 20k
>>>> documents) with a mergeFactor of 10 and 100.
>>>> In both cases, indexing took about 11.5 min:
>>>>
>>>> mergeFactor: 10
>>>> <str name="Time taken ">0:11:46.792</str>
>>>> mergeFactor: 100
>>>> /admin/cores?action=RELOAD
>>>> <str name="Time taken ">0:11:44.441</str>
>>>> Tomcat restart
>>>> <str name="Time taken ">0:11:34.143</str>
>>>>
>>>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But  
>>>> it
>>>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM,  
>>>> old
>>>> ATA disk).
>>>>
>>>>
>>>> Now, I have three questions:
>>>>
>>>> 1. How can I check which mergeFactor is really being used? The
>>>> solrconfig.xml that is displayed in the admin application is the
>>>> up-to-date view on the file system. I tested that. But it's not
>>>> necessarily what the current SOLR core is using, is it?
>>>> Is there a way to check the mergeFactor actually in use (while the
>>>> index is running)?
>>> It could very well be the case that you aren't seeing any merges  
>>> with
>>> only 20K docs.  Ultimately, if you really want to, you can look in
>>> your data.dir and count the files.  If you have indexed a lot and  
>>> have
>>> an MF of 100 and haven't done an optimize, you will see a lot more
>>> index files.
>>>
>>>
>>>> 2. I changed the mergeFactor in both available settings (default
>>>> and main index) in the solrconfig.xml file of the core I am
>>>> reindexing. Is that the correct place? Should a change in
>>>> performance be noticeable when increasing from 10 to 100? Or is the
>>>> change not perceivable if the requests for data are taking far
>>>> longer than all the indexing itself?
>>> Likely, but not guaranteed.  Typically, larger merge factors are  
>>> good
>>> for batch indexing, but a lot of that has changed with Lucene's new
>>> background merger, such that I don't know if it matters as much  
>>> anymore.
>>>
>>>
>>>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
>>>> (Or some other setting?)
>>> No, those are separate things.  The ramBufferSizeMB (although, I  
>>> like
>>> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many  
>>> docs
>>> Lucene holds in memory before it has to flush.  MF controls how many
>>> segments are on disk
>>>
>>>> (I am still trying to get profiling information on how much
>>>> application time is eaten up by db connection/requests/processing.
>>>> The root entity query is about (average) 20ms. The child entity
>>>> query is less than 10ms.
>>>> I have my custom entity processor running on the child entity that
>>>> populates the map using a multi-row result set. I have also  
>>>> attached
>>>> one regex and one script transformer.)
>>>>
>>>> Thank you for any tips!
>>>> Chantal
>>>>
>>>>
>>>>
>>>> --
>>>> Chantal Ackermann
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>
>
>
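
The earlier suggestion to look in the data dir and count the files could be sketched roughly as follows. This is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes that per-segment files start with an underscore followed by a base-36 counter (for example `_0.fdt` or `_1.cfs`), while files such as `segments_2` are not tied to a single segment.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: per-segment files look like "_<base36>.<ext>" or
# "_<base36>_<generation>.<ext>"; everything else is ignored.
SEGMENT_FILE = re.compile(r"^(_[0-9a-z]+)[._]")


def count_segments(index_dir):
    """Return (total file count, Counter mapping segment name -> file count)."""
    names = [p.name for p in Path(index_dir).iterdir() if p.is_file()]
    per_segment = Counter()
    for name in names:
        match = SEGMENT_FILE.match(name)
        if match:
            per_segment[match.group(1)] += 1
    return len(names), per_segment
```

Calling `count_segments` on the index directory before and after a run (typically the `index` folder under the core's data dir, though the exact path depends on the setup) and comparing `len(per_segment)` gives a rough idea of whether a higher mergeFactor is really leaving more segments on disk.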

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

