lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: mergeFactor / indexing speed
Date Mon, 03 Aug 2009 16:56:33 GMT
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is some initial
feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  I'd go back
to default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to the machine
and check its CPU load, memory usage, and disk I/O.

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?
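
A rough sanity check on the numbers quoted below, as a Python sketch. It assumes
the root and child entity queries run one document at a time with no overlap,
which the thread does not confirm, and that each document needs one child query.
The function names are made up for illustration.

```python
# Back-of-the-envelope check using only figures quoted in this thread.

def ms_per_doc(docs, seconds):
    """Average wall-clock milliseconds spent per indexed document."""
    return seconds * 1000.0 / docs


def hours_for(docs, docs_done, seconds_done):
    """Projected hours for `docs` documents at the observed rate."""
    return docs * (seconds_done / docs_done) / 3600.0


# Small run: just under 20k docs in about 11.5 minutes.
small_run = ms_per_doc(20_000, 11.5 * 60)
# New machine: roughly 70k docs in half an hour.
big_run = ms_per_doc(70_000, 30 * 60)
# Reported query latencies: root entity ~20 ms, child entity < 10 ms.
# ASSUMPTION: queries are serial, one root and one child query per document.
db_upper_bound_ms = 20 + 10

print(f"small run: {small_run:.1f} ms/doc")
print(f"big run:   {big_run:.1f} ms/doc")
print(f"200k docs: {hours_for(200_000, 70_000, 30 * 60):.2f} h")
print(f"DB round trips (assumed serial): up to ~{db_upper_bound_ms} ms/doc")
```

That works out to about 34.5 ms per document for the small run and about 25.7 ms
for the larger one, both close to the combined query latency. If the assumption
holds, most of the time may be going to waiting on the database rather than to
merging, which would also fit a mostly idle CPU (bearing in mind that a bare
`iostat` reports averages since boot, not current load).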


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Chantal Ackermann <chantal.ackermann@btelligent.de>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Monday, August 3, 2009 12:32:12 PM
> Subject: Re: mergeFactor / indexing speed
> 
> Hi all,
> 
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
> 
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
> 
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far. 
> Which means 1.5 hours at least for 200k - which is as fast/slow as
> before (on the less performant machine).
> 
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>   iostat
> Linux 2.6.9-67.ELsmp      08/03/2009
> 
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>             1.23    0.00    0.03    0.03   98.71
> 
> Basically, it is doing very little? *scratch*
> 
> The sourcing database is responding as fast as ever. (I checked that 
> from my own machine, and did only a ping from the linux box to the db 
> server.)
> 
> Any help, any hint on where to look would be greatly appreciated.
> 
> 
> Thanks!
> Chantal
> 
> 
> Chantal Ackermann wrote:
> > Hi again!
> >
> > Thanks for the answer, Grant.
> >
> >  > It could very well be the case that you aren't seeing any merges with
> >  > only 20K docs.  Ultimately, if you really want to, you can look in
> >  > your data.dir and count the files.  If you have indexed a lot and have
> >  > an MF of 100 and haven't done an optimize, you will see a lot more
> >  > index files.
> >
> > Do you mean that 20k is not representative enough to test those settings?
> > I've chosen the smaller data set so that the index can run completely
> > but doesn't take too long at the same time.
> > If it would be faster to begin with, I could use a larger data set, of
> > course. I still can't believe that 11 minutes is normal (I haven't
> > managed to make it run faster or slower than that, that duration is very
> > stable).
> >
> > It "feels kinda" slow to me...
> > From your experience - what would you expect as duration for an index
> > with:
> > - 21 fields, some using a text type with 6 filters
> > - database access using DataImportHandler with a query of (far) less
> > than 20ms
> > - 2 transformers
> >
> > If I knew that indexing time should be shorter than that, at least, I
> > would know that something is definitely wrong with what I am doing or
> > with the environment I am using.
> >
> >  > Likely, but not guaranteed.  Typically, larger merge factors are good
> >  > for batch indexing, but a lot of that has changed with Lucene's new
> >  > background merger, such that I don't know if it matters as much anymore.
> >
> > Ok. I also read some posting where it basically said that the default
> > parameters are ok. And one shouldn't mess around with them.
> >
> > The thing is that our current search setup uses Lucene directly, and the
> > indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> > fields are different, the complete setup is different. But it will be
> > hard to advertise a new implementation/setup where indexing is three
> > times slower - unless I can give some reasons why that is.
> >
> > The full index should be fairly fast because the backing data is updated
> > every few hours. I want to put in place an incremental/partial update as
> > main process, but full indexing might have to be done at certain times
> > if data has changed completely, or the schema has to be changed/extended.
> >
> >  > No, those are separate things.  The ramBufferSizeMB (although, I like
> >  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> >  > Lucene holds in memory before it has to flush.  MF controls how many
> >  > segments are on disk
> >
> > alas! the rum. I had that typo on the commandline before. that's my
> > subconscious telling me what I should do when I get home, tonight...
> >
> > So, increasing ramBufferSize should lead to higher memory usage,
> > shouldn't it? I'm not seeing that. :-(
> >
> > I'll try once more with MF 10 and a higher rum... well, you know... ;-)
> >
> > Cheers,
> > Chantal
> >
> > Grant Ingersoll wrote:
> >> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
> >>
> >>> Dear all,
> >>>
> >>> I want to find out which settings give the best full index
> >>> performance for my setup.
> >>> Therefore, I have been running a small index (less than 20k
> >>> documents) with a mergeFactor of 10 and 100.
> >>> In both cases, indexing took about 11.5 min:
> >>>
> >>> mergeFactor: 10
> >>> 0:11:46.792
> >>> mergeFactor: 100
> >>> /admin/cores?action=RELOAD
> >>> 0:11:44.441
> >>> Tomcat restart
> >>> 0:11:34.143
> >>>
> >>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
> >>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
> >>> ATA disk).
> >>>
> >>>
> >>> Now, I have three questions:
> >>>
> >>> 1. How can I check which mergeFactor is really being used? The
> >>> solrconfig.xml that is displayed in the admin application is the up-
> >>> to-date view on the file system. I tested that. But it's not
> >>> necessarily what the current SOLR core is using, is it?
> >>> Is there a way to check on the actually used mergeFactor (while the
> >>> index is running)?
> >> It could very well be the case that you aren't seeing any merges with
> >> only 20K docs.  Ultimately, if you really want to, you can look in
> >> your data.dir and count the files.  If you have indexed a lot and have
> >> an MF of 100 and haven't done an optimize, you will see a lot more
> >> index files.
> >>
> >>
> >>> 2. I changed the mergeFactor in both available settings (default and
> >>> main index) in the solrconfig.xml file of the core I am reindexing.
> >>> Is that the correct place? Should a change in performance be
> >>> noticeable when increasing from 10 to 100? Or is the change not
> >>> perceivable if the requests for data are taking far longer than all
> >>> the indexing itself?
> >> Likely, but not guaranteed.  Typically, larger merge factors are good
> >> for batch indexing, but a lot of that has changed with Lucene's new
> >> background merger, such that I don't know if it matters as much anymore.
> >>
> >>
> >>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
> >>> (Or some other setting?)
> >> No, those are separate things.  The ramBufferSizeMB (although, I like
> >> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> >> Lucene holds in memory before it has to flush.  MF controls how many
> >> segments are on disk
> >>
> >>> (I am still trying to get profiling information on how much
> >>> application time is eaten up by db connection/requests/processing.
> >>> The root entity query is about (average) 20ms. The child entity
> >>> query is less than 10ms.
> >>> I have my custom entity processor running on the child entity that
> >>> populates the map using a multi-row result set. I have also attached
> >>> one regex and one script transformer.)
> >>>
> >>> Thank you for any tips!
> >>> Chantal
> >>>
> >>>
> >>>
> >>> --
> >>> Chantal Ackermann
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >> using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
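
Grant's suggestion in the quoted thread above (look in the data dir and count the
files) could be sketched in Python along these lines. The `data/index` default is
a hypothetical path, so pass the core's real index directory as an argument.
Grouping by file extension is just one convenient way to read the result.

```python
import os
import sys
from collections import Counter


def count_index_files(index_dir):
    """Count the files in an index directory, grouped by extension.

    Files without an extension (e.g. segments_N) are counted under
    their full name.
    """
    counts = Counter()
    for name in os.listdir(index_dir):
        if os.path.isfile(os.path.join(index_dir, name)):
            ext = os.path.splitext(name)[1] or name
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # ASSUMPTION: "data/index" is a placeholder, not a path from the thread.
    index_dir = sys.argv[1] if len(sys.argv) > 1 else "data/index"
    if os.path.isdir(index_dir):
        counts = count_index_files(index_dir)
        for ext, n in sorted(counts.items()):
            print(f"{ext}: {n}")
        print(f"total files: {sum(counts.values())}")
    else:
        print(f"not a directory: {index_dir}")
```

Running it before and after a change to mergeFactor (without an optimize in
between) should show whether the file count actually moves, which is an indirect
hint that the new setting has been picked up.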

