lucene-solr-user mailing list archives

From Uwe Reh <...@hebis.uni-frankfurt.de>
Subject Re: indexing cpu utilization
Date Wed, 02 Jan 2013 21:39:22 GMT
Hi,

while trying to optimize our indexing workflow I hit the same limit that
Gabriel Shen described in his mail: my Solr server won't use more than
40% of the available computing power.
I ran some tests, but I'm not able to find the bottleneck. Could
anybody help me solve this puzzle?

First, let me describe the environment:

Server:
- Two-socket Opteron (Interlagos) => 32 cores
- 64 GB RAM (1600 MHz)
- SATA disks: spindle and SSD
- Solaris 5.11
- JRE 1.7.0
- Solr 4.0
- Application server: Jetty
- 1 Gbit network interface

Client:
- same hardware as the server
- either a multi-threaded SolrJ client using multiple instances of
HttpSolrServer
- or a multi-threaded SolrJ client using a ConcurrentUpdateSolrServer
with 100 threads

Problem:
- 10,000,000 docs of bibliographic data (~4k each)
- with a simplified schema definition it takes 10 hours to index <=>
~250 docs/second
- with the real schema.xml it takes 50 hours to index <=> ~50 docs/second
In both cases the client uses just 2% of the CPU resources and the
server 35%. Obviously there is some optimization potential in the
schema definition, but why does the server never use more than 40% of
the CPU power?
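
For reference, the throughput figures follow directly from the document
count and the run time. The sketch below only redoes that arithmetic; the
class and method names are illustrative, not part of Solr or SolrJ. The
exact quotients (about 278 and 56 docs/second) come out slightly above
the rounded numbers quoted above.

```java
public class IndexingThroughput {

    /** Average number of documents indexed per second over a run of the given length. */
    public static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long docs = 10_000_000L; // document count from the mail

        // 10 hours with the simplified schema, 50 hours with the real schema.xml
        System.out.printf("simplified schema: %.0f docs/s%n", docsPerSecond(docs, 10));
        System.out.printf("real schema:       %.0f docs/s%n", docsPerSecond(docs, 50));
    }
}
```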


Bottlenecks I have ruled out:
- RAM for the JVM
Solr uses at most 12G of heap and there is only negligible GC
activity. Increasing the maximum heap from 16G to 32G made no
difference.
- Network bandwidth
The transmitted data is identical in both cases, somewhat below 50G in
total. Since both machines have a dedicated 1G line to the switch, the
raw transmission should not take much more than 10 minutes.
- Client performance
As above, the client is fast enough for the simplified case (10h). A
dry run (just preprocessing, no indexing) finishes in about 75 minutes.
- Server disk I/O
The simpler index is ~100G, the other one ~150G. That is a factor of
1.5, not 5. There is no noticeable difference between an SSD and a
spinning disk. The output of 'iostat' and 'zpool iostat' looks
unremarkable.
- Bad thread distribution
'mpstat' shows the load well distributed over all CPUs and a reasonable
number of cross-calls (fewer than ten per CPU).
- Solr update parameters (solrconfig.xml)
Inspired by
 >http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1
I'm using:
> <ramBufferSizeMB>256</ramBufferSizeMB>
> <mergeFactor>40</mergeFactor>
> <termIndexInterval>1024</termIndexInterval>
> <lockType>native</lockType>
> <unlockOnStartup>true</unlockOnStartup>
Any change to these parameters made it worse.
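
The network estimate in the list above can be checked with a few lines of
arithmetic. This is only a sketch under two assumptions that the mail
does not spell out: that "50G" means 50 * 10^9 bytes and that the "1G
line" is 1 Gbit/s. Protocol overhead is ignored, and the class name is
made up for the example.

```java
public class TransferTime {

    /** Seconds needed to move the given number of bytes over a link, ignoring protocol overhead. */
    public static double seconds(double bytes, double bitsPerSecond) {
        return bytes * 8.0 / bitsPerSecond;
    }

    public static void main(String[] args) {
        double payloadBytes = 50e9;     // assumption: "50G" = 50 * 10^9 bytes
        double linkBitsPerSecond = 1e9; // assumption: "1G line" = 1 Gbit/s

        double minutes = seconds(payloadBytes, linkBitsPerSecond) / 60.0;
        System.out.printf("raw transfer time: %.1f minutes%n", minutes);
    }
}
```

Under these assumptions the raw transfer takes 400 seconds, a bit under
7 minutes, so it is far too short to explain a 10 hour or 50 hour run.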

To get an idea of what's going on, I collected some statistics with
VisualVM (see attachment).
The distribution of real (wall-clock) time and CPU time looks
significant, but I'm not able to interpret the results.
The method
org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock()
is active 80% of the time but takes only 1% of the CPU time. On the
other hand, the second method
org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append()
is active 12% of the time and is always running on a CPU.
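
As a generic illustration of how a method can show a large share of
wall-clock time but almost no CPU time: a thread that waits (for a lock,
a free slot, I/O) accumulates wall-clock time without consuming CPU,
while a thread that computes accumulates both. The sketch below
demonstrates only this difference with the standard ThreadMXBean; it
says nothing about what getAndLock() actually does on this server, and
the class and the two tasks are invented for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class WallVsCpu {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Runs the task on the calling thread and returns {wallNanos, cpuNanos}. */
    public static long[] measure(Runnable task) {
        long wallStart = System.nanoTime();
        long cpuStart = THREADS.getCurrentThreadCpuTime();
        task.run();
        long cpu = THREADS.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return new long[] { wall, cpu };
    }

    /** Stand-in for a thread that waits: 200 ms of wall-clock time, almost no CPU. */
    public static final Runnable WAITING = new Runnable() {
        public void run() {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    };

    /** Stand-in for a thread that computes for about 200 ms. */
    public static final Runnable COMPUTING = new Runnable() {
        public void run() {
            long end = System.nanoTime() + 200_000_000L;
            long x = 0;
            while (System.nanoTime() < end) {
                x += x * 31 + 1;
            }
            if (x == 42) {
                System.out.println(x); // keeps the loop from being optimized away
            }
        }
    };

    public static void main(String[] args) {
        long[] w = measure(WAITING);
        long[] c = measure(COMPUTING);
        System.out.printf("waiting:   wall %d ms, cpu %d ms%n", w[0] / 1_000_000, w[1] / 1_000_000);
        System.out.printf("computing: wall %d ms, cpu %d ms%n", c[0] / 1_000_000, c[1] / 1_000_000);
    }
}
```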

So again the question: when there are free resources in all dimensions,
why does Solr not use more than 40% of the computing power?
RAM bandwidth? I can't believe that. And how would I verify it?

Any hints are welcome.
Uwe






