lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chantal Ackermann <>
Subject Re: mergeFactor / indexing speed
Date Thu, 06 Aug 2009 16:45:51 GMT
Hi all,

to keep this thread up to date... ;-)

d) jdbc batch size
changed to 10. (Was default: 500, then 1000)

The problem with my dih setup is that the root entity query returns a 
huge set (all ids that shall be indexed). A larger fetchsize would be 
good for that query.
The nested entity, however, returns only up 9 rows, ever. The 
constraints are so strict (by id) that there is no way that any 
additional data could be pre-fetched.
(Actually, anynone using DIH with nested entities should run into that 

After changing to 10, I cannot see that this low batch size slowed the 
indexer down (significantly).

As I would like to stick with DIH (instead of dumping the data into CSV 
and import it then) here is my question:

Do you think it's possible to return (in the nested entity) rows 
independent of the unique id, and let the processor decide when a 
document is complete?
The examples in the wiki always use an ID to get the data for the nested 
entity, so I'm not sure it was planned with that in mind. But as I'm 
already handling multiple db rows for one document, it might not be too 
difficult to change to handling the unique id correctly, as well?
Of course, I would need something like a look ahead to know whether the 
next row is already part of the next document.


Concerning the other settings (just fyi):

a) mergeFactor 10 (and also tried 100)
I don't think that changed anything to the worse, rather to the better. 
So, I'll stick with 10 from now on.

b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not 
sure about 1024. I'll stick to 512.

View raw message