lucene-solr-user mailing list archives

From Chantal Ackermann <>
Subject Re: mergeFactor / indexing speed
Date Tue, 04 Aug 2009 06:47:45 GMT
Hi Avlesh,
hi Otis,
hi Grant,
hi all,

(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.

b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.
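For reference, both knobs live in solrconfig.xml; a sketch with example values (256 MB is only an illustration, not a recommendation):

```xml
<indexDefaults>
  <!-- back to the default merge factor -->
  <mergeFactor>10</mergeFactor>
  <!-- buffer up to this many MB of documents in RAM before flushing a segment -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>
<!-- mirror the same values in the <mainIndex> section -->
```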

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.
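For the record, the JDBC fetch size is configured via the batchSize attribute on the DIH dataSource element (driver and url below are placeholders, not my actual connection; MySQL users typically need batchSize="-1" to make the driver stream rows):

```xml
<dataSource type="JdbcDataSource"
            driver="some.jdbc.Driver"
            url="jdbc:..."
            user="user" password="pw"
            batchSize="500"/>
```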

e) DB server performance:
I agree, ping is definitely not much information. I also ran queries 
against it from my own computer (while the indexer was running), and 
they came back as fast as usual.
Currently, I don't have an SSH login for that machine, but I'm going 
to try to get one.

f) Network:
I'll definitely need to have a look at that once I have access to the 
db server.

g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) for one query. (Fetched rows is 
about 10 times the number of processed documents.)
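For context, the DIH configuration has this shape (entity names, the queries, and the ${...} variable are placeholders, not the actual configuration):

```xml
<document>
  <!-- root entity: one row per Solr document -->
  <entity name="program" query="SELECT ... FROM PROGRAM">
    <!-- nested entity: ~10 rows per root row, pivoted by the custom processor -->
    <entity name="epgValue" processor="EpgValueEntityProcessor"
            query="SELECT ... WHERE ID_EPG_DEFINITION = '${program.ID}'"/>
  </entity>
</document>
```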

g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (a String; 
when a subvalue is present, value and subvalue are joined with "|"),
- if a key already exists, it gets the existing value: if that value is 
a list, it adds the new value to it; if not, it creates a list and adds 
both the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that may appear 
multiple times or values that must appear only once.
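A minimal sketch of that pivot behavior in plain Java (independent of DIH; class, method, and sample values are illustrative only):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PivotSketch {
    // Add a value under a key; promote the entry to a List on the
    // second occurrence, keeping the old value first.
    @SuppressWarnings("unchecked")
    static void addValue(Map<String, Object> row, String key, Object newValue) {
        Object existing = row.get(key);
        if (existing == null) {
            row.put(key, newValue);                  // first occurrence: plain value
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(newValue); // already a list: append
        } else {
            List<Object> values = new ArrayList<>();
            values.add(existing);                    // promote: old value first,
            values.add(newValue);                    // then the new one
            row.put(key, values);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        addValue(row, "person", "Alice|actor");
        addValue(row, "person", "Bob|director");
        System.out.println(row.get("person")); // a two-element list
    }
}
```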

g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*"/>
<field column="role" sourceColName="person" regex="[^\|]+\|(.*)"/>
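The participant regex given above behaves like this in plain Java (the sample value "John Doe|actor" is made up; DIH's RegexTransformer applies the pattern for you, this is only an illustration of the capture group):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {
    // Same pattern as the participant field: capture everything
    // before the first '|' and ignore the rest.
    static String participant(String person) {
        Matcher m = Pattern.compile("([^\\|]+)\\|.*").matcher(person);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(participant("John Doe|actor")); // prints "John Doe"
    }
}
```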

- to extract a number from an existing number (bit calculation 
using the script transformer). As that one works on a field that is 
potentially multiValued, it needs to take care of creating and 
populating a list as well.
<field column="cat" name="cat" />
function getMainCategory(row) {
	var cat = row.get('cat');
	var mainCat;
	if (cat != null) {
		// check whether cat is an array
		if (cat instanceof java.util.List) {
			var arr = new java.util.ArrayList();
			for (var i = 0; i < cat.size(); i++) {
				mainCat = new java.lang.Integer(cat.get(i) >> 8);
				if (!arr.contains(mainCat)) {
					arr.add(mainCat); // line truncated in the archive; reconstructed
				}
			}
			row.put('maincat', arr);
		} else { // it is a single value
			mainCat = new java.lang.Integer(cat >> 8);
			row.put('maincat', mainCat);
		}
	}
	return row;
}
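The bit calculation itself is just a right shift; in Java terms (the sample value is made up, assuming the main category sits in the bits above the low byte):

```java
public class MainCategory {
    // Drop the low byte; the remaining bits are the main category.
    static int mainCategory(int cat) {
        return cat >> 8;
    }

    public static void main(String[] args) {
        System.out.println(mainCategory(0x0305)); // 0x03 -> prints 3
    }
}
```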
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)

g.4) fields
the DIH extracts 5 fields from the root entity and 11 fields from the 
nested entity, and the transformers may create 3 additional 
(multiValued) fields.
schema.xml defines 21 fields, the two additional ones being the 
timestamp field (default="NOW") and a field that collects three other 
text fields for the default search (via copyField):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
generateWordParts="0" generateNumberParts="0" catenateWords="0" 
catenateNumbers="0" catenateAll="0" />
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LengthFilterFactory" min="2" max="5000" />
<filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

Thank you for taking the time!

************** *******************

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class EpgValueEntityProcessor extends SqlEntityProcessor {
	private static final Logger log = Logger
			.getLogger(EpgValueEntityProcessor.class.getName());
	// attribute name truncated in the archive; reconstructed from the
	// naming pattern of the constants below
	private static final String ATTR_ID_EPG_DEFINITION = "columnIdEpgDefinition";
	private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
	private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
	private static final String ATTR_COLUMN_EPG_SUBVALUE = "columnEpgSubvalue";
	private static final String DEF_ATT_NAME = "ATT_NAME";
	private static final String DEF_EPG_VALUE = "EPG_VALUE";
	private static final String DEF_EPG_SUBVALUE = "EPG_SUBVALUE";
	private static final String DEF_ID_EPG_DEFINITION = "ID_EPG_DEFINITION";
	private String colIdEpgDef = DEF_ID_EPG_DEFINITION;
	private String colAttName = DEF_ATT_NAME;
	private String colEpgValue = DEF_EPG_VALUE;
	private String colEpgSubvalue = DEF_EPG_SUBVALUE;

	public void init(Context context) {
		super.init(context);
		colIdEpgDef = context.getEntityAttribute(ATTR_ID_EPG_DEFINITION);
		colAttName = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME);
		colEpgValue = context.getEntityAttribute(ATTR_COLUMN_EPG_VALUE);
		colEpgSubvalue = context.getEntityAttribute(ATTR_COLUMN_EPG_SUBVALUE);
	}

	public Map<String, Object> nextRow() {
		if (rowcache != null)
			return getFromRowCache();
		if (rowIterator == null) {
			String q = getQuery();
			// reconstructed: run the query and populate rowIterator
			initQuery(context.replaceTokens(q));
		}
		Map<String, Object> pivottedRow = new HashMap<String, Object>();
		Map<String, Object> epgValue;
		String attName, value, subvalue;
		Object existingValue, newValue;
		String id = null;
		// return null once the end of that data set is reached
		if (!rowIterator.hasNext()) {
			rowIterator = null;
			return null;
		}
		// as long as there is data, iterate over the rows and pivot them;
		// return the pivotted row after the last row of data has been reached
		do {
			epgValue = rowIterator.next();
			id = epgValue.get(colIdEpgDef).toString();
			assert id != null;
			if (pivottedRow.containsKey(colIdEpgDef)) {
				assert id.equals(pivottedRow.get(colIdEpgDef));
			} else {
				pivottedRow.put(colIdEpgDef, id);
			}
			attName = (String) epgValue.get(colAttName);
			if (attName == null) {
				log.warning("No value returned for attribute name column "
						+ colAttName);
				continue; // skip rows without an attribute name
			}
			value = (String) epgValue.get(colEpgValue);
			subvalue = (String) epgValue.get(colEpgSubvalue);

			// create a single object for value and subvalue:
			// if subvalue is not set, use value only, otherwise join both
			if (subvalue == null || subvalue.trim().length() == 0) {
				newValue = value;
			} else {
				newValue = value + "|" + subvalue;
			}

			// if there is already an entry for that attribute, extend
			// the existing value
			if (pivottedRow.containsKey(attName)) {
				existingValue = pivottedRow.get(attName);
//				newValue = existingValue + " " + newValue;
//				pivottedRow.put(attName, newValue);
				if (existingValue instanceof List) {
					((List) existingValue).add(newValue);
				} else {
					ArrayList v = new ArrayList();
					Collections.addAll(v, existingValue, newValue);
					pivottedRow.put(attName, v);
				}
			} else {
				pivottedRow.put(attName, newValue);
			}
		} while (rowIterator.hasNext());
		pivottedRow = applyTransformer(pivottedRow);
		return pivottedRow;
	}
}
