nutch-dev mailing list archives

From "Alan Tanaman"
Subject RE: Creating Lucence Compound Index
Date Tue, 02 Jan 2007 13:34:56 GMT
True, but we are less than typical.  ;)  Seriously though, we are using
Nutch to consolidate many small enterprise sources of varying shapes and
sizes, which means many indexes (even when we merge as many together as
possible).  Others using Nutch for internal crawling in the enterprise may
face the same challenges.

We are at the edge of the acceptable limit on open files, because our
enterprise implementations are in a somewhat unusual situation:

* Each index has 20 fields (on average - some have 50! - but let's say 20)
* We have up to 30 indexes built on one machine, including helper indexes

Assuming a worst-case situation of 9 unmerged index-segments, we will get:
30 * 9 * (7 + 20) = 7,290 open files

Whereas with compound, it would be:
30 * 9 = 270 open files
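
The two estimates above can be reproduced with a small sketch. The class and
method names here are hypothetical, chosen only for illustration. The
constant 7 is taken directly from the figure above and is assumed to stand
for the fixed per-segment files of a multifile index, with one further file
per field; the compound case is assumed to need one file per segment.

```java
// Rough open-file estimate for Lucene indexes hosted on one machine.
// Hypothetical helper, not part of Nutch or Lucene.
final class OpenFileEstimate {
    // Assumption: 7 fixed files per multifile segment (the "7" in the figure above).
    static final int FIXED_FILES_PER_SEGMENT = 7;

    // Multifile format: fixed files plus one file per field, for every segment.
    static int multifile(int indexes, int segmentsPerIndex, int fieldsPerIndex) {
        return indexes * segmentsPerIndex * (FIXED_FILES_PER_SEGMENT + fieldsPerIndex);
    }

    // Compound format: assumed to be a single file per segment.
    static int compound(int indexes, int segmentsPerIndex) {
        return indexes * segmentsPerIndex;
    }

    public static void main(String[] args) {
        System.out.println("multifile: " + multifile(30, 9, 20)); // 7290
        System.out.println("compound:  " + compound(30, 9));      // 270
    }
}
```

Plugging in 50 fields instead of 20 shows how quickly the multifile count
grows for the larger indexes.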

We are currently considering making our use of the indexer incremental
(adding a few changed files to the existing index instead of creating a
new one).  As a result, indexes will not always be optimized, so each
index will often contain plenty of segments.

Agreed on the performance degradation (estimated at 5-10% by Gospodnetic
and Hatcher), which affects only indexing time, not search time.  We would
state this as a clear caveat in the conf file.

We would rather have the incremental indexing process be a little slower
(our big performance problem is in parsing anyway) and the file system
work be a little more manageable.
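
A minimal sketch of what such a switch might look like follows. It is only
an illustration under stated assumptions: the property name
"indexer.use.compound.file" is hypothetical and not an existing Nutch
setting, java.util.Properties stands in for the Nutch configuration object,
and the WriterLike interface stands in for the single IndexWriter method
involved, setUseCompoundFile(boolean).

```java
import java.util.Properties;

// Hypothetical sketch of a configurable compound-format switch.
final class CompoundFormatSwitch {
    // Hypothetical property name; not an existing Nutch setting.
    static final String KEY = "indexer.use.compound.file";

    // Stand-in for the one IndexWriter method this sketch needs.
    interface WriterLike {
        void setUseCompoundFile(boolean value);
    }

    // Defaults to false, which keeps today's multifile behaviour
    // when the property is not set.
    static boolean useCompound(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "false"));
    }

    // Would be called wherever a writer is opened (indexing, sorting, merging).
    static void configure(WriterLike writer, Properties conf) {
        writer.setUseCompoundFile(useCompound(conf));
    }
}
```

Defaulting to false means existing installations would see no change unless
they opt in.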

Are there any objections?

Best regards,
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Andrzej Bialecki
Sent: 02 January 2007 13:07
Subject: Re: Creating Lucence Compound Index

Alan Tanaman wrote:
> Currently Nutch creates a Lucene multifile index, and makes sure any
> existing compound index is converted to multifile by using the
> IndexWriter.setUseCompoundFile(false) method.
> This is done whenever an IndexWriter is opened in the following methods:
> org.apache.nutch.indexer.Indexer.getRecordWriter
> org.apache.nutch.indexer.IndexSorter.sort
> org.apache.nutch.indexer.IndexMerger.merge
> Is there a technical constraint as to why Nutch should ensure usage of
> multifile (or prevent compound) and not allow the type to be set by a
> property setting?
> Does anyone object to/support a patch to allow this to be configurable?

Multifile indexes are somewhat faster, and require much less temporary 
space during indexing. Why would you want to use the compound format 
with Nutch? In typical use of Nutch you work with a single index, or at 
most a few, per machine. In that case a regular non-compound index works 
better, and there is no danger of running out of file handles.

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
