lucene-java-user mailing list archives

From Chris Conrad
Subject Re: Optimal index structure
Date Wed, 26 Jan 2005 01:42:09 GMT

On Jan 25, 2005, at 5:29 PM, Tea Yu wrote:

>   How many total documents will be there?  I'll opt for a single index if
> search in "all categories" meets the performance target, else you may want
> to consider distributed searchers.  arguments for a single index:

Fortunately, there is no need for an all-categories search.  I won't be 
searching across categories, just across document types.  In total, 
there will be roughly 15,000,000 documents across about 100,000 
categories.  But, again, the distribution across categories is very 
uneven.  Some categories will have only 5 or so documents, while 
others will have over 100,000.

>   1) all doc scores will have to be calculated anyway leveraging Searcher or
> (Parallel)MultiSearcher which should be most expensive (with a slight
> overhead to aggregate and sort the docs in the latter)
>   2) you'll most likely want to aggregate N categories into an index anyway
> to avoid having too many opened files

I am concerned about the number of concurrent open files, but I think 
that may be mitigated since some categories will receive virtually no 
searches (since they have very few documents or those documents are 
mostly very old).  I would say that the number of categories searched 
frequently will be under 5000.  I was thinking of using an LRU cache of 
open indexes which would keep the number of open files under control 
and ensure that frequently used indexes are quickly available.
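
A minimal sketch of that LRU idea, assuming one open handle per 
category.  OpenIndexCache and the opener function are hypothetical 
names of mine, not Lucene classes; in practice the handle would be 
whatever searcher or reader object gets opened per category index.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** LRU cache of open per-category index handles (hypothetical helper, not part of Lucene). */
class OpenIndexCache<H extends Closeable> {
    private final int maxOpen;
    private final Function<String, H> opener;
    private final LinkedHashMap<String, H> handles;

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.handles = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() <= OpenIndexCache.this.maxOpen) {
                    return false;
                }
                // Over the limit: close the least recently used handle, then drop it.
                closeHandle(eldest.getValue());
                return true;
            }
        };
    }

    /** Returns the cached handle for a category, opening it on a miss. */
    synchronized H get(String category) {
        H handle = handles.get(category);
        if (handle == null) {
            handle = opener.apply(category);
            handles.put(category, handle);
        }
        return handle;
    }

    /** Number of handles currently held open. */
    synchronized int openCount() {
        return handles.size();
    }

    private static void closeHandle(Closeable handle) {
        try {
            handle.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One thing this sketch does not handle: an evicted handle is closed 
immediately, even if another thread is still searching it.  A real 
version would probably need some form of reference counting before 
closing.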

>   3) most of the time will be spent in context switching if having too many
> indexes searched in parallel

I will be limiting the number of search threads to 4-12 (this will be 
running on a dedicated quad Xeon, most likely).
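
Roughly what I mean by limiting the threads, as a sketch using a 
fixed-size pool from java.util.concurrent.  BoundedSearchPool is a 
hypothetical name, and each Callable stands in for one search against 
one category index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Runs searches on a fixed number of worker threads (hypothetical helper). */
class BoundedSearchPool implements AutoCloseable {
    private final ExecutorService workers;

    BoundedSearchPool(int searchThreads) {
        // At most searchThreads searches run at once; the rest wait in the queue.
        this.workers = Executors.newFixedThreadPool(searchThreads);
    }

    /** Runs all searches and returns their results in submission order. */
    <T> List<T> runAll(List<Callable<T>> searches)
            throws InterruptedException, ExecutionException {
        List<Future<T>> futures = workers.invokeAll(searches);
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            results.add(future.get());
        }
        return results;
    }

    @Override
    public void close() {
        workers.shutdown();
    }
}
```

The pool size would be the 4-12 figure above, tuned by measurement on 
the actual hardware.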

>   an alternative will be to optimize the structure based on usage pattern,
> e.g. having 1 full category index and several sub-categories indexes, if
> reindexing is not a problem

Re-indexing will be an issue since it looks like it will take on the 
order of 3-4 days to index everything.

Thanks for your input.
