lucene-solr-user mailing list archives

From Nicolae Mihalache <xproma...@gmail.com>
Subject faceted search cache and optimisations
Date Mon, 03 Aug 2009 08:45:30 GMT
Hello,

I'm using faceted search (perhaps in a dumb way) to collect some statistics
for my index. I have documents in various languages, one of the fields is
"language", and I simply want to see how many documents I have for each
language. I have noticed that the search builds an int[maxDoc] array and then
traverses the array to count. With facet.method=enum (which I discovered
later), the documents are still counted, only in a different way. But in this
case, where all the documents are retrieved, the counts are already available
in the Lucene index.
So, I think it would be a good optimization to detect these cases (i.e. no
filtering) and just return the number from the index instead of counting the
docs again.
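
A minimal sketch of that shortcut, in plain Java rather than Solr's actual
code. The class and method names are hypothetical, and it assumes that a
per-term document frequency is available and that the index has no deleted
documents (otherwise the stored frequencies could overcount). A null filter
stands for "no filtering".

```java
/** Hypothetical sketch of the "no filter" shortcut; not Solr's real code. */
final class FacetCountSketch {

    /**
     * @param docToOrd       per-document term ordinal (the int[maxDoc] array)
     * @param docFreqPerTerm number of documents per term, as stored in the index
     * @param matches        which documents pass the filter, or null for "all"
     */
    static int[] facetCounts(int[] docToOrd, int[] docFreqPerTerm, boolean[] matches) {
        if (matches == null) {
            // No filtering: the stored frequencies already are the facet counts.
            return docFreqPerTerm.clone();
        }
        // Filtered case: walk the per-document array and count the matches.
        int[] counts = new int[docFreqPerTerm.length];
        for (int doc = 0; doc < docToOrd.length; doc++) {
            if (matches[doc]) {
                counts[docToOrd[doc]]++;
            }
        }
        return counts;
    }
}
```

The unfiltered branch touches one entry per distinct value instead of one
entry per document.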

Another issue: there is currently no way to disable the caching of the
int[maxDoc], is there? If there are many fields to be faceted, this can
quickly lead to out-of-memory situations. I think it would be good to have
an option (as part of the query) to disable the caching. Even if it is
slow, at least it works, and it is useful for non-interactive processing.

Another possible optimization for the int[maxDoc], inspired by column-store
databases: what they do is find the minimum number of bits needed to
represent a value. If, for example, my language field has 30 possible
values (i.e. I have docs in 30 languages), I only need 5 bits for each doc
(instead of int=32 bits). Then I can represent the whole int[maxDoc] in less
than 1/6 of the space required now.
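
To make the arithmetic concrete: 5/32 is about 0.156, which is just under
1/6 (about 0.167). Below is a minimal sketch of such a packed array in plain
Java. The class name PackedOrdinals and its methods are hypothetical, not an
existing Lucene or Solr API; values are stored back to back in a long[] and
may straddle two longs.

```java
/** Hypothetical bit-packed replacement for int[maxDoc]; a sketch only. */
final class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdinals(int size, int numDistinctValues) {
        this.bitsPerValue = bitsRequired(numDistinctValues);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) size * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    /** Minimum number of bits that can tell numDistinctValues values apart. */
    static int bitsRequired(int numDistinctValues) {
        if (numDistinctValues <= 1) {
            return 1;
        }
        return 32 - Integer.numberOfLeadingZeros(numDistinctValues - 1);
    }

    void set(int index, int value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << offset)) | (v << offset);
        int spill = offset + bitsPerValue - 64; // bits that overflow into the next long
        if (spill > 0) {
            int lowBits = bitsPerValue - spill; // bits that fit in the current long
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | (v >>> lowBits);
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long v = blocks[block] >>> offset;
        int spill = offset + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (v & mask);
    }

    long sizeInBytes() {
        return 8L * blocks.length;
    }
}
```

For one million documents and 30 distinct values this sketch allocates
625,000 bytes, against 4,000,000 bytes for a plain int array.
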
What's even better, sometimes the documents can be partitioned such that not
all the values of a field are represented in the same partition.
For example, let's assume that I have a field called doc_generation_date. If
I harvest the documents every three days, and I consider a partition to be
the documents from the same three days, then each partition will have
only three possible values for doc_generation_date. That means that I
only need 2 bits for each document, plus a table for each partition
that maps from the partition value id (one of the three values represented
on two bits) to the index value id (that is, the id stored in the Lucene
index).
Of course, for the language field above, the partitioning would not help
unless I index only English docs first, then only French ones, etc.
Also, it wouldn't work as-is for multi-valued fields.
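
A minimal sketch of the per-partition table described above, in plain Java.
The class PartitionDictionary and its fields are hypothetical names, it
handles single-valued fields only, and the local ids are kept in an int[]
for readability where a real implementation would bit-pack them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-partition dictionary; a sketch, single-valued fields only. */
final class PartitionDictionary {
    final int[] localToGlobal; // table: partition value id -> index value id
    final int[] localIds;      // per-document partition value id
    final int bitsPerDoc;      // bits needed per document within this partition

    /** @param globalOrds index value id of each document in the partition */
    PartitionDictionary(int[] globalOrds) {
        Map<Integer, Integer> globalToLocal = new LinkedHashMap<>();
        localIds = new int[globalOrds.length];
        for (int doc = 0; doc < globalOrds.length; doc++) {
            Integer local = globalToLocal.get(globalOrds[doc]);
            if (local == null) {
                // First time this value is seen in the partition: assign the next small id.
                local = globalToLocal.size();
                globalToLocal.put(globalOrds[doc], local);
            }
            localIds[doc] = local;
        }
        localToGlobal = new int[globalToLocal.size()];
        for (Map.Entry<Integer, Integer> e : globalToLocal.entrySet()) {
            localToGlobal[e.getValue()] = e.getKey();
        }
        int distinct = Math.max(2, localToGlobal.length);
        bitsPerDoc = 32 - Integer.numberOfLeadingZeros(distinct - 1);
    }

    /** Translates a document's small local id back to the index value id. */
    int globalOrd(int doc) {
        return localToGlobal[localIds[doc]];
    }
}
```

With three distinct dates in a partition, bitsPerDoc comes out as 2 no matter
how large the index-wide value ids are.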

nicolae
