lucene-dev mailing list archives

From Bob Carpenter <>
Subject Re: Edit-distance strategy (slicing and one vs. all algorithms)
Date Fri, 09 Jun 2006 15:09:00 GMT
 > Can you share how you implemented the trie in your app? Especially
 > interesting for me is how you go about memory consumption.
 > Have you tried really large dictionaries (1M+)?

Sure can.  You could slog through the source,
which is available from:

There are several trie implementations.  The
approximate dictionary trie is implemented very crudely.
We've run it on 1M gene names derived from
EntrezGene (anywhere from 2 to 60 characters long).
I don't know the speed numbers offhand, but nothing
to write home about.  The real problem is that it too
easily matched short terms to short terms within a
given edit distance.  So a gene name AT was
matching GT or IT or whatever at edit-distance 1,
whereas alpha-ACT wasn't matching AACT like
we wanted.  Still haven't solved this problem -- we
went back to using the method we described in
our case study in Lucene in Action.
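For what it's worth, the short-string problem is easy to reproduce with plain Levenshtein distance, and one common workaround (just a sketch, not what LingPipe actually does) is to scale the allowed edits by string length rather than using a fixed threshold:

```java
public class EditDistanceDemo {

    // Standard Levenshtein distance via two-row dynamic programming.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Length-relative match: longer strings may absorb more edits,
    // so AT/GT (half the characters differ) is rejected while a
    // one-edit difference in a six-character string is accepted.
    static boolean matches(String a, String b, double maxRate) {
        int maxLen = Math.max(a.length(), b.length());
        return distance(a, b) <= maxRate * maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("AT", "GT"));              // 1
        System.out.println(matches("AT", "GT", 0.25));         // false
        System.out.println(matches("ACTACT", "ACTAGT", 0.25)); // true
    }
}
```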

The spelling checker trie for known tokens is also very crude, as it's not
a bottleneck in that processing (not even 0.1% of the profiled time);
the language model tries were, though.  We've scaled the spelling trie
to several million entries for a large customer.

The trie underlying our language models is implemented
pretty space-efficiently in-memory.  It only stores counts
at each node, not general records or even terminal/non-terminal
information as you'd need in a word list.  It codes trie nodes to an
interface, with PAT(ricia) implementations, shared terminal node implementations,
etc., with unfolding all the way down to elements so that
nodes with 1-3 daughters don't allocate arrays, counts below
256 are represented with bytes, etc.  It uses a factory to sort out
all the object creations/sharing.  I wouldn't
normally be so aggressive about optimizing this kind of thing, but memory
and speed were a real problem with the naive implementation, and
this sits in most of our inner loops for model training.  
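To give a feel for the unfolding idea, here's a minimal sketch (hypothetical class names, not LingPipe's actual implementation): specialized node classes avoid array allocation for nodes with few daughters, and a leaf whose count fits in a byte stores just that byte.

```java
import java.util.Arrays;

// Hypothetical sketch of "unfolded" trie nodes: a factory would pick
// the smallest representation that fits each node's shape and count.
abstract class Node {
    abstract long count();
    abstract Node daughter(char c);
}

// Leaf whose count fits in 0..255: a single byte field, no arrays.
final class ByteCountLeaf extends Node {
    private final byte count;
    ByteCountLeaf(int count) { this.count = (byte) count; }
    long count() { return count & 0xFF; }  // widen back to unsigned
    Node daughter(char c) { return null; }
}

// Node with exactly one daughter: no array allocation either.
final class OneDaughterNode extends Node {
    private final long count;
    private final char c;
    private final Node daughter;
    OneDaughterNode(long count, char c, Node daughter) {
        this.count = count; this.c = c; this.daughter = daughter;
    }
    long count() { return count; }
    Node daughter(char ch) { return ch == c ? daughter : null; }
}

// General case: parallel sorted char/daughter arrays, binary-searched.
final class ArrayNode extends Node {
    private final long count;
    private final char[] cs;
    private final Node[] daughters;
    ArrayNode(long count, char[] cs, Node[] daughters) {
        this.count = count; this.cs = cs; this.daughters = daughters;
    }
    long count() { return count; }
    Node daughter(char ch) {
        int i = Arrays.binarySearch(cs, ch);
        return i >= 0 ? daughters[i] : null;
    }
}
```

The saving comes from the per-node constant factors: a one-daughter node here is three fields, versus an array node's two array headers plus contents.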

The tries can be compiled into a form where they can be easily
written to disk and read back in for efficient run-time use
for language modeling (estimating probability of a string
given a topic, language, sentiment, etc.).  That's the critical
step in most of our run-time operations.
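As a toy illustration of that estimation step (a character-bigram model with add-one smoothing, far simpler than LingPipe's actual n-gram models): the probability of a string is a product of per-character conditional estimates read off the stored counts, and classification just compares scores across per-topic models.

```java
import java.util.HashMap;
import java.util.Map;

// Toy character-bigram language model (not LingPipe's API).
public class BigramLm {
    private final Map<String, Integer> bigram = new HashMap<>();
    private final Map<Character, Integer> unigram = new HashMap<>();
    private final int alphabetSize;  // smoothing denominator

    public BigramLm(int alphabetSize) { this.alphabetSize = alphabetSize; }

    public void train(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            unigram.merge(text.charAt(i), 1, Integer::sum);
            bigram.merge(text.substring(i, i + 2), 1, Integer::sum);
        }
    }

    // log2 P(text): sum of smoothed conditional character log-probs.
    // Higher means the string looks more like the training data.
    public double log2Prob(String text) {
        double lp = 0.0;
        for (int i = 0; i + 1 < text.length(); i++) {
            int num = bigram.getOrDefault(text.substring(i, i + 2), 0) + 1;
            int den = unigram.getOrDefault(text.charAt(i), 0) + alphabetSize;
            lp += Math.log((double) num / den) / Math.log(2);
        }
        return lp;
    }
}
```

Usage would be one model per topic/language/sentiment class, trained separately, with the argmax over log2Prob picking the class.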

Anyway, there's a paper on the scalable LM tries that covers everything
in pretty excruciating detail (even more than this message):

In LingPipe 2.3, which should be out next week, there are methods
to write character tries to disk and to merge on-disk tries with
and without pruning.  This uses some of the same bit-level
coding techniques (gamma coding, specifically) as the inverted indexes in
search engines, and yields
a very tight memory representation with good merging properties.
I've basically adopted Lucene's strategy of on-disk representations
writing out to a mergeable streaming format that can scale on disk
(or in memory).
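For reference, Elias gamma coding writes a positive integer n as floor(log2 n) zero bits followed by n in binary, so small values take very few bits; that's what makes delta-encoded count sequences so compact. A toy string-based sketch (real implementations pack bits, of course):

```java
// Elias gamma coding over strings of '0'/'1' chars, for clarity only.
public class GammaCode {

    // Encode n >= 1: (length-1) zero bits, then n in binary.
    static String encode(int n) {
        String bin = Integer.toBinaryString(n);
        return "0".repeat(bin.length() - 1) + bin;
    }

    // Decode one gamma-coded integer from the front of the bit string:
    // count the leading zeros, then read that many + 1 bits of binary.
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        System.out.println(encode(1)); // 1
        System.out.println(encode(5)); // 00101
        System.out.println(decode("00101")); // 5
    }
}
```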

- Bob Carpenter

eks dev wrote:
> Hi Bob, 
> really nice answer!
>> The real gain would be to do something like the
>> edit-distance generalization of Aho-Corasick.  The
>> basic idea is that instead of n iterations of string vs. string,
>> you do one iteration of string vs. trie. 
> I was experimenting a bit with a ternary trie, as it has some nice properties, e.g. being
> 40% faster than the standard Java or Trove HashMap for exact matches, but I never got to
> finish path compression and null-node reduction (which saves a huge amount of memory).
> Must do it one day.
> thanks!

