commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stephen Colebourne" <>
Subject Re: [collections] [lang] Req: Trie/ternary tree data structure
Date Mon, 01 Dec 2003 20:42:11 GMT
The subject of Tries has come up before, it'll all be on the archives. I
think there even may have been an implementation floating around.

At the time I blocked its addition to collections because it strikes me as a
relatively specific collection. My view basically remains that - collections
is now very big, and has recently effectively split into three.

If you want specific guidance on something, someone may be able to help. Its
not something I've played with, although I can sort of understand the design
and why it might be used.

Not sure that help really ;-)

----- Original Message -----
From: "J.Pietschmann" <>
> Hi all,
> I'd like to tap into the community wisdom for some thoughts.
>   == Background ==
> There are various task which require all substrings of a certain
> string to be looked up in a collection. The application I'm
> concerned with is pattern based word hyphenation, but there are
> others (spell check, stemming).
> An example: take the string "word". The task requires looking up
> the 10 strings "w", "wo", "wor", "word", "o", "or", "ord", "r",
> "rd" and "d".
> The idea is to take advantage of the fact that the strings to look
> up are related because they are substrings of another string. Another
> point is that you know that you deal with strings/characters, not
> arbitrary data as keys.
> Liang, Knuth et al. used a trie for handling this efficiently, both
> in terms of performance and memory. A node in a trie is basically
> a sorted array of characters, so that the character can be looked up
> quickly. The slot then points to the stored value as well as to
> another node.
> If it is known that a key string in the trie is never a substring
> of another key string, the pointers to the value and the following
> node can take up the same storage. All of the patters for hyphenation
> i've seen have this property, but there is no guarantee.
> The Apache FOP hyphenator uses a ternary tree as implementation.
> Basically it's the same, except the sorted array is replaced by a
> binary tree. Instread of using Java object for the tree nodes, tree
> char arrays are used for storing indices to the respective following
> nodes. Ideally, the binary trees should be balanced in order to provide
> the same O(log(n)) time for looking up like the sorted arrays in the trie
> do. Of course, if the ternary tree is build by inserting random key
> strings this requires red-black-trees or a similar mechanism.
> See
>    layout/hyphenation/
>    content-type=text/vnd.viewcvs-markup
> for the current state of the art.
> An example how strings are looked up in a trie: Suppose the collection
> contains the key strings "ord", "ot" and "rda".
> The trie is roughly
>      (0) o  r
>         /    \
>   (1) r t    d (2)
>       | -    |
>   (3) d      a (4)
>       -      -
> Now for the string lookup
> i=0 j=0 c='w' no match in (0) -> continue
> i=1 j=0 c='o' match 'o' in (0) -> (1)
>      j=1 c='r' match 'r' in (1) -> (3)
>      j=3 c='d' match 'd' in (3), leaf node, get value
> i=2 j=0 c='r' match 'r' in (0) -> (2)
>      j=1 c='d' match 'd' in (2) -> (4)
>      eos but no leaf node -> continue
> i=3 j=0 c='d' no match (0) -> continue
> You see, none of the longer substrings starting with "w" is actually
> looked up, which should provide some performance gain over simply using
> a hashtable.
>   == Requirements ==
> The data structure is built once from mor or less randomly distributed key
> strings. After this, it is used for many lookups. This means the key
> strings can first be gathered and analyzed, then stored into an optimized
> data structure.
> The number of expected key strings is of the order of several thousend to
> a few hundred thousand. Efficient memory usage is a must. Tradeoffs
> using less memory and lookup performance should be carefully considered.
> It should be possible to make the optimized data structure persistent. It
> should load fast from a data or serialized object file. Unfortunately, I
> have no experience whether it is more advanteous to serialize a tree of
> Java objects or an object with a one or few big arrays.
>   == Considerations ==
> 1. As already said, for the current hyphenation patters no key string
> is a substring at the beginning of another key. I.e if there is a "wor"
> key, there won't be "word" or "wora" keys, but there may be a "kword"
> key. This may allow some storage optimizations. (see
>    ?rev=1.2&content-type=text/vnd.viewcvs-markup
> for an example, the keys are the strings in the <patterns> sections sans
> the digits)
> However for using such a data structure for stemming, there may be keys
> which are also at the satrt of other keys. Actually, the curent state for
> hyphenation patters is to some extent an artefact of how the patterns are
> created, and it may well be it means we are using suboptimal patterns.
> 2. In most cases, the set of unicode characters in the key strings is
> It is possible that the charaters can be mapped onto bytes. For
> a mapping of the characters of the original string is necessary anyway
> (normalizing uppercase characters to lowercase etc.).
> Ok, now what do you think?
> Regards
> J.Pietschmann
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message