lucene-solr-user mailing list archives

From Nitzan Shaked <nitzan.sha...@gmail.com>
Subject Sub-Sequence token filter
Date Thu, 15 May 2014 07:02:51 GMT
Hi list

I created a small token filter which I'd gladly "contribute", but want to
know if there's any interest in it before I go and make it pretty, add
documentation, etc... ;)

I originally created it to index domain names: I wanted to be able to
search for "google.com" and find "www.google.com", "ads.google.com",
"mail.google.com", etc.

What it does is split a token (in my case, on "."), and then output all
contiguous sub-sequences of the pieces. So "a.b.c.d" will output "a", "b",
"c", "d", "a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it
only in the "index" analyzer, so a query for any of the generated tokens
finds the original token.
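
To make the expansion concrete, here is a minimal Python sketch of the logic described above. It is not the filter's actual code (a real Lucene token filter would presumably be written in Java), and the function name `sub_sequences` and its defaults are made up for illustration:

```python
import re


def sub_sequences(token, sep_regexp=r"[.]", glue="."):
    """Return every contiguous sub-sequence of the pieces of `token`.

    Hypothetical helper: splits on `sep_regexp`, then re-joins each run of
    adjacent pieces with `glue`, shortest runs first.
    """
    parts = re.split(sep_regexp, token)
    n = len(parts)
    out = []
    for length in range(1, n + 1):          # run length, in pieces
        for start in range(n - length + 1):  # leftmost piece of the run
            out.append(glue.join(parts[start:start + length]))
    return out


print(sub_sequences("a.b.c.d"))
print("google.com" in sub_sequences("www.google.com"))
```

A token with n pieces yields n(n+1)/2 sub-sequences, so the index grows quadratically with the number of pieces per token; that is harmless for domain names but worth keeping in mind for longer inputs.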

It has the following arguments:

sepRegexp: regular expression that the original token will be split
according to. (I use "[.]" for domains)
glue: string that will be used to join sub-sequences back together (I use
"." for domains)
minLen: minimum generated sub-sequence length
maxLen: maximum generated sub-sequence length (0 for unlimited, negative
numbers for token length minus specified amount)
anchor: "start" to only output prefixes, "end" to only output suffixes, or
"none" to output any sub-sequence
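
One possible reading of these arguments, sketched in Python. This is an illustration under assumptions, not the filter's real code: it assumes minLen and maxLen count pieces (not characters), and that a negative maxLen means "number of pieces minus that amount". The function name `filtered_sub_sequences` is hypothetical.

```python
import re


def filtered_sub_sequences(token, sep_regexp=r"[.]", glue=".",
                           min_len=1, max_len=0, anchor="none"):
    """Sub-sequences of `token`, limited by length and anchoring.

    Assumption: lengths are measured in pieces after splitting.
    """
    parts = re.split(sep_regexp, token)
    n = len(parts)
    # max_len == 0 means unlimited; a negative value is relative to n.
    upper = n + max_len if max_len <= 0 else min(max_len, n)
    out = []
    for length in range(max(min_len, 1), upper + 1):
        if anchor == "start":      # prefixes only
            starts = [0]
        elif anchor == "end":      # suffixes only
            starts = [n - length]
        elif anchor == "none":     # any position
            starts = range(n - length + 1)
        else:
            raise ValueError("anchor must be 'start', 'end' or 'none'")
        for start in starts:
            out.append(glue.join(parts[start:start + length]))
    return out


# Suffixes of at least two pieces, e.g. for matching a registered domain:
print(filtered_sub_sequences("www.google.com", min_len=2, anchor="end"))
```

For the domain use case, anchor="end" with minLen=2 would index only "google.com" and "www.google.com" for the token "www.google.com", which keeps the index smaller than emitting every sub-sequence.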

So... is this useful to anyone?
