lucene-dev mailing list archives

From "Nitzan Shaked (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5674) A new token filter: SubSequence
Date Mon, 26 May 2014 04:28:01 GMT


Nitzan Shaked commented on LUCENE-5674:


1) I'll attach a "squashed" version of the patch, without history; hopefully that will be
easier to read.
2) I don't know how to "prove" that something can't be done using existing analysis components,
but after spending quite some time on this, and after asking on S.O., I am fairly convinced
that it indeed cannot be done using existing components.
3) Instantiating with minLen > maxLen is OK, since maxLen can be negative (-2 to count 2
sub-tokens from the end, for example). minLen may also be greater than some tokens' lengths
(in sub-parts). In those cases there will simply be no output for the given token. I'll
add a check that when both minLen and maxLen are positive, minLen <= maxLen.

Otis: while I'm adding this last check, I'll also add the "reverse" option; I can see why
that might be useful.
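
The check described in point 3 could look roughly like the sketch below. This is not
taken from the attached patch: the class name SubSequenceLengths and its methods are
hypothetical, and the handling of negative maxLen follows the wording of the issue
description (token length in sub-parts minus the specified length), which is an
assumption about the intended semantics.

```java
/** Hypothetical helper, not part of the patch: illustrates the minLen/maxLen rules. */
public final class SubSequenceLengths {
    private SubSequenceLengths() {}

    /** Rejects minLen > maxLen only when both are positive, as proposed in point 3. */
    public static void validate(int minLen, int maxLen) {
        if (minLen > 0 && maxLen > 0 && minLen > maxLen) {
            throw new IllegalArgumentException(
                "minLen (" + minLen + ") must be <= maxLen (" + maxLen + ") when both are positive");
        }
    }

    /**
     * Resolves maxLen for a token with partCount sub-parts.
     * Assumption: 0 means unlimited, negative means partCount minus |maxLen|.
     */
    public static int effectiveMaxLen(int maxLen, int partCount) {
        if (maxLen == 0) {
            return partCount;
        }
        if (maxLen < 0) {
            return partCount + maxLen;
        }
        return Math.min(maxLen, partCount);
    }

    /** True if at least one sub-sequence length satisfies both bounds for this token. */
    public static boolean producesOutput(int minLen, int maxLen, int partCount) {
        return Math.max(minLen, 1) <= effectiveMaxLen(maxLen, partCount);
    }
}
```

Under these assumptions, minLen=3 with maxLen=-2 passes validation, and a token with only
4 sub-parts simply yields no output because its effective maxLen is 2.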

> A new token filter: SubSequence
> -------------------------------
>                 Key: LUCENE-5674
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>            Reporter: Nitzan Shaked
>            Priority: Minor
>         Attachments: subseqfilter.patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> A new configurable token filter which, given a token, breaks it into sub-parts and outputs
> consecutive sub-sequences of those sub-parts.
> Useful, for example, during indexing to generate variations on domain names,
> so that "" can be found by searching for "", or "".
> Parameters:
> sepRegexp: A regular expression used to split incoming tokens into sub-parts.
> glue: A string used to concatenate sub-parts together when creating sub-sequences.
> minLen: Minimum length (in sub-parts) of output sub-sequences
> maxLen: Maximum length (in sub-parts) of output sub-sequences (0 for unlimited; negative
> numbers for token length in sub-parts minus specified length)
> anchor: Anchor.START to output only prefixes, Anchor.END to output only suffixes,
> or Anchor.NONE to output any sub-sequence
> withOriginal: whether to also output the original token
> EDIT: now includes tests for filter and for factory.
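
To make the parameters above concrete, here is a minimal standalone sketch of the
sub-sequence generation in plain Java. It is not the code from subseqfilter.patch and does
not use the Lucene TokenFilter API; the class SubSequenceSketch, the method subSequences,
the output order, and the choice to skip a sub-sequence that equals the original token when
withOriginal is set are all assumptions made for illustration. The token "www.example.com"
is likewise an invented example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of the described parameters; not the patch's implementation. */
public final class SubSequenceSketch {
    public enum Anchor { START, END, NONE }

    private SubSequenceSketch() {}

    public static List<String> subSequences(String token, String sepRegexp, String glue,
                                            int minLen, int maxLen, Anchor anchor,
                                            boolean withOriginal) {
        // Split the token into sub-parts using the separator regular expression.
        String[] parts = Pattern.compile(sepRegexp).split(token);
        int n = parts.length;

        // Assumed maxLen semantics: 0 = unlimited, negative = n minus |maxLen|.
        int hi = (maxLen == 0) ? n : (maxLen < 0 ? n + maxLen : Math.min(maxLen, n));
        int lo = Math.max(minLen, 1);

        List<String> out = new ArrayList<>();
        if (withOriginal) {
            out.add(token);
        }
        // Enumerate consecutive runs of sub-parts, shortest first.
        for (int len = lo; len <= hi; len++) {
            for (int start = 0; start + len <= n; start++) {
                boolean isPrefix = (start == 0);
                boolean isSuffix = (start + len == n);
                if (anchor == Anchor.START && !isPrefix) continue;
                if (anchor == Anchor.END && !isSuffix) continue;
                String joined = String.join(glue, Arrays.asList(parts).subList(start, start + len));
                // Assumption: avoid emitting the original token twice.
                if (withOriginal && joined.equals(token)) continue;
                out.add(joined);
            }
        }
        return out;
    }
}
```

For instance, with sepRegexp "\\.", glue ".", minLen 2, maxLen 0 and Anchor.END, the token
"www.example.com" yields "example.com" and "www.example.com" in this sketch. When the bounds
leave no valid length for a token, the loops do not run and nothing is emitted for it.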
