lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
Date Tue, 04 Oct 2016 21:53:20 GMT


Michael McCandless commented on LUCENE-7465:

Thank you for the example [~dweiss].  Indeed that's a hard regexp to determinize.  It's interesting
because the determinization requires many states, yet it minimizes to an apparently contained
number of states (though many transitions).

E.g. at 30 clauses, the determinized form has 7652 states and 136898 transitions, but after
minimization that drops to 150 states and 2960 transitions.  I tried to run {{dot}} on this FSA
but it struggles :)
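
As a rough illustration of how determinization can blow up, here is a minimal sketch of the subset construction applied to a textbook pattern, {{(a|b)*a(a|b){k}}}. This is not the regexp from the example above and it does not use Lucene's {{Operations.determinize}}; the class and method names are hypothetical. Unlike the case discussed here, the minimal DFA for this textbook pattern is just as large as the determinized one, so it only shows the growth, not the shrinkage after minimization.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: counts the DFA states reachable by subset construction
// for the NFA of (a|b)*a(a|b){k}. NFA states are 0..k+1; state 0 loops on
// a and b, moves to 1 on 'a'; states 1..k advance on either letter; k+1 accepts.
final class SubsetBlowup {

  static int countDfaStates(int k) {
    if (k < 0 || k > 20) {
      throw new IllegalArgumentException("k must be in [0, 20] for this sketch");
    }
    int full = (1 << (k + 2)) - 1;      // bit i set means NFA state i is in the subset
    Set<Integer> seen = new HashSet<>();
    ArrayDeque<Integer> worklist = new ArrayDeque<>();
    int start = 1;                      // the subset {0}
    seen.add(start);
    worklist.add(start);
    while (!worklist.isEmpty()) {
      int s = worklist.removeFirst();
      // On 'a': state 0 stays and also moves to 1; states 1..k advance.
      int onA = ((s << 1) | 1) & full;
      // On 'b': state 0 only stays; states 1..k advance.
      int onB = (((s & ~1) << 1) | 1) & full;
      if (seen.add(onA)) worklist.add(onA);
      if (seen.add(onB)) worklist.add(onB);
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int k = 0; k <= 10; k++) {
      System.out.println("k=" + k + " nfaStates=" + (k + 2)
          + " dfaStates=" + countDfaStates(k));
    }
  }
}
```

The NFA grows linearly in k while the reachable DFA subsets double with every step (2^(k+1) states), which is the kind of growth that makes some patterns impractical to determinize.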

Net/net the DFA approach is not usable in some cases (like this one); such users must use
the JDK implementation.  Maybe we should explore an {{re2j}} version too.

bq. Btw. if you're looking into this again, piggyback a change to Operations.determinize and
replace LinkedList with an ArrayDeque, it certainly won't hurt.

Excellent, I'll fold that in!
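
For reference, the kind of change being suggested looks roughly like the sketch below: a FIFO worklist backed by {{ArrayDeque}} rather than {{LinkedList}}. This is a generic worklist traversal over a hypothetical successor table, not the actual code in {{Operations.determinize}}; all names here are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a worklist loop using ArrayDeque as the queue.
final class WorklistSketch {

  // successors[s] lists the states reachable from state s in one step.
  static List<Integer> visitOrder(int[][] successors, int start) {
    boolean[] seen = new boolean[successors.length];
    // Previously this might have been: new LinkedList<>()
    Deque<Integer> worklist = new ArrayDeque<>();
    List<Integer> order = new ArrayList<>();
    seen[start] = true;
    worklist.add(start);
    while (!worklist.isEmpty()) {
      int state = worklist.removeFirst();
      order.add(state);
      for (int next : successors[state]) {
        if (!seen[next]) {
          seen[next] = true;
          worklist.add(next);
        }
      }
    }
    return order;
  }

  public static void main(String[] args) {
    int[][] successors = { {1, 2}, {3}, {3}, {} };
    System.out.println(visitOrder(successors, 0));
  }
}
```

{{ArrayDeque}} keeps its elements in a resizable array, so it avoids allocating a node per queued element the way {{LinkedList}} does; the loop itself is unchanged.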

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>                 Key: LUCENE-7465
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
> I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's
> RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted
>     the user discovers it up front instead of later on when a "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 characters at
>     a time, vs the existing {{PatternTokenizer}} which currently reads the entire string up front
>     (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and improved tests,
> but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps don't yet implement
> sub group capture.  I think we could add that at some point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we did that
> we should maybe name it differently ({{SimplePatternSplitTokenizer}}?).
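
The streaming point in the description above can be sketched as follows. This is only an illustration of pulling 128 characters at a time from a {{Reader}} instead of materializing the whole input; it is not the code in the attached patch. The class name is hypothetical, and a simple "letters and digits" test stands in for running a compiled DFA.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: tokenizes a character stream in fixed-size chunks.
final class ChunkedTokenizerSketch {

  static final int CHUNK = 128; // characters pulled per read, as in the description

  static List<String> tokenize(Reader in) throws IOException {
    char[] buffer = new char[CHUNK];
    StringBuilder token = new StringBuilder();
    List<String> tokens = new ArrayList<>();
    int n;
    while ((n = in.read(buffer, 0, CHUNK)) != -1) {
      for (int i = 0; i < n; i++) {
        char c = buffer[i];
        if (Character.isLetterOrDigit(c)) {
          token.append(c);              // token may span chunk boundaries
        } else if (token.length() > 0) {
          tokens.add(token.toString()); // a non-matching char ends the token
          token.setLength(0);
        }
      }
    }
    if (token.length() > 0) {
      tokens.add(token.toString());     // flush the final token at end of input
    }
    return tokens;
  }

  // Convenience wrapper for in-memory input.
  static List<String> tokenize(String text) {
    try {
      return tokenize(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(tokenize("foo bar,baz"));
  }
}
```

Only the fixed-size buffer and the token currently being built are held in memory, so heap use depends on the longest token rather than on the size of the document.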

This message was sent by Atlassian JIRA
