lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer
Date Mon, 01 Oct 2018 07:47:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633677#comment-16633677 ]

Alan Woodward commented on LUCENE-8516:
---------------------------------------

Comment from [~msokolov@gmail.com]:

My current usage of this filter requires it to be a filter, since I need to precede it with
other filters. I think the idea of not touching offsets preserves more flexibility, and since
the offsets are already unreliable, we wouldn't be losing much.

> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
>                 Key: LUCENE-8516
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in effect adding
> a second round of tokenization, can cause any number of problems when trying to keep tokenstreams
> to contract.  The most common offender here is the WordDelimiterGraphFilter, which can produce
> broken offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right, which should preserve all the functionality
> we need, but make reasoning about the resulting tokenstream much simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

