lucene-dev mailing list archives

From Tomás Fernández Löbbe <tomasflo...@gmail.com>
Subject Re: Japanese Tokenizer using User Dictionary
Date Wed, 06 Apr 2016 00:50:54 GMT
Thanks Christian,
I created https://issues.apache.org/jira/browse/LUCENE-7181

On Mon, Apr 4, 2016 at 11:38 PM, Christian Moen <cm@atilika.com> wrote:

> Hello again Tomás,
>
> Thanks.  I agree entirely.  If you open a JIRA, I'll have a look and
> make improvements.
>
> Best regards,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <tomasflobbe@gmail.com>
> wrote:
>
> Thanks Christian,
> I don't have a different use case, but if what I said is the expected
> behavior, I think we should validate the User Dictionary at create time
> (and allow only proper tokenization) instead of breaking when using the
> tokenizer.
> If you agree I'll create a Jira for that.
>
> Thanks,
>
> Tomás
>
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <cm@atilika.com> wrote:
>
>> Hello Tomás,
>>
>> What you are describing is the expected behaviour.  If you have any
>> specific use cases that motivate how this perhaps should be changed, I'm
>> very happy to learn more about them to see how we can improve things.
>>
>> Many thanks,
>>
>> Christian Moen
>> アティリカ株式会社
>> https://www.atilika.com
>>
>> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflobbe@gmail.com>
>> wrote:
>> >
>> > If I understand correctly, the user dictionary in the JapaneseTokenizer
>> allows users to customize how a stream is broken into tokens using a
>> specific set of rules provided like:
>> > AABBBCC -> AA BBB CC
>> >
>> > It does not allow users to change any of the characters like:
>> >
>> > AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC";
>> it seems to care only about positions)
>> >
>> > It also doesn't let a character be part of more than one token, like:
>> >
>> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
>> >
>> > ...or make the output token bigger than the input text:
>> >
>> > AA -> AAA (Also AIOOBE)
>> >
>> > Is this the expected behavior? Maybe cases 2-4 should be handled by
>> adding filters then. If so, are there any cases where the user dictionary
>> should accept a tokenization where the original text differs from the
>> concatenation of the tokens?
>> >
>> > Tomás
>> >
>
>
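
The create-time validation proposed in this thread could look roughly like the sketch below. It is not the Lucene implementation: the class and method names (UserDictionaryCheck, segmentsMatchSurface, validate) are hypothetical, and it assumes an entry is given as a surface form plus a space-separated segmentation, as in the AABBBCC examples quoted above. The rule it checks is the one discussed in the thread: the segments, concatenated in order, must reproduce the surface form exactly.

```java
// Hypothetical sketch of a create-time check for user dictionary entries.
// Names and structure are assumptions for illustration, not Lucene's API.
final class UserDictionaryCheck {

    /**
     * Returns true when the space-separated segments, concatenated in order,
     * reproduce the surface form exactly.
     */
    static boolean segmentsMatchSurface(String surface, String segmentation) {
        String[] segments = segmentation.trim().split("\\s+");
        return String.join("", segments).equals(surface);
    }

    /** Rejects a bad entry when the dictionary is built, not at tokenization time. */
    static void validate(String surface, String segmentation) {
        if (!segmentsMatchSurface(surface, segmentation)) {
            throw new IllegalArgumentException(
                "Illegal user dictionary entry: segmentation \"" + segmentation
                    + "\" does not match surface form \"" + surface + "\"");
        }
    }

    public static void main(String[] args) {
        // The four cases from the thread, in order.
        String[][] entries = {
            {"AABBBCC", "AA BBB CC"},   // plain split
            {"AABBBCC", "DD BBB CC"},   // characters changed
            {"AABBBCC", "AAB BBB BCC"}, // characters shared between tokens
            {"AA", "AAA"},              // output longer than input
        };
        for (String[] entry : entries) {
            System.out.println(entry[0] + " -> " + entry[1] + " : "
                + segmentsMatchSurface(entry[0], entry[1]));
        }
    }
}
```

Under this rule only the first case is accepted; the other three are rejected when the dictionary is created instead of failing with an AIOOBE while tokenizing.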
