lucene-dev mailing list archives

From "Daniel Meehl (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset
Date Sat, 19 Jan 2019 00:59:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746759#comment-16746759 ]

Daniel Meehl edited comment on LUCENE-8651 at 1/19/19 12:58 AM:
----------------------------------------------------------------

As a little more of an explanation, all I did here was replace the KeywordTokenStream (from
the 1st patch) with a KeywordTokenizer. This causes the test to fail with an IllegalStateException
because the KeywordTokenizer has its close() and then reset() methods called, which swaps
out the previously set reader for Tokenizer.ILLEGAL_STATE_READER.
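
For reference, here is a minimal, self-contained sketch (not the attached test; the class
name is illustrative) of the sequence described above, assuming Lucene's stock
KeywordTokenizer:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.core.KeywordTokenizer;

    public class ResetAfterCloseSketch {
      public static void main(String[] args) throws IOException {
        KeywordTokenizer tok = new KeywordTokenizer();
        tok.setReader(new StringReader("first"));
        tok.reset();
        while (tok.incrementToken()) { /* consume the single keyword token */ }
        tok.end();
        tok.close();          // Tokenizer.close() swaps the pending reader for ILLEGAL_STATE_READER

        tok.reset();          // no setReader() in between, so input stays ILLEGAL_STATE_READER
        tok.incrementToken(); // throws IllegalStateException from ILLEGAL_STATE_READER.read()
      }
    }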


was (Author: dmeehl):
As a little more of an explanation, all I did here was replace the KeywordTokenStream (from
the 1st patch) with a KeywordTokenizer. This causes the test to fail with an IllegalStateException
because the KeywordTokenizer has its end and then reset methods called, which swaps out the
previously set reader for Tokenizer.ILLEGAL_STATE_READER.

> Tokenizer implementations can't be reset
> ----------------------------------------
>
>                 Key: LUCENE-8651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8651
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Daniel Meehl
>            Priority: Major
>         Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() every time
> before reset() is called. The reason for this is that Tokenizer violates the contract put
> forth by TokenStream.reset(), which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement this method
> so that they can be reused, just as if they had been created fresh."
> A Tokenizer implementation's reset() can't reset in that manner because Tokenizer.close()
> removes the reference to the underlying Reader (a consequence of LUCENE-2387). The catch-22
> here is that we don't want to unnecessarily keep a Reader around (a memory leak), but we
> would like to be able to reset() if necessary.
> The patches include an integration test that attempts to use a ConcatenatingTokenStream
> to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an
> IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER.
>
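
A compact sketch of the reuse pattern the description alludes to: under the current contract,
reuse only works when setReader() is called again before every reset() (the class name below
is illustrative; the Lucene calls are the stock ones):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.core.KeywordTokenizer;

    public class ReuseWithSetReaderSketch {
      public static void main(String[] args) throws IOException {
        KeywordTokenizer tok = new KeywordTokenizer();
        for (String value : new String[] {"first", "second"}) {
          // Required before every reset(); skipping this leaves ILLEGAL_STATE_READER in place.
          tok.setReader(new StringReader(value));
          tok.reset();
          while (tok.incrementToken()) { /* one keyword token per pass */ }
          tok.end();
          tok.close();
        }
      }
    }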





