lucene-dev mailing list archives

From "Daniel Meehl (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
Date Sat, 19 Jan 2019 01:00:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746705#comment-16746705 ]

Daniel Meehl edited comment on LUCENE-8650 at 1/19/19 1:00 AM:
---------------------------------------------------------------

[~romseygeek] Yes, I will. The core issue is that Tokenizer implementations end up clearing
their Reader when they close(), and thus can never reset() without setting a new Reader.
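
For reference, here is a condensed sketch of the Tokenizer behavior I mean (paraphrased from
memory of the Lucene sources, so the details are approximate):

{code:java}
import java.io.IOException;
import java.io.Reader;

public abstract class Tokenizer extends TokenStream {
  // placeholder Reader that throws IllegalStateException on any read
  private static final Reader ILLEGAL_STATE_READER = new Reader() {
    @Override public int read(char[] cbuf, int off, int len) {
      throw new IllegalStateException("TokenStream contract violation: reset()/close() call missing");
    }
    @Override public void close() {}
  };

  protected Reader input = ILLEGAL_STATE_READER;
  private Reader inputPending = ILLEGAL_STATE_READER;

  @Override
  public void close() throws IOException {
    input.close();
    // both references are dropped here, so a later reset() has no pending
    // Reader to restore -- setReader() must be called again first
    input = inputPending = ILLEGAL_STATE_READER;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }
}
{code}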


was (Author: dmeehl):
[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end up clearing
their Reader when they end() and thus can never reset() without setting a new Reader.

> ConcatenatingTokenStream does not end() nor reset() properly
> ------------------------------------------------------------
>
>                 Key: LUCENE-8650
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8650
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Daniel Meehl
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling super.end() in
> their end() methods. ConcatenatingTokenStream fails to do this. Because of this, its final
> offset is not readable, and DefaultIndexingChain in turn fails to set lastStartOffset
> properly. This results in indexing problems, which can include unsearchable content or
> IllegalStateExceptions.
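> For comparison, the usual end() pattern elsewhere looks roughly like this (a generic sketch,
> not lifted from any one class; finalOffset stands for whatever end-of-stream offset the
> implementation tracked):
> {code:java}
> @Override
> public void end() throws IOException {
>   super.end(); // resets attributes to their end-of-stream defaults
>   // expose the offset just past the last character, so the indexing
>   // chain can read a meaningful final offset
>   offsetAtt.setOffset(finalOffset, finalOffset);
> }
> {code}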
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its
> currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the
> schema) do not work and content becomes unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for finalOffset, which, as
> you can see, ends up being 0.
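> The added check is along these lines (a sketch of the assertion's shape, not the patch
> verbatim; expectedFinalOffset stands for the length of the concatenated input):
> {code:java}
> stream.end();
> OffsetAttribute offsetAtt = stream.getAttribute(OffsetAttribute.class);
> // fails before the fix: endOffset() reports 0 instead of the offset
> // just past the concatenated input
> assertEquals(expectedFinalOffset, offsetAtt.endOffset());
> {code}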
> I created the next patch separately because it includes extra classes used for testing
> that Lucene may or may not want to merge in. This patch adds an integration test that loads
> some content into the 'text' field. The schema then copies it to 'content' using a copyField
> directive. The test searches the content field for the loaded text and fails to find it,
> even though the field does contain the content. Flip the debug flag to see a nicer printout
> of the response and what's in the index. Note that the added class I alluded to is
> KeywordTokenStream. This class had to be added because of another (ultimately unrelated)
> problem: ConcatenatingTokenStream cannot concatenate Tokenizers, because Tokenizer violates
> the contract put forth by TokenStream.reset(). That separate problem warrants its own
> ticket. Ultimately, though, KeywordTokenStream may be useful to others and could be
> considered for adding to the repo.
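> For illustration, a minimal version of such a class might look like the sketch below (my own
> reconstruction, not the attached code):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>
> public final class KeywordTokenStream extends TokenStream {
>   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>   private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>   private final String value;
>   private boolean done = false;
>
>   public KeywordTokenStream(String value) {
>     this.value = value;
>   }
>
>   @Override
>   public boolean incrementToken() {
>     if (done) {
>       return false;
>     }
>     clearAttributes();
>     termAtt.append(value);          // emit the whole input as one token
>     offsetAtt.setOffset(0, value.length());
>     done = true;
>     return true;
>   }
>
>   @Override
>   public void reset() throws IOException {
>     super.reset();
>     done = false; // unlike Tokenizer, this honors the reset() contract
>   }
>
>   @Override
>   public void end() throws IOException {
>     super.end();
>     offsetAtt.setOffset(value.length(), value.length()); // final offset
>   }
> }
> {code}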
> The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset
> as the last task in the end() method, and by resetting currentSource, offsetIncrement and
> finalOffset when reset() is called.
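> In outline, the fixed methods then take roughly this shape (a sketch using the field names
> from this description; lastSourceEndOffset is a hypothetical stand-in for however the patch
> tracks the end offset of the final source):
> {code:java}
> @Override
> public void end() throws IOException {
>   super.end();
>   // store and set the final offset as the last task of end()
>   finalOffset = offsetIncrement + lastSourceEndOffset;
>   offsetAtt.setOffset(finalOffset, finalOffset);
> }
>
> @Override
> public void reset() throws IOException {
>   super.reset();
>   currentSource = 0;   // read from the first source again
>   offsetIncrement = 0; // offsets restart from the beginning
>   finalOffset = 0;
> }
> {code}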



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


