lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7267) Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps
Date Mon, 02 May 2016 04:23:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266101#comment-15266101 ]

David Smiley commented on LUCENE-7267:
--------------------------------------

RE the default offset gap being 1 -- it's been this way since I don't know how long.  Note
that the PostingsHighlighter assumes a single char offset gap.  What do you think Lucene _should_
be doing here?  It's not clear to me what you propose.  What it's doing seems fine to me but
maybe I'm not understanding your point?
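
To make the remark about a single-character gap concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a highlighter that rebuilds the text of a multivalued field by joining the values with one separator character; that is one reading of the comment above, not something the message spells out. The class and method names (`JoinedValuesSketch`, `join`, `valueStart`) are hypothetical.

```java
/** Illustrative only: shows why a one-character separator lines up with an offset gap of 1. */
final class JoinedValuesSketch {

    /** Joins the field values with a single separator character. */
    static String join(String[] values, char separator) {
        return String.join(String.valueOf(separator), values);
    }

    /** Offset in the joined text at which the value with the given index starts. */
    static int valueStart(String[] values, int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            start += values[i].length() + 1; // value length plus the one-character separator
        }
        return start;
    }

    public static void main(String[] args) {
        String[] values = {"foo bar", "baz"};
        String joined = join(values, ' ');
        int start = valueStart(values, 1);
        // The second value begins one character past the end of the first value.
        System.out.println("'" + joined + "' -> second value starts at " + start);
    }
}
```

With these inputs the second value starts at offset 8, one past the end offset (7) of the first value. An index-time offset gap of 1 would produce that same offset, provided the separator really is a single character.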

> Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7267
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7267
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Priority: Minor
>
> This took me somewhat by surprise. We have some fairly complex code that uses multivalued fields with explicit token streams (which provide their own offset data).
> It was surprising to see that offsets for subsequent values were shifted by 1 compared to what was explicitly provided in the OffsetAttribute. A bit of debugging turned up this code inside {{PerField.invert}}:
> {code}
>       if (analyzed) {
>         invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
>         invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
>       }
> {code}
> A field with an explicit token stream must still be declared as tokenized, and PerField then assumes that the field's tokens came from an analyzer (when in fact they didn't):
> {code}
>       final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
> {code}
> While the default position increment gap is 0, the default offset gap isn't -- it's 1, which causes the shift.
> Thoughts?
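
The shift described in the issue can be reproduced with a small model of the offset bookkeeping. This is a simplified sketch in plain Java, not Lucene's actual code: `base` stands in for `invertState.offset`, the end offset of a value's last token stands in for the stream's final offset, and the class and method names (`OffsetGapSketch`, `indexedOffsets`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of per-field offset accumulation across the values of a multivalued field. */
final class OffsetGapSketch {

    /**
     * Computes the {start, end} offsets that would be recorded for every token.
     * Each value is an array of {start, end} pairs that are local to that value.
     */
    static List<int[]> indexedOffsets(int[][][] values, int offsetGap) {
        List<int[]> out = new ArrayList<>();
        int base = 0; // plays the role of invertState.offset
        for (int[][] value : values) {
            int lastEnd = 0;
            for (int[] token : value) {
                out.add(new int[] {base + token[0], base + token[1]});
                lastEnd = token[1];
            }
            base += lastEnd;   // advance past the value that was just consumed
            base += offsetGap; // then apply the gap, as the "if (analyzed)" branch does
        }
        return out;
    }

    public static void main(String[] args) {
        // Two values, "foo bar" and "baz", each with offsets local to its own text.
        int[][][] values = {
            {{0, 3}, {4, 7}},
            {{0, 3}}
        };
        for (int gap : new int[] {0, 1}) {
            List<int[]> offsets = indexedOffsets(values, gap);
            int[] last = offsets.get(offsets.size() - 1);
            System.out.println("gap=" + gap + " -> baz at [" + last[0] + ", " + last[1] + "]");
        }
    }
}
```

In this model a gap of 0 places "baz" at [7, 10], directly after the first value, while a gap of 1 places it at [8, 11]. The second result is the one-character shift reported for values after the first.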



