lucene-java-user mailing list archives

From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject PositionFilter Deprecation and Questioning the associated Analyzer Invariant
Date Thu, 20 Nov 2014 16:14:01 GMT
I've run into an issue where I think I'd like to use the PositionFilter
plus RemoveDuplicatesTokenFilter to deduplicate tokens, effectively removing
the impact of term frequency for a specific field without having to convince
my client to accept a Java plugin (phrase queries don't matter in this case).
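
To make the intent concrete, here is a rough model of that filter chain in
plain Python. It is not Lucene code: the function names and the
(term, position_increment) tuples are made up for illustration, and the
behavior is only my understanding of the two filters (flatten every token
after the first onto the same position, then drop repeated terms at a
position).

```python
from collections import Counter


def flatten_positions(tokens):
    """Model of PositionFilter: the first token keeps its position
    increment, every later token gets an increment of 0."""
    return [
        (term, pos_inc if i == 0 else 0)
        for i, (term, pos_inc) in enumerate(tokens)
    ]


def remove_duplicates(tokens):
    """Model of RemoveDuplicatesTokenFilter: drop a token whose term was
    already seen at the current position."""
    seen = set()
    out = []
    for term, pos_inc in tokens:
        if pos_inc > 0:
            seen.clear()  # moved to a new position, start over
        if term in seen:
            continue  # same term at the same position: drop it
        seen.add(term)
        out.append((term, pos_inc))
    return out


def term_frequencies(tokens):
    """Count how often each term survives in the token stream."""
    return Counter(term for term, _ in tokens)


# Hypothetical analyzed field value: "red shoes red red"
tokens = [("red", 1), ("shoes", 1), ("red", 1), ("red", 1)]

# Deduplication alone does nothing, since every token has its own position.
print(term_frequencies(remove_duplicates(tokens)))

# Flattening positions first lets the duplicates collapse.
print(term_frequencies(remove_duplicates(flatten_positions(tokens))))
```

In this model the second call leaves each term exactly once, which is the
"term frequency no longer matters" effect I'm after.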

I realized that PositionFilter is deprecated, as per this Jira issue:
https://issues.apache.org/jira/browse/LUCENE-4981

The best justification I can find for this deprecation is this invariant
stated in the Jira issue:

> There are invariants that need to be maintained by token filters: all
> tokens that start at the same position must have the same start offset and
> all tokens that end at the same position (start position + position length)
> must have the same end offset (see ValidatingFilter). By arbitrarily
> changing position increments, PositionFilter breaks these invariants.
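
As I read it, the invariant can be sketched as the check below. Again this
is a plain-Python model, not the actual validating filter from Lucene: the
tuple layout (term, position_increment, position_length, start_offset,
end_offset) and the function name are assumptions for illustration.

```python
def check_invariants(tokens):
    """Return a list of violations of the rule quoted above:
    same start position => same start offset,
    same end position (start position + position length) => same end offset."""
    start_offsets = {}  # start position -> first start offset seen there
    end_offsets = {}    # end position -> first end offset seen there
    violations = []
    pos = -1
    for term, pos_inc, pos_len, start_off, end_off in tokens:
        pos += pos_inc
        end_pos = pos + pos_len
        if start_offsets.setdefault(pos, start_off) != start_off:
            violations.append((term, "start", pos, start_off))
        if end_offsets.setdefault(end_pos, end_off) != end_off:
            violations.append((term, "end", end_pos, end_off))
    return violations


# "red shoes" analyzed normally: one position per token.
normal = [("red", 1, 1, 0, 3), ("shoes", 1, 1, 4, 9)]

# A synonym stacked on the same position with the same offsets is fine.
synonym = [("fast", 1, 1, 0, 4), ("quick", 0, 1, 0, 4)]

# "red shoes" after flattening positions: both tokens at position 0,
# but their offsets still point at different parts of the text.
flattened = [("red", 1, 1, 0, 3), ("shoes", 0, 1, 4, 9)]

for name, stream in [("normal", normal), ("synonym", synonym), ("flattened", flattened)]:
    print(name, check_invariants(stream))
```

Only the flattened stream trips the check in this model, which matches the
complaint in the Jira issue: positions get collapsed while the offsets stay
where they were.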

I question this invariant.

I can see why this invariant is important for several features, such as
highlighting. On the other hand, it's extremely common to copy fields to
have alternate analysis run on them (i.e. Solr copyFields). These fields will
only ever be indexed and never displayed to the user. Does this invariant
still matter in this case?

I could see adjusting offsets in an analyzer. However, I feel like offsets
are a bit sacrosanct -- they refer to a character offset in the original
document -- not the result of analysis. Am I wrong in feeling this way?

So I question why PositionFilter was deprecated. It feels like the
invariant makes sense for any field displayed to users, but many times we
create fields with different analyzer chains that don't need to concern
themselves with features that care about the sanity of the token graph. It
seems this should be a decision left up to developers.

Thoughts?

Cheers,
-- 
Doug Turnbull
Search & Big Data Architect
OpenSource Connections <http://o19s.com>
