lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5734) HTMLStripCharFilter end offset should be left of closing tags
Date Thu, 05 Jun 2014 21:54:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019325#comment-14019325 ]

Steve Rowe commented on LUCENE-5734:
------------------------------------

Paraphrasing my answer to David from IRC: "adjacency" doesn't fully describe the effect you're
looking for, since text can be adjacent to both opening and closing tags, on either side of each.

Semantics aside, I agree that moving offsets prior to closing tags would align better with
intuitive expectations, and would very likely reduce the number of fixups highlighters would
have to make to balance tags for any given snippet within marked up text.

My only remaining concern is whether changing the behavior will negatively affect existing
users.  Maybe we could make the behavior configurable?  If that's done, there remains the
question of whether to leave the default behavior as it is now, or make the default be the
new behavior.

> HTMLStripCharFilter end offset should be left of closing tags
> -------------------------------------------------------------
>
>                 Key: LUCENE-5734
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5734
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> Consider this simple input:
> {noformat}
> <em>hello</em>
> {noformat}
> to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
> You get back one token for "hello".  Good.  The start offset of this token is at the
> position of 'h', which is also good.  But the end offset, surprisingly, falls one past the
> adjacent </em>, i.e. after the closing tag.  I argue that it should be one past the last
> character of the token (immediately following 'o').
> FYI, it behaves as I expect when "hello" is followed by an XML entity, as in this example:
> {noformat}hello&nbsp;{noformat}
> There, the end offset immediately follows the 'o'.
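
To make the two end-offset policies under discussion concrete, here is a minimal sketch in plain
Java. It is not Lucene code and does not call HTMLStripCharFilter: the class OffsetSketch, the
method firstTokenOffsets, and the flag endBeforeMarkup are hypothetical names chosen for
illustration, and the tag stripping is deliberately naive (anything between '<' and '>' is
dropped; entities are not handled). The flag stands in for the idea of making the behavior
configurable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how Lucene implements offset correction.
final class OffsetSketch {
    private OffsetSketch() {}

    /**
     * Strips <...> tags from input and returns {start, end} offsets, measured in the
     * ORIGINAL text, of the first whitespace-delimited token of the stripped text.
     * Returns null when the stripped text contains no token.
     *
     * endBeforeMarkup == false models the behavior reported in the issue: the end
     * offset lands after any markup that directly follows the token.
     * endBeforeMarkup == true models the requested behavior: the end offset is one
     * past the token's last character.
     */
    static int[] firstTokenOffsets(String input, boolean endBeforeMarkup) {
        StringBuilder stripped = new StringBuilder();
        List<Integer> origIndex = new ArrayList<>(); // original index of each kept char
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false; // tag ends; the '>' itself is dropped
                }
            } else if (c == '<') {
                inTag = true; // naive: an unterminated '<' swallows the rest
            } else {
                stripped.append(c);
                origIndex.add(i);
            }
        }

        // Locate the first whitespace-delimited token in the stripped text.
        int s = 0;
        while (s < stripped.length() && Character.isWhitespace(stripped.charAt(s))) {
            s++;
        }
        if (s == stripped.length()) {
            return null;
        }
        int e = s;
        while (e < stripped.length() && !Character.isWhitespace(stripped.charAt(e))) {
            e++;
        }

        int start = origIndex.get(s);
        int end;
        if (endBeforeMarkup) {
            // One past the last character of the token in the original text.
            end = origIndex.get(e - 1) + 1;
        } else {
            // Position of the next kept character, or end of input if none remains.
            end = (e < stripped.length()) ? origIndex.get(e) : input.length();
        }
        return new int[] {start, end};
    }
}
```

For the input <em>hello</em>, firstTokenOffsets(input, false) yields start 4 and end 14 (after
the closing tag), while firstTokenOffsets(input, true) yields start 4 and end 9 (immediately
after the 'o'). Note that this sketch treats opening and closing tags alike, so it does not
address the point above about "adjacency".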



--
This message was sent by Atlassian JIRA
(v6.2#6252)


