lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
Date Fri, 06 Jan 2017 15:36:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804800#comment-15804800 ]

David Smiley commented on LUCENE-7620:
--------------------------------------

bq. Though I wonder if we should also break the sentence if it's too long? Maybe the wrapped
BreakIterator could always be a sentence one and we could use a word BreakIterator to cut sentences
that are too long? This way it would produce snippets that are similar to the SimpleFragmenter's.
It could also be done in another BreakIterator on top of this one, but this would make things
overcomplicated, I guess.

If you choose a lengthGoal on the low side, maybe "too long" will tend not to be a problem?
Or see my TODO at the top of the file -- essentially, choose the break that is closest to
the goal instead of always the first one following it.  Maybe I'll add that in my next patch.
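
For illustration, here is a rough sketch of the "closest to the goal" idea using only java.text.BreakIterator. It is not the code from the patch; the class and method names (ClosestBreakSketch, closestBreak) are made up, and a sentence BreakIterator is assumed as the wrapped iterator:

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper, not part of the LUCENE-7620 patch. */
final class ClosestBreakSketch {

  /**
   * Returns the sentence boundary nearest to start + lengthGoal.
   * Assumes 0 <= start <= text.length() and lengthGoal >= 1.
   */
  static int closestBreak(String text, int start, int lengthGoal) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentences.setText(text);
    int target = Math.min(start + lengthGoal, text.length());
    if (sentences.isBoundary(target)) {
      return target; // the goal lands exactly on a break (or at the end of the text)
    }
    int after = sentences.following(target);  // first break past the goal
    int before = sentences.preceding(target); // last break ahead of the goal
    if (before <= start) {
      return after; // nothing usable before the goal; avoid an empty passage
    }
    // Pick whichever break is nearer to the goal; ties go to the shorter passage.
    return (target - before <= after - target) ? before : after;
  }

  public static void main(String[] args) {
    String text = "One. Two three four five six seven. Eight.";
    System.out.println(closestBreak(text, 0, 8));
    System.out.println(closestBreak(text, 0, 30));
  }
}
```

With a goal just past a short first sentence, this picks the break before the goal, whereas "always the first break following the goal" would pull in the whole next sentence.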

I don't think we should try to emulate SimpleFragmenter exactly.  We can do a much better
job ;-)   I like this implementation as a wrapper BreakIterator.  Perhaps we'll add a regex-based
BreakIterator one day, and then it would simply fit right in.

bq. For the implementation, can you throw an exception in the methods that should not be called?
For instance ...(etc)

Yeah I could go either way on that... how about {{assert false : "not supported/expected";}}?
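
To make the trade-off concrete, here is a small sketch of the two options. The class and method names are hypothetical and stand in for whichever BreakIterator methods end up unsupported; the one fact relied on is that a Java assert only fires when assertions are enabled (java -ea):

```java
/** Hypothetical illustration; the method names do not come from the patch. */
final class UnsupportedCallSketch {

  // Option A: fails on every call, regardless of JVM flags.
  static int previousStrict() {
    throw new UnsupportedOperationException("not supported/expected");
  }

  // Option B: fails only when assertions are enabled (java -ea), which is
  // typical under tests. Otherwise it falls through to the fallback value.
  static int previousLenient(int fallback) {
    assert false : "not supported/expected";
    return fallback;
  }
}
```

So the assert flags misuse in test runs while staying lenient in production, and the exception fails everywhere.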
 

bq. Additionally, I think that we should have a way to change the start and end of a passage
once we know all the matches that it contains. This is what the FVH does, and it should be
doable in the UH because passages are created on the fly in a forward manner. This is of
course not the purpose of this issue and it should be treated as a new feature, but I think
it would be great to have the same output as the FVH when the max length of the passage
is set.

Definitely a separate issue.  It wouldn't fit into the BreakIterator abstraction either.
Maybe some kind of Passage post-processor.  Or maybe simply expose sufficient hooks to
allow subclassers to do this.  That keeps the UH simpler.


> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
>                 Key: LUCENE-7620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates fragments (aka
> Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  It's useful
> in its own right and of course it helps users transition to the UH.  I'd like to do it as
> a wrapper to another BreakIterator -- perhaps a sentence one.  In this way you get back Passages
> that are a number of sentences so they will look nice instead of breaking mid-way through
> a sentence.  And you get some control by specifying a target number of characters.  This BreakIterator
> wouldn't be a general purpose java.text.BreakIterator since it would assume it's called in
> a manner exactly as the UnifiedHighlighter uses it.  It would probably be compatible with
> the PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your BreakIterator
> config.
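
As a rough sketch of the wrapper idea in the description above (not the attached patch): the class name LengthGoalSketch is invented, it supports only first()/next()-style forward iteration, and which methods the UnifiedHighlighter actually calls is an assumption left open here. Unsupported methods simply throw.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

/** Hypothetical wrapper: each next() jumps to the first base break at or past the length goal. */
final class LengthGoalSketch extends BreakIterator {
  private final BreakIterator base;
  private final int lengthGoal;

  LengthGoalSketch(BreakIterator base, int lengthGoal) {
    if (lengthGoal < 1) throw new IllegalArgumentException("lengthGoal must be >= 1");
    this.base = base;
    this.lengthGoal = lengthGoal;
  }

  @Override public void setText(CharacterIterator text) { base.setText(text); }
  @Override public CharacterIterator getText() { return base.getText(); }
  @Override public int current() { return base.current(); }
  @Override public int first() { return base.first(); }
  @Override public int last() { return base.last(); }

  @Override public int next() {
    int start = base.current();
    int end = base.getText().getEndIndex();
    if (start >= end) return DONE;                     // already at the end of the text
    if (lengthGoal >= end - start) return base.last(); // goal reaches the end
    // following(x) returns the first break strictly after x, i.e. >= start + lengthGoal
    return base.following(start + lengthGoal - 1);
  }

  // Not a general-purpose BreakIterator: these are deliberately unsupported in this sketch.
  @Override public int next(int n) { throw new UnsupportedOperationException(); }
  @Override public int previous() { throw new UnsupportedOperationException(); }
  @Override public int following(int offset) { throw new UnsupportedOperationException(); }

  public static void main(String[] args) {
    String text = "One. Two three four five six seven. Eight.";
    BreakIterator bi = new LengthGoalSketch(BreakIterator.getSentenceInstance(Locale.ENGLISH), 8);
    bi.setText(text);
    for (int s = bi.first(), e = bi.next(); e != BreakIterator.DONE; s = e, e = bi.next()) {
      System.out.println("[" + s + "," + e + ") " + text.substring(s, e));
    }
  }
}
```

Because the wrapper only ever stops on the base iterator's breaks, each passage is a whole number of sentences, and the character goal only decides how many sentences get grouped together.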



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

