lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
Date Sat, 07 Jan 2017 02:21:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806526#comment-15806526
] 

David Smiley edited comment on LUCENE-7620 at 1/7/17 2:21 AM:
--------------------------------------------------------------

bq. (Tim) For the following method, does it make sense to return the baseIter if the followingIdx
< startIndex? Maybe throw an exception instead or just have an assert that it's less?

It's already asserting (line 124).  Or I'm not understanding you.

bq.  (Tim) This is subjective, but I find it's more useful to break out the different tests
with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal,
breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back
to the test and helps tease apart if one piece of functionality is broken vs another.

Fair point.  A better compromise in my mind that is not as verbose as your suggestion is to
use the "message" parameter of the assert methods.  I will do this and upload a new patch
tonight.

bq.  (Jim) ... but I wonder if the logic to get the boundary could not be simplified. Isn't
it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize return
current() or baseIter.next() ?

No; I don't think so. If one looks at {{preceding(target)}}, you don't know if it's result
is closer to the target than the following break or not.  The "target" mode of this BI gets
the _closest_ break.  Come to think of it, maybe I should rename {{createTargetLength}} to
be {{createClosestToLength}}.  At least it's javadocs are already clear I think?


was (Author: dsmiley):
bq. (Tim) For the following method, does it make sense to return the baseIter if the followingIdx
< startIndex? Maybe throw an exception instead or just have an assert that it's less?

It's already asserting (line 124).  Or I'm not understanding you.

bq.  (Tim) This is subjective, but I find it's more useful to break out the different tests
with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal,
breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back
to the test and helps tease apart if one piece of functionality is broken vs another.

Fair point.  A better compromise in my mind that is not as verbose as your suggestion is to
use the "message" parameter of the assert methods.  I will do this and upload a new patch
tonight.

bq.  (Jim) ... but I wonder if the logic to get the boundary could not be simplified.
Isn't it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize
return current() or baseIter.next() ?

No; I don't think so. If one looks at {{preceding(target)}}, you don't know if it's result
is closer to the target than the following break or not.  The "target" mode of this BI gets
the _closest_ break.  Come to think of it, maybe I should rename {{createTargetLength}} to
be {{createClosestToLength}}.  At least it's javadocs are already clear I think?

> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
>                 Key: LUCENE-7620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 6.4
>
>         Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates fragments (aka
Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  It's useful
in its own right and of course it helps users transition to the UH.  I'd like to do it as
a wrapper to another BreakIterator -- perhaps a sentence one.  In this way you get back Passages
that are a number of sentences so they will look nice instead of breaking mid-way through
a sentence.  And you get some control by specifying a target number of characters.  This BreakIterator
wouldn't be a general purpose java.text.BreakIterator since it would assume it's called in
a manner exactly as the UnifiedHighlighter uses it.  It would probably be compatible with
the PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your BreakIterator
config.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message