lucene-dev mailing list archives

From David Smiley <david.w.smi...@gmail.com>
Subject Re: UnifiedHighlighter and extraction of exact hit offset ranges
Date Wed, 11 Jan 2017 16:46:41 GMT
Dawid,

I'm guessing what you're seeing is from browsing the 6.3 code. The
extensibility has been improved and committed for 6.4; see CHANGES.txt and
LUCENE-7559 which did it.  In particular, all Passage methods are now
public.

I agree that OffsetsEnum methods should be public so that someone could
override FieldHighlighter#highlightOffsetsEnums usefully. This is an
oversight; good catch!  We should further
enhance TestUnifiedHighlighterExtensibility to help us check for this.
I'll file an issue.  Come to think of it... one could argue LUCENE-7559
isn't really done as its scope should have included OffsetsEnum methods.

*Jim:* can I change some visibility there to get this into 6.4 as part
of the same issue?  Very low risk of course.  If not, no big deal.

~ David

On Wed, Jan 11, 2017 at 8:37 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Thanks David!
>
> That's almost exactly what I ended up doing. I don't mind casting
> Object to my own type; you can always make it a covariant override in
> your subclass (which you have to do to access those expert-level
> methods anyway).
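
The covariant override mentioned above can be sketched in plain Java. This is a minimal illustration, not Lucene code: PassageStub, FormatterStub and OffsetsFormatter are hypothetical stand-ins for Passage, PassageFormatter and a custom subclass, and only the match offsets are modeled. It assumes the base formatter declares an Object return type, as described in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a passage: only the match offsets are modeled.
final class PassageStub {
    final int[] matchStarts;
    final int[] matchEnds;

    PassageStub(int[] matchStarts, int[] matchEnds) {
        this.matchStarts = matchStarts;
        this.matchEnds = matchEnds;
    }
}

// Hypothetical stand-in for a formatter whose format method returns Object.
abstract class FormatterStub {
    abstract Object format(PassageStub[] passages, String content);
}

// Covariant override: the subclass narrows the return type from Object
// to List<int[]>, so callers holding the subclass type need no cast.
final class OffsetsFormatter extends FormatterStub {
    @Override
    List<int[]> format(PassageStub[] passages, String content) {
        List<int[]> ranges = new ArrayList<>();
        for (PassageStub p : passages) {
            for (int i = 0; i < p.matchStarts.length; i++) {
                // Each entry is a {start, end} pair for one match.
                ranges.add(new int[] {p.matchStarts[i], p.matchEnds[i]});
            }
        }
        return ranges;
    }
}

class CovariantDemo {
    public static void main(String[] args) {
        PassageStub passage = new PassageStub(new int[] {4, 20}, new int[] {9, 26});
        OffsetsFormatter formatter = new OffsetsFormatter();
        List<int[]> ranges = formatter.format(new PassageStub[] {passage}, "ignored");
        for (int[] r : ranges) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Code that only sees the base type still receives an Object and has to cast; the narrowing helps only where the subclass type is known statically.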
>
> I still kind of think startOffset/endOffset and other related methods
> could be made public to allow tinkering with them in
> FieldHighlighter#highlightOffsetsEnums (otherwise this method is
> protected for overriding, but useless in practice).
>
> There is another API problem I found too. If you wish to override
> FieldHighlighter.getSummaryPassagesNoHighlight you can't return
> anything sensible because Passage is final, contains only
> package-private fields and addMatch is package-private too. So you
> can't create a "custom" passage.
>
> I can file an issue and provide a patch if these changes are not
> against the design of the unified highlighter?
>
> Dawid
>
> On Wed, Jan 11, 2017 at 2:24 PM, David Smiley <david.w.smiley@gmail.com>
> wrote:
> > Hi Dawid,
> >
> > You could write a trivial PassageFormatter that simply returns the Passage
> > list instead of doing formatting.  Passages contain offsets. And yes, use
> > WholeBreakIterator if you don't need passage fragmentation. Unless I'm
> > missing some aspect of your requirements, this doesn't involve any internal
> > highlighter customizing.  Perhaps Javadocs could be improved to make this
> > more clear... and perhaps this Passage-returning PassageFormatter could be
> > included to clarify how it's done.  I recall doing or seeing this a few
> > months ago but I'm not sure.
> >
> > One ugly aspect of the API (shared with its PostingsHighlighter lineage)
> > related to this discussion is that the PassageFormatter is declared to
> > return Object.  It's kinda hard to rectify it to be typed, perhaps with
> > generics, while also not spilling lots of generics to other places (the UH
> > itself) just because of this.  Perhaps UH.highlightFieldsAsObjects() could
> > be modified to take a Class to thus provide a type for the output... and
> > maybe the PassageFormatter could declare not only with generics but with a
> > method what types of results it produces.  I'm curious what you think.
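
One way to picture the idea above (a generic formatter that also reports its result type, plus a highlight method taking a Class) is the sketch below. It is purely illustrative and is not the Lucene API: TypedFormatter, SnippetFormatter, HighlighterStub, resultType() and highlightAs() are all hypothetical names, and the formatter takes raw offset arrays rather than Passage objects to keep the example self-contained.

```java
// Hypothetical typed formatter: T is the result type, and resultType()
// exposes that type at runtime so a caller can check it.
interface TypedFormatter<T> {
    T format(int[] matchStarts, int[] matchEnds, String content);

    Class<T> resultType();
}

// Example formatter that joins the matched substrings into one String.
final class SnippetFormatter implements TypedFormatter<String> {
    @Override
    public String format(int[] matchStarts, int[] matchEnds, String content) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < matchStarts.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(content, matchStarts[i], matchEnds[i]);
        }
        return sb.toString();
    }

    @Override
    public Class<String> resultType() {
        return String.class;
    }
}

// Hypothetical highlighter that stays non-generic; only the typed entry
// point takes a Class token, so generics do not spread to the class itself.
final class HighlighterStub {
    private final TypedFormatter<?> formatter;

    HighlighterStub(TypedFormatter<?> formatter) {
        this.formatter = formatter;
    }

    // Untyped path, analogous to a method that returns Object.
    Object highlightAsObject(int[] starts, int[] ends, String content) {
        return formatter.format(starts, ends, content);
    }

    // Typed path: fails fast if the formatter cannot produce the requested type.
    <T> T highlightAs(Class<T> type, int[] starts, int[] ends, String content) {
        if (!type.isAssignableFrom(formatter.resultType())) {
            throw new ClassCastException(
                "formatter produces " + formatter.resultType().getName()
                    + ", not " + type.getName());
        }
        return type.cast(highlightAsObject(starts, ends, content));
    }
}
```

A call such as `new HighlighterStub(new SnippetFormatter()).highlightAs(String.class, starts, ends, text)` then returns a String without a cast at the call site, while a mismatched Class is rejected before any formatting work is done.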
> >
> > ~ David
> >
> >
> > On Wed, Jan 11, 2017 at 6:02 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
> >>
> >> To follow-up: I hacked into the offsets by passing WholeBreakIterator
> >> and a custom PassageFormatter that just returns the matches from the
> >> singleton resulting passage. This is suboptimal though, as there's
> >> still some complex logic going on in highlightOffsetsEnums that could
> >> be avoided.
> >>
> >> Dawid
> >>
> >> On Wed, Jan 11, 2017 at 11:34 AM, Dawid Weiss <dawid.weiss@gmail.com>
> >> wrote:
> >> > Can any of the folks who contributed to UnifiedHighlighter (David?)
> >> > clarify my thinking here?
> >> >
> >> > I have a requirement to extract (for a set of search results) a list
> >> > of exact "hit" ranges (field offsets, with support for multi-term
> >> > queries and span queries). Obviously, I'm only talking about queries
> >> > that relate to field content somehow, but this has always been quite
> >> > problematic and required the use of multiple helper classes
> >> > (WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty
> >> > hairy logic.
> >> >
> >> > So I turned to look at UnifiedHighlighter for help.
> >> >
> >> > Seems like the right way (?) to do it would be to override (and abuse)
> >> > UnifiedHighlighter's getFieldHighlighter method and return a field
> >> > highlighter with an override of:
> >> >
> >> > protected Passage[] highlightOffsetsEnums(List<OffsetsEnum> offsetsEnums) throws IOException {
> >> >
> >> > so that I can capture and return a separate Passage for each
> >> > OffsetsEnum (I have my own code to deal with overlaps and merging, so
> >> > I can skip this entirely). Then, with a custom no-op PassageFormatter
> >> > I could simply get a list of those offsets.
> >> >
> >> > The problem with this approach is that there is currently no way to
> >> > access offsets in OffsetsEnum: everything else is protected (so
> >> > subclassable), but OffsetsEnum's methods are package-private.
> >> > Namely these two:
> >> >
> >> >   int startOffset() throws IOException {
> >> >     return postingsEnum.startOffset();
> >> >   }
> >> >
> >> >   int endOffset() throws IOException {
> >> >     return postingsEnum.endOffset();
> >> >   }
> >> >
> >> > Should these two be protected to allow such customizations (I agree
> >> > it's *very* low-level, but I have a practical use case where this
> >> > would be useful)?
> >> >
> >> > Am I on the right track here?
> >> >
> >> > Separately from that, I think it'd be nice to have some sort of
> >> > generic utility that, for a given document (or a set of documents)
> >> > would return such hit ranges... UnifiedHighlighter seems
> >> >
> >> > Dawid
> >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
