lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Smiley <david.w.smi...@gmail.com>
Subject Re: UnifiedHighlighter and extraction of exact hit offset ranges
Date Wed, 11 Jan 2017 18:19:24 GMT
If the generics could be contained _instead of_ spreading to the UH class
itself (making UH typed), I think it could be nice.  But given the
per-field possible settings for formatting... that in particular makes
balancing these concerns hard.  I guess in the end Object isn't too bad
since it's limited to the advanced method use-case
(highlightFieldsAsObjects).

On Wed, Jan 11, 2017 at 8:45 AM Timothy Rodriguez (BLOOMBERG/ 120 PARK) <
trodriguez25@bloomberg.net> wrote:

> While we were open sourcing it. I had tried creating a patch to generify
> it, but the generics did wind up all over the place. Ultimately the
> UnifiedHighlighter would need to be generic itself so it can ensure the
> passage formatters etc are of the same type. (Or alternatively, generic
> passage formatters are passed in per request.) I wound up dumping the
> changes because they were quite substantial and they'd also push it further
> from the PostingsHighlighjer.
>
> I'm hoping to get back to trying that again in he future. It'd be nice to
> have a PassageFormatter<MyCustomObject>.
>
> -Tim
>
> Sent from Bloomberg Professional for iPhone
>
>
> ----- Original Message -----
> From: Dawid Weiss <dawid.weiss@gmail.com>
> To: david.w.smiley@gmail.com
> CC: TIMOTHY RODRIGUEZ, dev@lucene.apache.org
> At: 11-Jan-2017 08:37:37
>
> Thanks David!
>
> That's almost exactly what I ended up doing. I don't mind casting
> Object to my own type; you can always make it a covariant override in
> your subclass (which you have to do to access those expert-level
> methods anyway).
>
> I still kind of think startOffset/endOffset and other related methods
> could be made public to allow tinkering with them in
> FieldHighlighter#highlightOffsetsEnums (otherwise this method is
> protected for overriding, but useless in practice).
>
> There is another API problem I found too. If you wish to override
> FieldHighlighter.getSummaryPassagesNoHighlight you can't return
> anything sensible because Passage is final, contains only
> package-private fields and addMatch is package-private too. So you
> can't create a "custom" passage.
>
> I can file an issue and provide a patch if these changes are not
> against the design of the unified highlighter?
>
> Dawid
>
> On Wed, Jan 11, 2017 at 2:24 PM, David Smiley <david.w.smiley@gmail.com>
> wrote:
> > Hi Dawid,
> >
> > You could write a trivial PassageFormatter that simply returns the
> Passage
> > list instead of doing formatting. Passages contain offsets. And yes,
> > WholeBreakIterator if you don't need passage fragmentation. Unless I'm
> > missing some aspect of your requirements, this doesn't involve any
> internal
> > highlighter customizing. Perhaps Javadocs could be improved to make this
> > more clear... and perhaps this Passage-returning PassageFormatter could
> be
> > included to clarify how it's done. I recall doing or seeing this recently
> > months ago but I'm not sure.
> >
> > One ugly aspect of the API (shared with it's PostingsHighlighter lineage)
> > related to this discussion is that the PassageFormatter is declared to
> > return Object. It's kinda hard to rectify it to be typed, perhaps with
> > generics, while also not spilling lots of generics to other places (the
> UH
> > itself) just because of this. Perhaps UH.highlightFieldsAsObjects()
> could
> > be modified to take a Class to thus provide a type for the output... and
> > maybe the PassageFormatter could declare not only with generics but with
> a
> > method what types of results it produces. I'm curious what you think.
> >
> > ~ David
> >
> >
> > On Wed, Jan 11, 2017 at 6:02 AM Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
> >>
> >> To follow-up: I hacked into the offsets by passing WholeBreakIterator
> >> and a custom PassageFormatter that just returns the matches from the
> >> singleton resulting passage. This is suboptimal though, as there's
> >> still some complex logic going on in highlightOffsetsEnums that could
> >> be avoided.
> >>
> >> Dawid
> >>
> >> On Wed, Jan 11, 2017 at 11:34 AM, Dawid Weiss <dawid.weiss@gmail.com>
> >> wrote:
> >> > Can any of the folks who contributed to UnifiedHighlighter (David?)
> >> > clarify my thinking here?
> >> >
> >> > I have a requirement to extract (for a set of search results) a list
> >> > of exact "hit" ranges (field offsets, with support for multi-term
> >> > queries and span queries). Obviously, I'm only talking about queries
> >> > that relate to field content somehow, but this has always been quite
> >> > problematic and required the use of multiple helper classes
> >> > (WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty
> >> > hairy logic.
> >> >
> >> > So I turned to look at UnifiedHighlighter for help.
> >> >
> >> > Seems like the right way (?) to do it would be to override (and abuse)
> >> > UnifiedHighlighter's getFieldHighlighter method and return a field
> >> > highlighter with an override of:
> >> >
> >> > protected Passage[] highlightOffsetsEnums(List<OffsetsEnum>
> >> > offsetsEnums) throws IOException {
> >> >
> >> > so that I can capture and return a separate Passage for each
> >> > OffsetsEnum (I have my own code to deal with overlaps and merging, so
> >> > I can skip this entirely). Then, with a custom no-op PassageFormatter
> >> > I could simply get a list of those offsets.
> >> >
> >> > The problem with this approach is that there is currently no way to
> >> > access offsets in OffsetsEnum -- everything is protected (so
> >> > subclassable), but OffsetsEnum are closed to package-private scope.
> >> > Namely these two:
> >> >
> >> > int startOffset() throws IOException {
> >> > return postingsEnum.startOffset();
> >> > }
> >> >
> >> > int endOffset() throws IOException {
> >> > return postingsEnum.endOffset();
> >> > }
> >> >
> >> > Should these two be protected to allow such customizations (I agree
> >> > it's *very* low-level, but I have a practical use case where this
> >> > would be useful).
> >> >
> >> > Am I on the right track here?
> >> >
> >> > Separately from that, I think it'd be nice to have some sort of
> >> > generic utility that, for a given document (or a set of documents)
> >> > would return such hit ranges... UnifiedHighlighter seems
> >> >
> >> > Dawid
> >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Mime
View raw message