lucene-solr-user mailing list archives

From Arturas Mazeika <maze...@gmail.com>
Subject Re: querying vs. highlighting: complete freedom?
Date Tue, 03 Apr 2018 10:56:28 GMT
Hi David,

Thanks a lot for the reply and for the effort to update the documentation
so that it reflects the question I posted here.

I've read the doc you provided. I read the updated parts as carefully as I
could and skimmed the rest of the document where it got rather detailed
(especially the parts on the original, unified and vector highlighters).
I'll have to revisit those parts as I deepen my understanding of
information retrieval and Solr in particular.

The updates are helpful and improve the document quite a bit. I also agree
that it is hard to document the problem and to give a solution to it. I see
at least two reasons why this is very challenging in this case: (i) the
document aims to cover all options and possibilities of highlighting in
Solr, and (ii) the document aims to teach the reader how to use
highlighting in Solr.

These aims conflict. If one wants to cover the options and possibilities,
one structures the content hierarchically, starting with the most basic
building blocks (jumping into details first). If one aims at usage, one
starts with the simplest possible case that illustrates highlighting,
followed by more complex use cases illustrating more sophisticated and
advanced situations (abstracting from details and focusing on the big
picture). The first type of documentation tends to be long and boring
(check out the manuals provided by Microsoft; they perfected this style of
documenting, in my opinion). The second type repeats itself constantly or
contains multiple references to other places (as every new use case is
somewhat based on the previous one).

You have sections that focus on both aspects in the documentation: some
give very simple, targeted examples of how to use Solr, and some dig into
the details. What I missed at the beginning of the documentation is the
minimal set of requirements needed for highlighting to work sensibly: I
have a feeling that one needs some of the information stored in the schema
in some form. This is of course mentioned later on in the corresponding
section, but I'd state it explicitly up front.

I still have one question that I would really like to get an answer to (it
is more about analysis and less about highlighting). My key question is:

Is there a way to "load-balance" the analyze-query-chain for the purpose of
highlighting matches? In the URL below, I need to specify a specific core.

http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
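
For illustration, a request like the one above can be assembled
programmatically instead of by hand. This is only a sketch: the helper name
build_analysis_url and its arguments are made up for this example, and
whether a collection name may stand in for the core name in the path (so
that Solr chooses the core) is exactly the open question here, not
something this code settles.

```python
from urllib.parse import urlencode


def build_analysis_url(base_url, core_or_collection, field_value, query,
                       field_type="text_en"):
    """Build a field-analysis URL with the same parameters as the one above.

    build_analysis_url is a hypothetical helper, not part of Solr or any
    client library. core_or_collection is inserted into the path as-is.
    """
    params = {
        "wt": "xml",
        "analysis.showmatch": "true",
        "analysis.fieldvalue": field_value,
        "analysis.query": query,
        "analysis.fieldtype": field_type,
    }
    # urlencode percent-encodes the values (spaces become "+").
    return f"{base_url}/{core_or_collection}/analysis/field?{urlencode(params)}"


url = build_analysis_url(
    "http://localhost:8983/solr",
    "trans_shard1_replica_n1",
    "Albert Einstein (14 March 1879 – 18 April 1955) developed the theory of relativity",
    "reletivity theory",
)
print(url)
```

Keeping the core (or collection) name as a plain argument makes it easy to
try both variants against a test installation.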


The context for this question is:

> Stefan's hint pushed me further in this direction: he suggested using the
> query part of Solr to filter and sort out the relevant answers in the 1st
> step, and in the 2nd step highlighting all the keywords using CTRL+F (in
> the browser or some alternative viewer). This brought me to the next
> question:
>
> How can one match query terms with the analyze-chained documents in an
> efficient and distributed manner? My current understanding of how to
> achieve this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the query
> 2. Use the http://localhost:8983/solr/#/trans/analysis to re-analyze the
> document and the query
> 3. Use the matching of the substrings from the original text to last
> filter/tokenizer/analyzer in the analyze-chain to map the terms of the
> query
> 4. Emulate CTRL+F highlighting
>
> The web interface of Solr offers quite a bit to advance towards this goal.
> If one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
> German-born theoretical physicist[5] who developed the theory of
> relativity, one of the two pillars of modern physics (alongside quantum
> mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of solr, one gets the steps 1-3 done:
>
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above URL, I need to
> specify a specific core. Is it possible to generalize it, so the core that
> receives the request is not necessarily the one that processes it? Or is
> this already distributed, in the sense that the receiving core and the
> processing cores are never the same?
>
> 2. The document was already analyze-chained. Is it possible to store this
> information so one does not need to re-analyze-chain it once more?
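
Steps 3 and 4 of the quoted list could look roughly like the sketch below.
It assumes that the tokens of the last stage of the index analysis chain
have already been pulled out of the analysis response into a list of dicts
with start, end and match keys. That input shape, the sample offsets and
the function names are assumptions made for this example, not a Solr API.

```python
def matched_spans(tokens):
    """Return (start, end) character offsets of tokens flagged as matches.

    tokens is a hypothetical list of dicts, one per token of the final
    analysis stage, each carrying "start", "end" and "match".
    """
    return [(t["start"], t["end"]) for t in tokens if t.get("match")]


def highlight(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of the original text in pre/post markers."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Skip spans overlapping one that was already emitted.
            continue
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "developed the theory of relativity"
# Hand-written sample tokens; the offsets refer to the original text.
tokens = [
    {"start": 0, "end": 9, "match": False},
    {"start": 14, "end": 20, "match": True},
    {"start": 24, "end": 34, "match": False},
]
print(highlight(text, matched_spans(tokens)))
```

Because the offsets point into the original text, the markers end up around
the original words even when the analyzed terms look different (stemmed,
lower-cased, umlauts folded).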

Cheers,
Arturas

On Mon, Apr 2, 2018 at 4:15 PM, David Smiley <david.w.smiley@gmail.com>
wrote:

> Hi Arturas,
>
> Both Erick and I had a go at improving the documentation here.  I hope it's
> clearer.
> https://builds.apache.org/job/Solr-reference-guide-master/javadoc/highlighting.html
> The docs for hl.fl, hl.q, hl.qparser were all updated.  The meat of the
> change was a new note in hl.fl including an example.  It's kinda hard to
> document the problem you found but I hope the note will be somewhat
> illustrative.
>
> ~ David
>
> On Mon, Mar 26, 2018 at 3:12 AM Arturas Mazeika <mazeika@gmail.com> wrote:
>
> > Hi Erick,
> >
> > Adding a field qualifier to the hl.q parameter solved the issue. My
> > excitement is through the roof! What a thorough answer: the explanation
> > of Solr's behavior and of how it tries to interpret what I mean when I
> > supply a keyword without the field qualifier. Very impressive.
> > Would you mind (re)posting this answer on Stack Overflow? If that is too
> > much of a hassle, I'll do it in a couple of days myself on your behalf.
> >
> > I am impressed by how well, thoroughly, quickly and fully the question was
> > answered.
> >
> > Stefan's hint pushed me further in this direction: he suggested using the
> > query part of Solr to filter and sort out the relevant answers in the 1st
> > step, and in the 2nd step highlighting all the keywords using CTRL+F (in
> > the browser or some alternative viewer). This brought me to the next
> > question:
> >
> > How can one match query terms with the analyze-chained documents in an
> > efficient and distributed manner? My current understanding of how to
> > achieve this is the following:
> >
> > 1. Get the list of ids (contents) of the documents that match the query
> > 2. Use the http://localhost:8983/solr/#/trans/analysis to re-analyze the
> > document and the query
> > 3. Use the matching of the substrings from the original text to last
> > filter/tokenizer/analyzer in the analyze-chain to map the terms of the
> > query
> > 4. Emulate CTRL+F highlighting
> >
> > The web interface of Solr offers quite a bit to advance towards this goal.
> > If one fires this request:
> >
> > * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
> > German-born theoretical physicist[5] who developed the theory of
> > relativity, one of the two pillars of modern physics (alongside quantum
> > mechanics).&
> > * analysis.query=reletivity theory
> >
> > to one of the cores of solr, one gets the steps 1-3 done:
> >
> >
> > http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
> >
> > Questions:
> >
> > 1. Is there a way to "load-balance" this? In the above URL, I need to
> > specify a specific core. Is it possible to generalize it, so the core that
> > receives the request is not necessarily the one that processes it? Or is
> > this already distributed, in the sense that the receiving core and the
> > processing cores are never the same?
> >
> > 2. The document was already analyze-chained. Is it possible to store this
> > information so one does not need to re-analyze-chain it once more?
> >
> > Cheers
> > Arturas
> >
> > On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > Arturas:
> > >
> > > Try to field-qualify your hl.q parameter. That looks like:
> > >
> > > hl.q=trans:Kundigung
> > > or
> > > hl.q=trans:Kündigung
> > >
> > > I saw the exact behavior you describe when I did _not_ specify the
> > > field in the hl.q parameter, i.e.
> > >
> > > hl.q=Kundigung
> > > or
> > > hl.q=Kündigung
> > >
> > > didn't show all highlights.
> > >
> > > But when I did specify the field, it worked.
> > >
> > > Here's what I think is happening: Solr uses the default search
> > > field when parsing an un-field-qualified query. I.e.
> > >
> > > q=something
> > >
> > > is parsed as
> > >
> > > q=default_search_field:something.
> > >
> > > The default field is controlled in solrconfig.xml with the "df"
> > > parameter, you'll see entries like:
> > > <str name="df">my_field</str>
> > >
> > > Also when I changed the "df" parameter to the field I was highlighting
> > > on, I didn't need to specify the field on the hl.q parameter.
> > >
> > > hl.q=Kundigung
> > > or
> > > hl.q=Kündigung
> > >
> > > The default field is usually "text", which knows nothing about
> > > the German-specific filters you've applied unless you changed it.
> > >
> > > So in the absence of a field qualification for the hl.q parameter, Solr
> > > was parsing the query according to the analysis chain specified
> > > for your default field, and probably passed ü through without
> > > transforming it. Since the indexing analysis chain for the field you
> > > highlight on folded ü to just plain u, it wasn't found or highlighted.
> > >
> > > On the surface, this does seem like something that should be
> > > changed, I'll go ahead and ping the dev list.
> > >
> > > NOTE: I was trying this on Solr 7.1
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika <mazeika@gmail.com>
> > > wrote:
> > > > > Hi Erick,
> > > > >
> > > > > Thanks for the update and the info. Your post shed quite a bit of
> > > > > light on the picture, and now I understand quite a bit more of what
> > > > > you are saying. Your explanation makes sense and can be quite useful
> > > > > in certain scenarios.
> > > > >
> > > > > What struck me in your description is that you are saying that the
> > > > > analyzer chain needs to be applied to the highlighting queries as
> > > > > well. The tragedy is that I am not able to get this for a German
> > > > > collection: if only the query is set (no explicit highlighting
> > > > > query), the highlighting is correct. It is also correct if I replace
> > > > > the umlauts with the corresponding Latin characters. Getting the
> > > > > analyzer chain applied to the highlighting terms remains the
> > > > > challenge.
> > > > >
> > > > > Could you have a look at the following Stack Overflow link? Maybe
> > > > > something comes to your mind...
> > > > >
> > > > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> > > > >
> > > > > Cheers,
> > > > > Arturas
> > > > >
> > > > > On Fri, Mar 23, 2018, 17:43 Erick Erickson <erickerickson@gmail.com>
> > > > > wrote:
> > > >
> > > >> bq: this is not a typical case that one searches for a keyword but
> > > >> highlights something else
> > > >>
> > > > >> This isn't really an unusual case; apparently I misled you.
> > > > >>
> > > > >> What I was trying to convey is that the analysis chain used is firmly
> > > > >> attached to a particular _field_. There's no way to say "use one
> > > > >> analysis chain for the query and another for highlighting on the
> > > > >> _same_ field".
> > > > >>
> > > > >> You can use two different fields with different analysis chains, one
> > > > >> for each purpose. So something like
> > > > >>
> > > > >> q=f1:something&hl.fl=f2,f3&hl.q=other
> > > > >>
> > > > >> is certainly reasonable. It'll search for "something" in f1, and
> > > > >> highlight "other" in f2 and f3.
> > > > >>
> > > > >> Each field processes its input with the analysis chain defined in the
> > > > >> schema.
> > > >>
> > > >> The rest about stored="true" can be ignored, it's just me wandering
> > > >> off into the weeds about an optimization that only stores the data
> > > >> once rather than redundantly in multiple fields.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > > >> On Fri, Mar 23, 2018 at 4:37 AM, Arturas Mazeika <mazeika@gmail.com>
> > > > >> wrote:
> > > > >> > Hi Matheis (Stefan),
> > > > >> >
> > > > >> > Thanks for the questions. They made me look at the problem from a
> > > > >> > distance and re-frame the situation. Good questions indeed.
> > > > >> >
> > > > >> > To approach it from another angle: consider a user who describes
> > > > >> > herself as a BMW fan and is convinced that every BMW needs to be the
> > > > >> > blackest color possible (for the sake of argument). She would like to
> > > > >> > search and later browse the entries in the discussion forum (not
> > > > >> > everything, of course, only BMWs of the blackest color), and what
> > > > >> > interests her are the snippets that contain "understood", "craziest"
> > > > >> > or the like as keywords (because she is looking for a dozen
> > > > >> > discussions that she saw before).
> > > > >> >
> > > > >> > What I was not able to achieve so far is: (i) combining the query
> > > > >> > terms for filtering and highlighting, and (ii) using the analyzer
> > > > >> > chain of the field to rewrite the highlight query (or defining one in
> > > > >> > the search).
> > > > >> >
> > > > >> > The CTRL+F technique is a very powerful one, indeed. It works most of
> > > > >> > the time. The difficulties with it are query rewriting, enriching,
> > > > >> > etc.
> > > > >> >
> > > > >> > Cheers,
> > > > >> > Arturas
> > > >> >
> > > > >> > On Fri, Mar 23, 2018 at 11:29 AM, Stefan Matheis <matheis.stefan@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Perhaps we try it the other way round: what's your use case for this?
> > > > >> >> I'm trying to think of a situation where I'd need this as a user.
> > > > >> >>
> > > > >> >> The only reason I see myself doing this is CTRL+F in a page when the
> > > > >> >> search result is not immediately visible to me ;)
> > > >> >>
> > > > >> >> On Mar 23, 2018 9:41 AM, "Arturas Mazeika" <mazeika@gmail.com> wrote:
> > > > >> >>
> > > > >> >> > Hi Erick et al,
> > > > >> >> >
> > > > >> >> > From your answer I understand that searching for one keyword but
> > > > >> >> > highlighting something else is not a typical case. Since we have two
> > > > >> >> > parameters (q vs. hl.q), I thought they were freely combinable. From
> > > > >> >> > your answer I understand that this is not really the case. My current
> > > > >> >> > understanding came from [1], which says:
> > > > >> >> >
> > > > >> >> > hl.q
> > > > >> >> >
> > > > >> >> > A query to use for highlighting. This parameter allows you to
> > > > >> >> > highlight different terms than those being used to retrieve documents.
> > > > >> >> >
> > > > >> >> > What I hear from you is something different, i.e., that it is not
> > > > >> >> > enough just to combine q with hl.q, and that there are caveats to
> > > > >> >> > achieving the task (multiple fields, FastVectorHighlighter).
> > > > >> >> >
> > > > >> >> > Your info is very helpful.
> > > > >> >> >
> > > > >> >> > Cheers,
> > > > >> >> > Arturas
> > > > >> >> >
> > > > >> >> > [1] https://lucene.apache.org/solr/guide/7_2/highlighting.html
> > > >> >> >
> > > > >> >> > On Thu, Mar 22, 2018 at 4:07 PM, Erick Erickson <erickerickson@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> >
> > > > >> >> > > Basically you need to use a copyField, but in several variants:
> > > > >> >> > >
> > > > >> >> > > If you use the field _exclusively_ for highlighting, then store the
> > > > >> >> > > raw content there and have the field use whatever analyzer you want.
> > > > >> >> > > You do _not_ need to have indexed="true" set for the field if you're
> > > > >> >> > > highlighting on the fly. So you're searching against field1 (which
> > > > >> >> > > has indexed="true" stored="false" set) but highlighting against
> > > > >> >> > > field2 (which has indexed="false" stored="true" set). Of course any
> > > > >> >> > > time you want to return the contents in a doc your fl needs to
> > > > >> >> > > specify field2...
> > > > >> >> > >
> > > > >> >> > > The above does not bloat your index at all since the cost of
> > > > >> >> > > stored="true" indexed="true" is the same as if you use two fields,
> > > > >> >> > > each with only one option turned on.
> > > > >> >> > >
> > > > >> >> > > The second approach, if you want to use FastVectorHighlighter or the
> > > > >> >> > > like, is simply to index both fields.
> > > > >> >> > >
> > > > >> >> > > Best,
> > > > >> >> > > Erick
> > > >> >> > >
> > > > >> >> > > On Thu, Mar 22, 2018 at 2:18 AM, Arturas Mazeika <mazeika@gmail.com>
> > > > >> >> > > wrote:
> > > > >> >> > > > Hi Solr-Users,
> > > > >> >> > > >
> > > > >> >> > > > I've been playing with a German collection of documents, where I
> > > > >> >> > > > tried to search for one word (q=Tag) and highlight another
> > > > >> >> > > > (hl.q=Kundigung). Is this a "legal" use case? My key question is:
> > > > >> >> > > > how can I tell Solr which query analyzer to use for highlighting?
> > > > >> >> > > > Strictly speaking, I should use hl.q=Kündigung to conceptually look
> > > > >> >> > > > for relevant information, but in this case no highlighting is
> > > > >> >> > > > returned (as all umlauts are left out in the index).
> > > > >> >> > > >
> > > > >> >> > > > Additional info:
> > > > >> >> > > >
> > > > >> >> > > > Solr version: 7.2
> > > > >> >> > > > URLs to query:
> > > > >> >> > > >
> > > > >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
> > > > >> >> > > >
> > > > >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.snippets=3&wt=xml&rows=1
> > > > >> >> > > >
> > > > >> >> > > > Managed-schema:
> > > > >> >> > > >
> > > > >> >> > > >   <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
> > > > >> >> > > >     <analyzer>
> > > > >> >> > > >       <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >> >> > > >       <filter class="solr.LowerCaseFilterFactory"/>
> > > > >> >> > > >       <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
> > > > >> >> > > >       <filter class="solr.GermanNormalizationFilterFactory"/>
> > > > >> >> > > >       <filter class="solr.GermanLightStemFilterFactory"/>
> > > > >> >> > > >     </analyzer>
> > > > >> >> > > >   </fieldType>
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > Other additional info:
> > > > >> >> > > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> > > > >> >> > > >
> > > > >> >> > > > Cheers,
> > > > >> >> > > > Arturas
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
