lucene-solr-user mailing list archives

From David Smiley <david.w.smi...@gmail.com>
Subject Re: querying vs. highlighting: complete freedom?
Date Mon, 02 Apr 2018 14:15:12 GMT
Hi Arturas,

Both Erick and I had a go at improving the documentation here.  I hope it's
clearer.
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/highlighting.html
The docs for hl.fl, hl.q, hl.qparser were all updated.  The meat of the
change was a new note in hl.fl including an example.  It's kinda hard to
document the problem you found but I hope the note will be somewhat
illustrative.

~ David

On Mon, Mar 26, 2018 at 3:12 AM Arturas Mazeika <mazeika@gmail.com> wrote:

> Hi Erick,
>
> Adding a field qualifier to the hl.q parameter solved the issue. My
> excitement is going through the roof! What a thorough answer: the
> explanation of how Solr behaves and how it tries to interpret what I
> mean when I supply a keyword without the field qualifier. Very impressive.
> Would you care to (re)post this answer to Stack Overflow? If that is too
> much of a hassle, I'll do this in a couple of days myself on your behalf.
>
> I am impressed by how well, thoroughly, and quickly the question was
> answered.
>
> Stefan's hint pushed me further in this direction: he suggested using the
> query part of Solr to filter and sort out the relevant answers in the 1st
> step, and in the 2nd step he'd highlight all the keywords using CTRL+F (in
> the browser or some alternative viewer). This brought me to the next
> question:
>
> How can one match query terms against the analyzed form of the documents
> in an efficient and distributed manner? My current understanding of how to
> achieve this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the query
> 2. Use the http://localhost:8983/solr/#/trans/analysis to re-analyze the
> document and the query
> 3. Use the output of the last filter/tokenizer/analyzer in the analysis
> chain to map the terms of the query to substrings of the original text
> 4. Emulate CTRL+F highlighting
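
Step 4 could be sketched roughly as follows. This is only an illustration: it assumes each analyzed token is available as a dictionary with `start`, `end`, and `match` keys (character offsets into the original text plus a flag saying whether the token matched the query). Those key names, the `highlight` helper, and the sample tokens are assumptions made for this sketch, not something taken from the thread.

```python
def highlight(text, tokens, pre="<em>", post="</em>"):
    """Wrap every token flagged as a match in pre/post markers."""
    # Collect the character spans of matching tokens, de-duplicated and ordered.
    spans = sorted({(t["start"], t["end"]) for t in tokens if t.get("match")})
    out, cursor = [], 0
    for start, end in spans:
        if start < cursor:
            # Skip spans that overlap one already emitted.
            continue
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "developed the theory of relativity"
# Hypothetical analysis output: the token texts are illustrative only.
tokens = [
    {"text": "develop", "start": 0, "end": 9, "match": False},
    {"text": "theori", "start": 14, "end": 20, "match": True},
    {"text": "relativ", "start": 24, "end": 34, "match": True},
]
print(highlight(text, tokens))
# developed the <em>theory</em> of <em>relativity</em>
```

Because the markers are applied to the original text through offsets, the stemmed or folded token text never has to be searched for in the document.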
>
> The web interface of Solr goes a long way towards this goal. If
> one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
> German-born theoretical physicist[5] who developed the theory of
> relativity, one of the two pillars of modern physics (alongside quantum
> mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of solr, one gets the steps 1-3 done:
>
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
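
A request like the one above does not have to be percent-encoded by hand. The following is a minimal sketch that builds the same kind of URL with the Python standard library; the `analysis_url` helper name is made up for this example, and the core URL and field type are simply the ones from the request above.

```python
from urllib.parse import quote, urlencode


def analysis_url(core_url, field_type, field_value, query):
    """Build a field-analysis request URL for one core (hypothetical helper)."""
    params = {
        "wt": "xml",
        "analysis.showmatch": "true",
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": field_value,
        "analysis.query": query,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{core_url}/analysis/field?{urlencode(params, quote_via=quote)}"


url = analysis_url(
    "http://localhost:8983/solr/trans_shard1_replica_n1",
    "text_en",
    "developed the theory of relativity",
    "reletivity theory",
)
print(url)
```

Non-ASCII characters such as the en dash in the original field value are encoded as UTF-8 bytes by `urlencode`, so the full sentence can be passed unchanged.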
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above URL, I need to
> specify a specific core. Is it possible to generalize it, so that the core
> that receives the request is not necessarily the one that processes it? Or
> is this already distributed in the sense that the receiving core and the
> processing cores are never the same?
>
> 2. The document has already been run through the analysis chain. Is it
> possible to store this information so that one does not need to re-analyze it?
>
> Cheers
> Arturas
>
> On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > Arturas:
> >
> > Try to field-qualify your hl.q parameter. That looks like:
> >
> > hl.q=trans:Kundigung
> > or
> > hl.q=trans:Kündigung
> >
> > I saw the exact behavior you describe when I did _not_ specify the
> > field in the hl.q parameter, i.e.
> >
> > hl.q=Kundigung
> > or
> > hl.q=Kündigung
> >
> > didn't show all highlights.
> >
> > But when I did specify the field, it worked.
> >
> > Here's what I think is happening: Solr uses the default search
> > field when parsing an un-field-qualified query. I.e.
> >
> > q=something
> >
> > is parsed as
> >
> > q=default_search_field:something.
> >
> > The default field is controlled in solrconfig.xml with the "df"
> > parameter, you'll see entries like:
> > <str name="df">my_field</str>
> >
> > Also when I changed the "df" parameter to the field I was highlighting
> > on, I didn't need to specify the field on the hl.q parameter.
> >
> > hl.q=Kundigung
> > or
> > hl.q=Kündigung
> >
> > The default field is usually "text", which knows nothing about
> > the German-specific filters you've applied unless you changed it.
> >
> > So in the absence of a field-qualification for the hl.q parameter Solr
> > was parsing the query according to the analysis chain specified
> > in your default field, and probably passed ü through without
> > transforming it. Since your indexing analysis chain for the highlighted
> > field folded ü to just plain u, it wasn't found or highlighted.
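
The workaround described above (field-qualifying hl.q) can be sketched as a small URL builder. It assumes the collection and field are both named `trans`, as in the original question; `highlight_url` is a hypothetical helper for illustration and not part of Solr or of any client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def highlight_url(base, field, query, hl_term):
    """Build a /select URL whose hl.q is qualified with the highlighted field."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": field,
        # Qualifying the term means it is analyzed with this field's chain
        # rather than with the chain of the default search field (df).
        "hl.q": f"{field}:{hl_term}",
        "hl.snippets": 3,
        "wt": "xml",
        "rows": 1,
    }
    return f"{base}/select?{urlencode(params)}"


url = highlight_url("http://localhost:8983/solr/trans", "trans", "trans:Zeit", "Kündigung")
print(url)
# Round-trip check: the decoded parameter is the field-qualified term.
print(parse_qs(urlsplit(url).query)["hl.q"])
```

Setting df to the highlighted field, as mentioned earlier in this message, would be the alternative when the request itself cannot be changed.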
> >
> > On the surface, this does seem like something that should be
> > changed, I'll go ahead and ping the dev list.
> >
> > NOTE: I was trying this on Solr 7.1
> >
> > Best,
> > Erick
> >
> > On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika <mazeika@gmail.com>
> > wrote:
> > > Hi Erick,
> > >
> > > Thanks for the update and the info. Your post shed quite a bit of
> > > light on the picture, and now I understand quite a bit more about what
> > > you are saying. Your explanation makes sense and can be quite useful in
> > > certain scenarios.
> > >
> > > What struck me in your description is that you are saying that the
> > > analyzer chain needs to be applied to the highlighting queries as well.
> > > The tragedy is that I am not able to get this for a German collection:
> > > if only the query is set (no explicit highlighting query), the
> > > highlighting is correct. It is also correct if I replace the umlauts
> > > with the corresponding Latin characters. Getting the analyzer chain
> > > applied to the highlighting terms remains the challenge.
> > >
> > > Could you have a look at the following Stack Overflow link? Maybe
> > > something comes to your mind...
> > >
> > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> > >
> > > Cheers,
> > >
> > > Arturas
> > >
> > > On Fri, Mar 23, 2018, 17:43 Erick Erickson <erickerickson@gmail.com>
> > > wrote:
> > >
> > >> bq: this is not a typical case that one searches for a keyword but
> > >> highlights something else
> > >>
> > >> This isn't really an unusual case; apparently I misled you.
> > >>
> > >> What I was trying to convey is that the analysis chain used is firmly
> > >> attached to a particular _field_. There's no way to say "use one
> > >> analysis chain for the query and another for highlighting on the
> > >> _same_ field".
> > >>
> > >> You can use two different fields with different analysis chains, one
> > >> for each purpose. So something like
> > >>
> > >> q=f1:something&hl.fl=f2,f3&hl.q=other
> > >>
> > >> is certainly reasonable. It'll search for "something" in f1, and
> > >> highlight "other" in f2 and f3
> > >>
> > >> Each field processes its input with the analysis chain defined in the
> > >> schema.
> > >>
> > >> The rest about stored="true" can be ignored, it's just me wandering
> > >> off into the weeds about an optimization that only stores the data
> > >> once rather than redundantly in multiple fields.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Fri, Mar 23, 2018 at 4:37 AM, Arturas Mazeika <mazeika@gmail.com>
> > >> wrote:
> > >> > Hi Mathesis (Stefan),
> > >> >
> > >> > Thanks for the questions. This made me look at the problem from a
> > >> > distance and re-frame the situation. Good questions indeed.
> > >> >
> > >> > To approach it from another angle: consider a user who describes
> > >> > herself as a BMW fan, convinced that every BMW needs to be the
> > >> > blackest color possible (for the sake of argument). She would like to
> > >> > search and later browse the entries in the discussion forum (not
> > >> > everything, of course, only BMWs of the blackest color), and what
> > >> > interests her are the snippets that contain "understood", "craziest",
> > >> > or the like as keywords (because she is looking for a dozen
> > >> > discussions that she saw before).
> > >> >
> > >> > What I have not been able to achieve so far is: (i) combining one
> > >> > query term for filtering with another for highlighting, (ii) using
> > >> > the analyzer chain of the field to rewrite the highlight query (or
> > >> > defining one in the search).
> > >> >
> > >> > The CTRL+F technique is a very powerful one, indeed. It works most of
> > >> > the time. The difficulties with it are query rewriting, enriching, etc.
> > >> >
> > >> > Cheers,
> > >> > Arturas
> > >> >
> > >> > On Fri, Mar 23, 2018 at 11:29 AM, Stefan Matheis <matheis.stefan@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Perhaps we try it the other way round .. what's your use case for
> > >> >> this? I'm trying to think of a situation where I'd need this as a
> > >> >> user?
> > >> >>
> > >> >> The only reason I see myself doing this is CTRL+F in a page when the
> > >> >> search result is not immediately visible for me ;)
> > >> >>
> > >> >> On Mar 23, 2018 9:41 AM, "Arturas Mazeika" <mazeika@gmail.com> wrote:
> > >> >>
> > >> >> > Hi Erick et al,
> > >> >> >
> > >> >> > From your answer I understand that it is not a typical case that
> > >> >> > one searches for a keyword but highlights something else. Since we
> > >> >> > have two parameters (q vs hl.q), I thought they were freely
> > >> >> > combinable. From your answer I understand that this is not really
> > >> >> > the case. My current understanding came from [1], which says:
> > >> >> >
> > >> >> > hl.q
> > >> >> >
> > >> >> > A query to use for highlighting. This parameter allows you to
> > >> >> > highlight different terms than those being used to retrieve
> > >> >> > documents.
> > >> >> >
> > >> >> > What I hear from you is something different: i.e., that it is not
> > >> >> > enough just to combine q with hl.q, and that there are caveats to
> > >> >> > achieving the task (multiple fields, FastVectorHighlighter).
> > >> >> >
> > >> >> > Your info is very helpful.
> > >> >> >
> > >> >> > Cheers,
> > >> >> > Arturas
> > >> >> >
> > >> >> > [1]  https://lucene.apache.org/solr/guide/7_2/highlighting.html
> > >> >> >
> > >> >> > On Thu, Mar 22, 2018 at 4:07 PM, Erick Erickson <erickerickson@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> > > Basically you need to use a copyField, but in several variants:
> > >> >> > >
> > >> >> > > If you use the field _exclusively_ for highlighting then store the
> > >> >> > > raw content there and have the field use whatever analyzer you want.
> > >> >> > > You do _not_ need to have indexed="true" set for the field if you're
> > >> >> > > highlighting on the fly. So you're searching against field1 (which
> > >> >> > > has indexed="true" stored="false" set) but highlighting against
> > >> >> > > field2 (which has indexed="false" stored="true" set). Of course any
> > >> >> > > time you want to return the contents in a doc your fl needs to
> > >> >> > > specify field2...
> > >> >> > >
> > >> >> > > The above does not bloat your index at all since the cost of
> > >> >> > > stored="true" indexed="true" is the same as if you use two fields,
> > >> >> > > each with only one option turned on.
> > >> >> > >
> > >> >> > > The second approach if you want to use FastVectorHighlighter or the
> > >> >> > > like is simply to index both fields.
> > >> >> > >
> > >> >> > > Best,
> > >> >> > > Erick
> > >> >> > >
> > >> >> > > Best,
> > >> >> > > Erick
> > >> >> > >
> > >> >> > > On Thu, Mar 22, 2018 at 2:18 AM, Arturas Mazeika <mazeika@gmail.com>
> > >> >> > > wrote:
> > >> >> > > > Hi Solr-Users,
> > >> >> > > >
> > >> >> > > > I've been playing with a German collection of documents, where I
> > >> >> > > > tried to search for one word (q=Tag) and highlight another
> > >> >> > > > (hl.q=Kundigung). Is this a "legal" use case? My key question is:
> > >> >> > > > how can I tell Solr which query analyzer to use for highlighting?
> > >> >> > > > Strictly speaking, I should use hl.q=Kündigung to conceptually look
> > >> >> > > > for relevant information, but in this case no highlighting is
> > >> >> > > > returned (as all umlauts are left out in the index).
> > >> >> > > >
> > >> >> > > > Additional info:
> > >> >> > > >
> > >> >> > > > solr version: 7.2
> > >> >> > > > urls to query:
> > >> >> > > >
> > >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
> > >> >> > > >
> > >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.snippets=3&wt=xml&rows=1
> > >> >> > > >
> > >> >> > > > Managed-schema:
> > >> >> > > >
> > >> >> > > >   <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
> > >> >> > > >     <analyzer>
> > >> >> > > >       <tokenizer class="solr.StandardTokenizerFactory"/>
> > >> >> > > >       <filter class="solr.LowerCaseFilterFactory"/>
> > >> >> > > >       <filter class="solr.StopFilterFactory" format="snowball"
> > >> >> > > >               words="lang/stopwords_de.txt" ignoreCase="true"/>
> > >> >> > > >       <filter class="solr.GermanNormalizationFilterFactory"/>
> > >> >> > > >       <filter class="solr.GermanLightStemFilterFactory"/>
> > >> >> > > >     </analyzer>
> > >> >> > > >   </fieldType>
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > Other additional info:
> > >> >> > > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> > >> >> > > >
> > >> >> > > > Cheers,
> > >> >> > > > Arturas
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
