lucene-solr-user mailing list archives

From Susheel Kumar <susheel2...@gmail.com>
Subject Re: indexing two words, searching single word
Date Fri, 03 Aug 2018 12:40:28 GMT
And, as you suggested, apply the stop-word filter before the shingle filter.
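
A minimal sketch of that ordering, reusing the tokenizer and shingle settings from the quoted message below. The stop-word file name `stopwords.txt` is an assumption for illustration; the thread does not name one.

```xml
<!-- Sketch only: stopwords.txt is an assumed file name; adjust to your schema. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <!-- Remove stop words first so the words themselves are not shingled. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  <!-- Emit single words plus two-word shingles joined without a separator,
       so "sound stage" also yields the token "soundstage". -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="true" tokenSeparator="" />
</analyzer>
```

One caveat, as far as I know: where a stop word was removed, the shingle filter inserts a filler token (by default `_`), so shingles that span a removed word can contain it. The `fillerToken` attribute of `ShingleFilterFactory` controls that. The query analyzer would presumably stay without shingles, so that a query for "soundstage" matches the glued index token.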

On Fri, Aug 3, 2018 at 8:10 AM, Clemens Wyss DEV <clemensdev@mysign.ch>
wrote:

> <analyzer type="index">
>   <tokenizer class="solr.WhitespaceTokenizerFactory" />
>   <filter class="solr.LowerCaseFilterFactory" />
>   <!-- here we go! -->
>   <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>           outputUnigrams="true" tokenSeparator="" />
> </analyzer>
>
> seems to "work"
>
> -----Original Message-----
> From: Clemens Wyss DEV <clemensdev@mysign.ch>
> Sent: Friday, 3 August 2018 13:46
> To: solr-user@lucene.apache.org
> Subject: RE: indexing two words, searching single word
>
> >Because you probably are not looking for "andthe" kind of tokens
> (unfortunately) I guess I am, as we don't know what people enter...
>
> > a shingle plus regex to remove whitespace
> sounds interesting. What would that filter chain look like? Would it go in
> a type="index" analyzer?
> I guess we could shingle after stop-word filtering, and I guess
> maxShingleSize="2" would suffice.
>
> -----Original Message-----
> From: Alexandre Rafalovitch <arafalov@gmail.com>
> Sent: Friday, 3 August 2018 13:33
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: indexing two words, searching single word
>
> But what is your generic problem, then? Because you probably are not
> looking for "andthe" kind of tokens.
>
> However a shingle plus regex to remove whitespace can give you "anytwo
> wordstogether smooshed" tokens in the index.
>
> Regards,
>      Alex
>
>
> On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV, <clemensdev@mysign.ch>
> wrote:
>
> > Hi Markus,
> > thanks for the quick answer.
> >
> > "sound stage" was just an example. We are looking for a generic
> > solution ...
> >
> > Is it "ok" to apply an NGramFilter for query analysis?
> > <analyzer type="query">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >         <filter class="solr.LowerCaseFilterFactory" />
> >         <filter class="solr.NGramFilterFactory" minGramSize="3"
> >                 maxGramSize="15" />
> > </analyzer>
> >
> > I guess (besides the performance impact) this reduces the accuracy of
> > the search results?
> >
> > -Clemens
> >
> > -----Original Message-----
> > From: Markus Jelsma <markus.jelsma@openindex.io>
> > Sent: Friday, 3 August 2018 12:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: indexing two words, searching single word
> >
> > Hello,
> >
> > If your case is English, you could use synonyms to work around the
> > problem of the few compound words in the language. However, if you
> > are dealing with a Germanic compounding language, the
> > HyphenationCompoundWordTokenFilter [1] or the
> > DictionaryCompoundWordTokenFilter is a better choice. The
> > former is much more flexible but has its drawbacks.
> >
> > Regards,
> > Markus
> >
> >
> > [1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> >
> >
> >
> > -----Original message-----
> > > From: Clemens Wyss DEV <clemensdev@mysign.ch>
> > > Sent: Friday 3rd August 2018 12:22
> > > To: solr-user@lucene.apache.org
> > > Subject: indexing two words, searching single word
> > >
> > > Sounds like a rather simple issue:
> > > if I index "sound stage" and search for "soundstage" I get no hits
> > >
> > > What am I doing wrong
> > > a) when indexing
> > > b) when searching
> > > ?
> > >
> > > Thx in advance
> > > - Clemens
> > >
> >
>
