lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Advice on Stemming in Solr
Date Sat, 04 Nov 2017 15:28:59 GMT
Hi Emir,

We are looking at the configuration, to try to adjust the rules to suit our
use case.

Regards,
Edwin


On 3 November 2017 at 16:24, Emir Arnautović <emir.arnautovic@sematext.com>
wrote:

> Hi Edwin,
> Hunspell is a configurable, language-independent library, and you can
> define any morphology rules. It's been there for a while, and I would not
> be surprised if someone has already adjusted the English rules to suit
> your case.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Nov 2017, at 04:25, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >
> > Hi Emir,
> >
> > We are looking to change to HunspellStemFilterFactory. This has a
> > dictionary file containing words and applicable flags, and an affix file
> > that specifies how these flags will control spell checking.
> > Probably we can control it from those files in HunspellStemFilterFactory?
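
A minimal sketch of how such a filter might be declared in the schema. The
field type name and the file names en_GB.dic and en_GB.aff are assumptions
for illustration, not taken from this thread; both files would need to sit
in the collection's conf directory.

```xml
<!-- Hypothetical field type; name and file names are assumptions. -->
<fieldType name="text_en_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary: words plus flags; affix: the rules those flags refer to. -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_GB.dic"
            affix="en_GB.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Which words get stemmed then depends on the entries and flags in the two
files, so that is where any adjustment would go.
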
> >
> > Regards,
> > Edwin
> >
> >
> > On 2 November 2017 at 17:46, Emir Arnautović <emir.arnautovic@sematext.com>
> > wrote:
> >
> >> Hi Edwin,
> >> It seems that it would be best if you do not apply the *ing stemming
> >> rule at all. The first idea is to trick the stemmer by rewriting the
> >> ending of any word that ends with "ing" to some nonexistent character
> >> combination, e.g. 'wqx'. You can use solr.PatternReplaceFilterFactory
> >> to do that. You can switch it back after stemming if you want to have
> >> the proper token in the index.
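
A rough sketch of that trick as an analyzer chain, assuming the stemmer in
use is KStemFilterFactory and that it leaves tokens ending in "wqx"
untouched (an assumption worth checking on the Analysis screen):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Hide the "ing" ending from the stemmer: "ximenting" becomes "ximentwqx". -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="ing$" replacement="wqx" replace="all"/>
  <filter class="solr.KStemFilterFactory"/>
  <!-- Restore the original ending after stemming. -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="wqx$" replacement="ing" replace="all"/>
</analyzer>
```

Note that this switches off "ing" stemming for every word, English ones
included, so a search for "walk" would no longer match "walking".
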
> >>
> >> HTH,
> >> Emir
> >>
> >>
> >>
> >>> On 2 Nov 2017, at 03:23, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>>
> >>> Hi Emir,
> >>>
> >>> We do have quite a lot of words that should not be stemmed. Currently,
> >>> KStemFilterFactory is stemming all the non-English words that end with
> >>> "ing" as well. There are quite a lot of places and names which end in
> >>> "ing", and all of these are being stemmed as well, which leads to
> >>> inaccurate search results.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>>
> >>> On 1 November 2017 at 18:20, Emir Arnautović <emir.arnautovic@sematext.com>
> >>> wrote:
> >>>
> >>>> Hi Edwin,
> >>>> If the number of words that should not be stemmed is not high, you
> >>>> could use KeywordMarkerFilterFactory to flag those words as keywords,
> >>>> and it should prevent the stemmer from changing them.
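
A minimal sketch of such a chain. The field type name and the file name
protwords.txt are assumptions; the file would list one protected word per
line (for example "ximenting"), in lower case because the marker runs after
lowercasing here.

```xml
<!-- Hypothetical field type; name and protected-words file are assumptions. -->
<fieldType name="text_en_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Must come before the stemmer: tokens found in protwords.txt are
         flagged as keywords so the stemmer leaves them alone. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```
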
> >>>> Depending on what you want to achieve, you might not be able to avoid
> >>>> using a stemmer at indexing time. If you want to find documents that
> >>>> contain only “walking” with the search term “walk”, then you have to
> >>>> stem at index time. Cases where you use stemming at query time only
> >>>> are rare and specific.
> >>>> If you want to prefer exact matches over stemmed matches, you have to
> >>>> index the same content with and without stemming and boost matches on
> >>>> the field without stemming.
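
One way this could look in the schema, as a sketch only. The field names
"content" and "content_exact" and the type names are hypothetical;
"text_en_kstem" stands for any stemmed type and "text_general" for any
unstemmed one.

```xml
<!-- Stemmed field: "walk" also matches documents containing "walking". -->
<field name="content" type="text_en_kstem" indexed="true" stored="true"/>
<!-- Unstemmed copy: matches only the exact surface form. -->
<field name="content_exact" type="text_general" indexed="true" stored="false"/>
<!-- Index the same text into both fields. -->
<copyField source="content" dest="content_exact"/>
```

A query could then search both fields and weight the unstemmed one higher,
for example with defType=edismax and qf=content content_exact^2, where the
boost value 2 is only illustrative.
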
> >>>>
> >>>> HTH,
> >>>> Emir
> >>>>
> >>>>
> >>>>
> >>>>> On 1 Nov 2017, at 10:11, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are currently using KStemFilterFactory in Solr, but we found that
> >>>>> it is actually stemming non-English words like "ximenting", which it
> >>>>> stems to "ximent". This is not what we want.
> >>>>>
> >>>>> Another option is to use the HunspellStemFilterFactory, but there are
> >>>>> some English words like "running" and "walking" that are not being
> >>>>> stemmed.
> >>>>>
> >>>>> We would like to check: is it advisable to use stemming at index time?
> >>>>> Or should we skip stemming at index time and instead, at query time,
> >>>>> search for the stemmed words as well? For example, if the user searches
> >>>>> for "walking", we would also search for "walk", and the actual word
> >>>>> "walking" would have a higher weighting.
> >>>>>
> >>>>> I'm currently using Solr 6.5.1.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>
> >>>>
> >>
> >>
>
>
