lucene-solr-user mailing list archives

From కామేశ్వర రావు భైరవభట్ల <kamesh...@gmail.com>
Subject Re: Search for misspelled words in corpus
Date Mon, 10 Jun 2013 04:59:00 GMT
Thanks everyone for the replies. I too had the same idea of a
pre-processing step. So, I first analyzed the corpus using a dictionary,
collected all the misspelled words, and created a separate Solr index from
those words. Now, when I search for a given query word, I first look for an
exact match in the original index (built from the text) and then run a
fuzzy search on the index of misspelled words. This gives more accurate
results. However, there is still an issue with some proper nouns (for
example, "Angie" shows up as a misspelled word and gets matched with a word
like "Anger" in the fuzzy search). But I think the precision is good
enough for us.
I wanted to confirm that there is no other built-in way in Solr to do this.

regards,
Kamesh
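For readers of the archive, the two-stage lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this thread and not Solr's API: the names `edit_distance`, `find_misspelled` and `search_terms`, the tiny word lists, and the distance threshold of 2 are all hypothetical. In-memory sets stand in for the two Solr indexes, and plain Levenshtein distance stands in for Solr's fuzzy matching.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def find_misspelled(corpus_words, dictionary):
    """Pre-processing step: corpus words that the dictionary does not know."""
    return {w for w in corpus_words if w not in dictionary}


def search_terms(query, corpus_words, misspelled, max_dist=2):
    """Exact match on the full vocabulary, fuzzy match only on misspellings."""
    exact = {query} if query in corpus_words else set()
    fuzzy = {w for w in misspelled if edit_distance(query, w) <= max_dist}
    return exact | fuzzy


# Hypothetical toy data; a real setup would use a full dictionary and corpus.
DICTIONARY = {"fight", "sight", "light", "anger"}
CORPUS_WORDS = {"fight", "figth", "feight", "sight", "light", "angie"}
MISSPELLED = find_misspelled(CORPUS_WORDS, DICTIONARY)

print(sorted(search_terms("fight", CORPUS_WORDS, MISSPELLED)))
# ['feight', 'fight', 'figth']
print(sorted(search_terms("anger", CORPUS_WORDS, MISSPELLED)))
# ['angie']
```

With this toy data, a query for "fight" picks up "figth" and "feight" but not "sight" or "light", because those are dictionary words and never enter the misspelled set. A query for "anger" still picks up "angie", which is the proper-noun problem mentioned above.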

On Sun, Jun 9, 2013 at 10:40 PM, Jagdish Nomula <jagdish@simplyhired.com> wrote:

> ngrams will definitely increase the index size. But the increase might
> not be very large, as the total number of possible trigrams is 26^3 and
> we are just storing a doc list with each ngram.
>
> Another variation of the above ideas would be to add a pre-processing step
> in which you analyze the input corpus to find the words that may be
> misspelled. You can use any of the word-based LSH algorithms to do this and
> then index selectively.
>
> This is a theoretical answer. You would have to cherry pick
> solutions/approaches for your use case.
>
> Thanks,
>
>
>
>
> On Sat, Jun 8, 2013 at 11:49 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
>
> > Hm, I was purposely avoiding mentioning ngrams because just ngramming
> > all indexed tokens would balloon the index.... My assumption was that
> > only *some* words are misspelled, in which case it may be better not
> > to ngram all tokens....
> >
> > Otis
> > --
> > Solr & ElasticSearch Support
> > http://sematext.com/
> >
> >
> >
> >
> >
> > On Sun, Jun 9, 2013 at 2:30 AM, Jagdish Nomula <jagdish@simplyhired.com> wrote:
> > > Another theoretical answer for this question is the ngrams approach. You
> > > can index the word and its trigrams. Query the index by the string as
> > > well as its trigrams, with a % match search. You then pass the exhaustive
> > > result set through a more expensive scoring such as Smith-Waterman.
> > >
> > > Thanks,
> > >
> > > Jagdish
> > >
> > >
> > > On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <skant@sloan.mit.edu> wrote:
> > >
> > >> n-grams might help, followed by an edit distance metric such as
> > >> Jaro-Winkler or Smith-Waterman-Gotoh to filter the results further.
> > >>
> > >>
> > >> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> > >>
> > >> > Interesting problem.  The first thing that comes to mind is to do
> > >> > "word expansion" during indexing.  Kind of like synonym expansion, but
> > >> > maybe a bit more dynamic. If you can have a dictionary of correctly
> > >> > spelled words, then for each token emitted by the tokenizer you could
> > >> > look up the dictionary and expand the token to all other words that
> > >> > are similar/close enough.  This would not be super fast, and you'd
> > >> > likely have to add some custom heuristic for figuring out what
> > >> > "similar/close enough" means, but it might work.
> > >> >
> > >> > I'd love to hear other ideas...
> > >> >
> > >> > Otis
> > >> > --
> > >> > Solr & ElasticSearch Support
> > >> > http://sematext.com/
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల <kameshbhr@gmail.com> wrote:
> > >> > > Hi,
> > >> > >
> > >> > > I have a problem where the text corpus we need to search contains
> > >> > > many misspelled words. The same word can also be misspelled in
> > >> > > several different ways, and the corpus also has documents with the
> > >> > > correct spellings. However, the search term that we give in the
> > >> > > query will always be spelled correctly. When we search on a term,
> > >> > > we would like to get all the documents that contain either the
> > >> > > correct or a misspelled form of the search term.
> > >> > > We tried fuzzy search, but it doesn't work as we expected. It
> > >> > > returns any close match, not specifically misspelled words. For
> > >> > > example, if I'm searching for a word like "fight", I would like to
> > >> > > get the documents that have words like "figth" and "feight", not
> > >> > > documents with words like "sight" and "light".
> > >> > > Is there any suggested approach for doing this?
> > >> > >
> > >> > > regards,
> > >> > > Kamesh
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > ***Jagdish Nomula*
> > > Sr. Manager Search
> > > Simply Hired, Inc.
> > > 370 San Aleso Ave., Ste 200
> > > Sunnyvale, CA 94085
> > >
> > > office - 408.400.4700
> > > cell - 408.431.2916
> > > email - jagdish@simplyhired.com <youremail@simplyhired.com>
> > >
> > > www.simplyhired.com
> >
>
>
>
>
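
The trigram idea quoted in the thread above (index each word's trigrams, query with a percentage match, then re-score the candidates with a more expensive metric) could look roughly like the sketch below. It is a hedged illustration, not anyone's production code: the function names, the 0.3 match threshold, the toy word list, and the scoring weights (match +2, mismatch -1, gap -1) are assumptions, and a dict of sets stands in for a real Solr index. The re-scoring step is a basic Smith-Waterman local alignment with a linear gap penalty.

```python
from collections import defaultdict


def trigrams(word):
    """All distinct 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def build_index(words):
    """Inverted index: trigram -> set of words containing it."""
    index = defaultdict(set)
    for w in words:
        for g in trigrams(w):
            index[g].add(w)
    return index


def candidates(query, index, min_match=0.3):
    """Words that share at least min_match of the query's trigrams."""
    grams = trigrams(query)
    hits = defaultdict(int)
    for g in grams:
        for w in index.get(g, ()):
            hits[w] += 1
    return {w for w, n in hits.items() if n / len(grams) >= min_match}


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, index, min_match=0.3):
    """Cheap trigram filter first, then the more expensive re-scoring."""
    cands = candidates(query, index, min_match)
    return sorted(cands, key=lambda w: smith_waterman(query, w), reverse=True)


# Hypothetical toy vocabulary.
INDEX = build_index(["fight", "figth", "feight", "sight", "anger"])
print(search("fight", INDEX))
```

Note that with this toy data "sight" also survives the trigram filter, so this pipeline on its own does not separate misspellings from other valid words; that is the gap the dictionary-based pre-processing step in the top reply addresses.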
