lucene-solr-user mailing list archives

From "Benson Margulies" <bimargul...@gmail.com>
Subject Re: Language support
Date Thu, 20 Mar 2008 16:45:18 GMT
Token/by/token seems a bit extreme. Are you concerned with macaronic
documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <wunderwood@netflix.com>
wrote:

> Nice list.
>
> You may still need to mark the language of each document. There are
> plenty of cross-language collisions: "die" and "boot" have different
> meanings in German and English. Proper nouns ("Laserjet") may be the
> same in all languages, a different problem if you are trying to get
> answers in one language.
>
> At one point, I considered using Unicode language tagging on each
> token to keep it all straight. Effectively, index "de/Boot" or
> "en/Laserjet".
>
> wunder
>
> On 3/20/08 9:20 AM, "Benson Margulies" <bimargulies@gmail.com> wrote:
>
> > Unless you can come up with language-neutral tokenization and stemming,
> > you need to:
> >
> > a) know the language of each document.
> > b) run a different analyzer depending on the language.
> > c) force the user to tell you the language of the query.
> > d) run the query through the same analyzer.
>
>
>
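
The steps quoted above (know each document's language, analyze with a language-specific analyzer, and run the query through the same analyzer) can be combined with the "de/Boot" tagging idea in a small sketch. This is not Lucene or Solr code; the suffix rules, the `analyze` function, and the `TinyIndex` class are hypothetical stand-ins chosen for illustration, and a real stemmer would be far more involved.

```python
from collections import defaultdict

# Hypothetical toy suffix rules per language; real stemmers do much more.
SUFFIXES = {
    "en": ("ing", "s"),
    "de": ("en", "e"),
}


def analyze(text, lang):
    """Return language-prefixed terms such as 'en/boot' for the given text."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip('.,;:!?"()')
        if not word:
            continue
        for suffix in SUFFIXES[lang]:
            # Strip at most one suffix and keep at least three characters.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        terms.append(f"{lang}/{word}")
    return terms


class TinyIndex:
    """In-memory inverted index keyed by language-prefixed terms."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, lang):
        # Steps (a) and (b): the caller supplies the document language,
        # which selects the analyzer rules.
        for term in analyze(text, lang):
            self.postings[term].add(doc_id)

    def search(self, query, lang):
        # Steps (c) and (d): the query language is given explicitly and
        # the query goes through the same analyzer as the documents.
        terms = analyze(query, lang)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add("doc-en", "The boot sat by the door", "en")
index.add("doc-de", "Das Boot liegt im Hafen", "de")

print(index.search("boot", "en"))
print(index.search("Boot", "de"))
```

Because every term carries its language prefix, the English query for "boot" matches only the English document and the German query only the German one. The same prefixing would also split a shared proper noun such as "Laserjet" into separate per-language terms, so a query in one language would not find it in documents tagged with another.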
