lucene-solr-user mailing list archives

From "Benson Margulies" <bimargul...@gmail.com>
Subject Re: Language support
Date Thu, 20 Mar 2008 18:05:36 GMT
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis.
All that makes sense.

On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood <wunderwood@netflix.com>
wrote:

> Extreme, but guaranteed to work and it avoids bad IDF when there are
> inter-language collisions. In Ultraseek, we only stored the hash, so
> the size of the source token didn't matter.
>
> Trademarks are a bad source of collisions and anomalous IDF. If you have
> LaserJet support docs in 20 languages, the term "LaserJet" will have
> a document frequency 20X higher than the terms in a single language
> and will score too low.
>
> Ultraseek handles macaronic documents when the script makes it possible:
> for example, Roman text in a Japanese document is sent to the English
> stemmer, and Hangul always goes to the Korean segmenter/stemmer.
>
> A simpler approach is to tag each document with a language, like
> "lang:de", then use a filter query to restrict the documents to the
> query language.
>
> Per-token tagging still strikes me as the "right" approach. It makes
> all sorts of things work, like keeping fuzzy matches within the same
> language. We didn't do it in Ultraseek because it would have been an
> incompatible index change and the benefit didn't justify that.
>
> wunder
> ==
> Walter Underwood
> Former Ultraseek Architect
> Current Entire Netflix Search Department
>
> On 3/20/08 9:45 AM, "Benson Margulies" <bimargulies@gmail.com> wrote:
>
> > Token-by-token seems a bit extreme. Are you concerned with macaronic
> > documents?
> >
> > On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <wunderwood@netflix.com>
> > wrote:
> >
> >> Nice list.
> >>
> >> You may still need to mark the language of each document. There are
> >> plenty of cross-language collisions: "die" and "boot" have different
> >> meanings in German and English. Proper nouns ("Laserjet") may be the
> >> same in all languages, a different problem if you are trying to get
> >> answers in one language.
> >>
> >> At one point, I considered using Unicode language tagging on each
> >> token to keep it all straight. Effectively, index "de/Boot" or
> >> "en/Laserjet".
> >>
> >> wunder
> >>
> >> On 3/20/08 9:20 AM, "Benson Margulies" <bimargulies@gmail.com> wrote:
> >>
> >>> Unless you can come up with language-neutral tokenization and
> >>> stemming, you need to:
> >>>
> >>> a) know the language of each document.
> >>> b) run a different analyzer depending on the language.
> >>> c) force the user to tell you the language of the query.
> >>> d) run the query through the same analyzer.
> >>
> >>
> >>
>
>

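For anyone reading this thread later, here is a rough sketch of the per-token tagging idea ("de/Boot", "en/Laserjet") and its effect on document frequency. It is not code from Ultraseek, Lucene, or Solr. The function names, the tiny corpus, and the plain log(N / df) formula are all assumptions made for illustration; real engines use smoothed IDF variants.

```python
import math
from collections import Counter


def document_frequencies(docs, tag_tokens):
    """Count how many documents contain each term.

    docs is a list of (language, tokens) pairs. With tag_tokens=True each
    term is indexed as "<lang>/<token>" (hypothetical format, e.g. "de/boot").
    """
    df = Counter()
    for lang, tokens in docs:
        # Use a set so a term counts once per document.
        terms = {
            f"{lang}/{tok.lower()}" if tag_tokens else tok.lower()
            for tok in tokens
        }
        df.update(terms)
    return df


def idf(df, term, num_docs):
    """Textbook IDF, log(N / df). Assumes the term occurs in the index."""
    return math.log(num_docs / df[term])


# Made-up corpus: one trademark shared by three languages, plus "boot",
# which means different things in English and German.
docs = [
    ("en", ["LaserJet", "paper", "jam"]),
    ("en", ["boot", "the", "printer"]),
    ("de", ["LaserJet", "Papierstau"]),
    ("de", ["das", "Boot"]),
    ("fr", ["LaserJet", "bourrage", "papier"]),
    ("fr", ["imprimante", "papier"]),
]

plain = document_frequencies(docs, tag_tokens=False)
tagged = document_frequencies(docs, tag_tokens=True)

# Untagged, "laserjet" and "boot" pool their counts across languages,
# which lowers their IDF. Tagged, each language keeps its own count.
for term in ("laserjet", "boot"):
    print(term, plain[term], round(idf(plain, term, len(docs)), 3))
for term in ("en/laserjet", "en/boot", "de/boot"):
    print(term, tagged[term], round(idf(tagged, term, len(docs)), 3))
```

With untagged terms the trademark is counted once per language, so its IDF drops relative to ordinary single-language words. With tagged terms "en/laserjet" has the same document frequency as any other English term that appears in one document. The simpler "lang:de" filter approach keeps results in one language but leaves the pooled document frequencies as they are.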