lucene-solr-user mailing list archives

From "Benson Margulies" <bimargul...@gmail.com>
Subject Re: Language support
Date Thu, 20 Mar 2008 16:43:46 GMT
You can store in one field if you manage to hide a language code with the
text. XML is overkill but effective for this. At one point, we'd
investigated how to allow a Lucene analyzer to see more than one field (the
language code as well as the text) but I don't think we came up with
anything.
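
One way to picture the single-field idea is the rough sketch below. It is plain Python rather than Lucene or Solr code, and everything in it is an assumption made for illustration: the `[lang=xx]` prefix format, the names `tag` and `analyze`, and the toy tokenizers standing in for real analyzers. The point is only that the language code travels inside the stored text, so whatever analyzes the field can strip the code and dispatch on it.

```python
import re

# Hypothetical stand-ins for real per-language analyzers.
def tokenize_en(text):
    # Lowercased word tokens; no stemming in this toy version.
    return [t.lower() for t in re.findall(r"\w+", text)]

def tokenize_zh(text):
    # Naive one-token-per-character split, purely illustrative.
    return [ch for ch in text if not ch.isspace()]

TOKENIZERS = {"en": tokenize_en, "zh": tokenize_zh}

# Assumed tag format: "[lang=xx]" at the very start of the field value.
TAG = re.compile(r"^\[lang=([a-z]{2})\](.*)$", re.DOTALL)

def tag(lang, text):
    """Hide the language code in front of the text before indexing."""
    return f"[lang={lang}]{text}"

def analyze(tagged):
    """Strip the language code and run the matching tokenizer."""
    m = TAG.match(tagged)
    if m is None:
        raise ValueError("missing language tag")
    lang, text = m.group(1), m.group(2)
    if lang not in TOKENIZERS:
        raise ValueError(f"no tokenizer registered for language {lang!r}")
    return lang, TOKENIZERS[lang](text)
```

The same `tag()` call would have to be applied to queries as well, so that index-time and query-time analysis agree on the language.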


On Thu, Mar 20, 2008 at 12:39 PM, David King <dking@ketralnis.com> wrote:

> > Unless you can come up with language-neutral tokenization and
> > stemming, you
> > need to:
> > a) know the language of each document.
> > b) run a different analyzer depending on the language.
> > c) force the user to tell you the language of the query.
> > d) run the query through the same analyzer.
>
> I can do all of those. This implies storing all of the different
> languages in different fields, right? Then changing the default search
> field to the language of the query for every query?
>
>
> >
> >
> >
> >
> > On Thu, Mar 20, 2008 at 12:17 PM, David King <dking@ketralnis.com>
> > wrote:
> >
> >>> You may be interested in a recent discussion that took place on a
> >>> similar
> >>> subject:
> >>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
> >>
> >> Interesting, yes. But since it doesn't actually exist, it's not much
> >> help.
> >>
> >> I guess what I'm asking is, if my approach seems convoluted, I'm
> >> probably doing it wrong, so how are people solving the problem of
> >> searching over multiple languages? What is the canonical way to do
> >> this?
> >>
> >>
> >>>
> >>>
> >>> Nicolas
> >>>
> >>> -----Original Message-----
> >>> From: David King [mailto:dking@ketralnis.com]
> >>> Sent: Wednesday, 19 March 2008 20:07
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Language support
> >>>
> >>> This has probably been asked before, but I'm having trouble finding
> >>> it. Basically, we want to be able to search for content across
> >>> several
> >>> languages, given that we know what language a datum and a query are
> >>> in. Is there an obvious way to do this?
> >>>
> >>> Here's the longer version: I am trying to index content that
> >>> occurs in
> >>> multiple languages, including Asian languages. I'm in the process of
> >>> moving from PyLucene to Solr. In PyLucene, I would have a list of
> >>> analysers:
> >>>
> >>>    analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
> >>>                     cs = pyluc.CzechAnalyzer(),
> >>>                     pt = pyluc.SnowballAnalyzer("Portuguese"),
> >>>                     ...
> >>>
> >>> Then when I want to index something, I do
> >>>
> >>>   writer = pyluc.IndexWriter(store, analyzer, create)
> >>>   writer.addDocument(d.doc)
> >>>
> >>> That is, I tell Lucene the language of every datum, and the analyser
> >>> to use when writing out the field. Then when I want to search
> >>> against
> >>> it, I do
> >>>
> >>>    analyzer = LanguageAnalyzer.getanal(lang)
> >>>    q = pyluc.QueryParser(field, analyzer).parse(value)
> >>>
> >>> And use that QueryParser to parse the query in the given language
> >>> before sending it off to PyLucene. (off-topic: getanal() is
> >>> perhaps my
> >>> favourite function-name ever). So the language of a given datum is
> >>> attached to the datum itself. In Solr, however, this appears to be
> >>> attached to the field, not to the individual data in it:
> >>>
> >>>    <fieldType name="text_greek" class="solr.TextField">
> >>>      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
> >>>    </fieldType>
> >>>
> >>> Does this mean there's no way to have a single "contents"
> >>> field
> >>> that has content in multiple languages, and still have the queries
> >>> be
> >>> parsed and stemmed correctly? How are other people handling this?
> >>> Does
> >>> it make sense to write a tokeniser factory and a query factory that
> >>> look at, say, the 'lang' field and return the correct tokenisers?
> >>> Does
> >>> this already exist?
> >>>
> >>> The other alternative is to have a text_zh field, a text_en field,
> >>> etc, and to modify the query to search on that field depending on
> >>> the
> >>> language of the query, but that seems kind of hacky to me,
> >>> especially
> >>> if a query may be against more than one language. Is this the
> >>> accepted
> >>> way to go about it? Is there a benefit to this method over writing a
> >>> detecting tokeniser factory?
> >>
> >>
>
>
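
For comparison, the field-per-language alternative discussed in the thread (a text_en field, a text_zh field, and so on) could be routed with something like the sketch below. It is plain Python that only builds a document dict and a query string. The helper names `field_for`, `make_document`, and `make_query` are hypothetical, and wrapping the user's text as a quoted phrase is an assumption made to keep the escaping simple. It presumes the schema already defines one field per language, each with its own analyzer.

```python
def field_for(lang):
    """Map a language code to its per-language field, e.g. 'en' -> 'text_en'."""
    return f"text_{lang}"

def make_document(doc_id, lang, text):
    """Build a document whose text lands in the field for its language."""
    return {"id": doc_id, "lang": lang, field_for(lang): text}

def make_query(langs, value):
    """Build a query string that searches the field of each given language.

    The value is emitted as a quoted phrase, with backslashes and double
    quotes escaped so it cannot break out of the quotes.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join(f'{field_for(lang)}:"{escaped}"' for lang in langs)
```

Passing more than one language code to `make_query` covers the case of a query that may be against more than one language, at the cost of one clause per language.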