lucene-solr-user mailing list archives

From "Jayson Minard" <jayson.min...@gmail.com>
Subject Re: multi-language searching with Solr
Date Thu, 08 May 2008 14:53:40 GMT
Using distributed search to solve the language problem is very similar
to using multiple fields to solve it.  You need a different analyzer
for each language, and distributed search gives you another way to get
that: you trade management of multiple fields for management of
multiple cores or shards.  In your case that may be a favorable trade!
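
As a rough sketch of the "one core per language" side of that trade,
indexing code could pick the update URL from a document's language.
The host, the core names and the "language" key below are made up for
illustration; nothing here comes from an actual setup.

```python
# Hypothetical mapping from language code to core URL (illustrative only).
CORE_BY_LANGUAGE = {
    "de": "http://localhost:8983/solr/core_de",
    "en": "http://localhost:8983/solr/core_en",
}


def update_url_for(doc):
    """Return the update URL of the core whose analyzers match doc's language."""
    lang = doc["language"]
    try:
        base = CORE_BY_LANGUAGE[lang]
    except KeyError:
        raise ValueError(f"no core configured for language {lang!r}")
    return base + "/update"


# A German document is routed to the German core.
german_target = update_url_for({"id": "1", "language": "de"})
```

With the multiple-fields approach the same decision would instead pick
a field name (say, a per-language text field) inside a single core.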

The analyzer differences between the schemas of your shards or cores
won't be a blocking point; Solr does not care much about differences
between shards.  What it does care about is that any field you mention
in a query (field list, sort order, facet) could "possibly" be a valid
field name in the other shards, meaning it is a static field or passes
a dynamic field rule.  It doesn't actually have to be there at all.
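
A minimal sketch of such a request, assuming two per-language cores on
made-up hosts: the query goes to one core and names the others in the
"shards" parameter.  The host names, core names and field names are
hypothetical; only "q", "fl", "sort" and "shards" are real Solr request
parameters.

```python
from urllib.parse import urlencode


def distributed_query_url(coordinator, shards, query, fields, sort=None):
    """Build a /select URL that fans the query out to every listed shard.

    Every name in `fields` and `sort` has to be acceptable to the schema
    of each shard (a static field or a dynamic field match).
    """
    params = {
        "q": query,
        "fl": ",".join(fields),
        # Shards are listed as host:port/path, without the http:// scheme.
        "shards": ",".join(shards),
    }
    if sort:
        params["sort"] = sort
    return coordinator + "/select?" + urlencode(params)


url = distributed_query_url(
    "http://solr-de:8983/solr/core_de",  # hypothetical coordinating core
    ["solr-de:8983/solr/core_de", "solr-en:8983/solr/core_en"],
    "title:haus",
    ["id", "title", "score"],
    sort="score desc",
)
```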

This is a good track.  This discussion has typically ended with "do
the multiple fields approach", but now that the Solr world has changed
with distributed search, new ideas should be pursued, as there just
might be better ways for different use cases.  Language detection and
universal stemmers/analyzers might also be possible, or something else
we just haven't gotten to yet for Solr or Lucene.  I would also like to
hear whether anyone has implemented the sometimes-mentioned per-token
tagging of language in payloads or elsewhere, and how that has worked,
or even token prefixing with the language.  What do companies like
BasisTech, who are behind the linguistics for a lot of the big search
engines (and who work with Lucene), do to solve this?  Anything there
we can learn?
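
To make the token-prefixing idea concrete, here is one possible reading
of it, purely as an illustration: each token carries its language code,
so identical strings from different languages never collide in a shared
field.  The whitespace tokenization and the "de_"/"en_" prefix format
are assumptions; a real setup would do this inside the analysis chain.

```python
def prefix_tokens(text, lang):
    """Lowercase, split on whitespace, and prefix each token with its language code."""
    return [f"{lang}_{token}" for token in text.lower().split()]


german = prefix_tokens("Das Haus", "de")
english = prefix_tokens("The house", "en")
```

The query side would have to apply the same prefix for the language
being searched.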

Let us know how it works out!
--j

On Thu, May 8, 2008 at 2:51 AM, Gereon Steffens <gereon@steffens.org> wrote:
>
>
> > These are shards of one index and not multiple indexes.  There is
> > probably a way to get each shard to contain one language but then you
> > end up with x servers for x languages, and some will be underutilized
> > while others will be overutilized.
> >
>  Schemas will be identical, except for analysers. The language distribution
> I'm dealing with is about 60% German, 40% English. For availability reasons,
> each shard needs to run on at least two instances anyway, with a load
> balancer in front, so I think I'll be able to adjust utilization that way.
>
>  Gereon
>
