lucene-solr-user mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Solr Multilingual Indexing with one field- Guidance
Date Fri, 08 May 2015 10:20:52 GMT
Is it possible to know a little bit more about the nature of that
multilingual field?
I can see the KeywordTokenizer and then a lot of n-grams calculated from that
single token.
What is that field used for?
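
To make the question concrete, here is a rough Python sketch of what I
understand the index analyzer to emit. It assumes CustomNGramFilterFactory
behaves like a plain character n-gram filter that also keeps the original
token; that class is custom, so its real behaviour may differ. The function
names (index_terms, query_term, matches) are purely illustrative.

```python
def index_terms(value, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate the index-time chain: keyword token -> lowercase -> n-grams."""
    # KeywordTokenizer: the whole field value is a single token.
    token = value.lower()
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])
    # Assumption: preserveOriginal keeps the full token when no gram covers it.
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


def query_term(query):
    """Approximate the query-time chain: keyword token -> lowercase, no grams."""
    return query.lower()


def matches(value, query):
    """A document matches when the single query term equals an indexed gram."""
    return query_term(query) in index_terms(value)


terms = index_terms("Solr Search")
print(len(terms))                       # number of terms stored for one value
print(matches("Solr Search", "R Sea"))  # substring spanning the space
print(matches("Solr Search", "so"))     # shorter than minGramSize
```

If that reading is right, the field acts as a case-insensitive substring
match on the whole value, for query strings of 3 to 30 characters (or the
full value), and the number of indexed terms grows quickly with value length.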

2015-05-07 19:23 GMT+01:00 Kuntal Ganguly <gangulykuntal1986@gmail.com>:

> Our current production index size is 1.5 TB with 3 shards. Currently we
> have the following field type:
>
> <fieldType name="text_ngram" class="solr.TextField"
> positionIncrementGap="100">
>
> <analyzer type="query">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> <analyzer type="index">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
> maxGramSize="30" preserveOriginal="true"/>
> </analyzer>
> </fieldType>
>
> The above field type is working well for our US and other English-language
> clients.
>
> Now we have some new Chinese and Japanese clients, so I searched Google and
> found:
>
> http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
>
> https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search
>
> Regarding the best approach to a multilingual index, there seem to be pros
> and cons associated with every approach.
>
> Then I experimented with a single-field approach, and here is my new field
> type:
>
> <fieldType name="text_multi" class="solr.TextField"
> positionIncrementGap="100">
>
> <analyzer type="query">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> </analyzer>
> <analyzer type="index">
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
> maxGramSize="30" preserveOriginal="true"/>
> </analyzer>
> </fieldType>
>
> I have kept the same tokenizer and only changed the filters. It is working
> well with all existing searches/use cases for English documents as well as
> the new use cases for Chinese/Japanese documents.
>
> Now I have the following questions for the Solr experts/developers:
>
> 1) Is this a correct approach? Or am I missing something?
>
> 2) Can you give me an example where there will be a problem with the new
> field type above? A use case/scenario with an example would be very helpful.
>
> 3) Could there be any problems in the future as different clients come on
> board?
>
> Please provide some guidance.
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
