lucene-solr-user mailing list archives

From Bess Sadler
Subject Re: Internationalization
Date Wed, 17 Jan 2007 15:41:02 GMT

On Jan 17, 2007, at 3:07 AM, Erik Hatcher wrote:

> Why are you assigning all fields to a "string" type?  That indexes  
> each field as-is, with no tokenization at all.  How are you using  
> that field from the front-end?   I'd think you'd want to copyField  
> everything into a "text" field.

The short answer is there is no good reason for this. I guess I just  
hadn't thought too hard yet about the difference between string and  
text. This particular project is a gazetteer, so we're mostly  
indexing proper names (e.g. "China" and "中国"), which are mostly one
word and so don't need much tokenization anyway. But of course this
isn't true for all our fields, and even some proper names (e.g., "lha  
sa") might benefit from tokenization.
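
As a rough illustration of the string/text difference (plain Python, not
Solr's actual analysis chain; the helper names below are made up for this
sketch, and real "text" field types typically do more than lowercase and
split on whitespace):

```python
def string_field_terms(value):
    # A "string"-style field indexes the whole value as one term, as-is.
    return [value]


def text_field_terms(value):
    # A "text"-style field is tokenized. This is a simplified stand-in:
    # lowercase, then split on whitespace.
    return value.lower().split()


def matches(query, terms):
    # A term query only hits if the query equals one of the indexed terms.
    return query in terms


if __name__ == "__main__":
    for name in ["China", "中国", "lha sa"]:
        print(name, string_field_terms(name), text_field_terms(name))
    # Searching for "lha" misses the untokenized field but hits the tokenized one.
    print(matches("lha", string_field_terms("lha sa")))
    print(matches("lha", text_field_terms("lha sa")))
```

For one-word names the two approaches index the same single term (apart
from case), which is why the difference only shows up on multi-word values
like "lha sa".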

I've been planning to separately index all our Chinese text with the
ChineseAnalyzer (à la pages 142-145 in Lucene in Action), and Ed
Garrett (who I think is also on this list... hi, Ed!) at U Michigan
is working on a Tibetan analyzer that I also want to use. I just
haven't got that far yet.
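
One possible way to decide which analyzer a given name should go through is
to look at the Unicode block of its characters. The sketch below is only an
assumption about how that routing might work: the field names (name_zh,
name_bo, name_text) are hypothetical, and it checks just the basic CJK
Unified Ideographs and Tibetan blocks, not the extension blocks.

```python
# Hypothetical per-language field names; not taken from any real schema.
FIELD_FOR_SCRIPT = {
    "han": "name_zh",      # would be analyzed with a Chinese analyzer
    "tibetan": "name_bo",  # would be analyzed with a Tibetan analyzer
    "other": "name_text",  # generic tokenized text field
}


def detect_script(text):
    """Return "han", "tibetan", or "other" based on character counts."""
    counts = {"han": 0, "tibetan": 0}
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs block
            counts["han"] += 1
        elif 0x0F00 <= cp <= 0x0FFF:  # Tibetan block
            counts["tibetan"] += 1
    if counts["han"] == 0 and counts["tibetan"] == 0:
        return "other"
    # Pick whichever script has more characters in the value.
    return max(counts, key=counts.get)


def route(value):
    """Pair a value with the (hypothetical) field it would be indexed in."""
    return FIELD_FOR_SCRIPT[detect_script(value)], value


if __name__ == "__main__":
    for name in ["China", "中国", "lha sa"]:
        print(route(name))
```

Note that a romanized name such as "lha sa" lands in the generic field here,
since script detection only sees Latin characters.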

So now I'm all motivated to go re-write this thing so that it processes
each language properly. Maybe I'll write something up for the wiki  
when I'm done.

Thanks again, Erik.


Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
(434) 243-2305
