lucene-solr-user mailing list archives

From Peter Karich <peat...@yahoo.de>
Subject Re: verifying that an index contains ONLY utf-8
Date Thu, 13 Jan 2011 18:12:38 GMT
Also take a look at ICU4J, which is one of the contrib projects ...

> Converting on the fly is not supported by Solr but should be relatively
> easy in Java.
> Scanning is also relatively simple (accept only a range). Detection too:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the solr configuration?) that the documents that we index will
>> always be stored in utf-8? Can solr convert documents that need
>> converting on the fly, or can solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-utf8 characters? Or is there another way that I can
>> discover if any crept into my index?
>>
>
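
A minimal sketch of the Java-side checks mentioned in the quoted reply above,
using only the JDK. The class and method names (Utf8Check, isValidUtf8, toUtf8,
hasSuspectChars) are hypothetical, and the "accepted range" in hasSuspectChars
is only an assumption about what you might want to reject. toUtf8 assumes you
already know the source charset, or have guessed it with a detector such as
chardet or ICU4J. Run this on the raw bytes before they are sent to Solr:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Strict check: returns true only if the bytes are well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Conversion: decode with the (assumed known) source charset, re-encode as UTF-8.
    public static byte[] toUtf8(byte[] bytes, Charset source) {
        return new String(bytes, source).getBytes(StandardCharsets.UTF_8);
    }

    // Range scan on already-decoded text: flags the replacement character U+FFFD
    // and control characters other than tab, LF and CR (an assumed policy).
    public static boolean hasSuspectChars(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\uFFFD') return true;
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(latin1));
        System.out.println(isValidUtf8(toUtf8(latin1, StandardCharsets.ISO_8859_1)));
    }
}
```

A document that fails isValidUtf8 can be rejected or passed through toUtf8
before indexing. hasSuspectChars could be applied to stored field values read
back from the index to look for text that was decoded with the wrong charset
earlier on.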


-- 
http://jetwick.com open twitter search

