drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: UTF coding in JSON docs
Date Tue, 10 Mar 2015 00:38:54 GMT
I should have clarified more. It turns out the encoding is UTF8, but various characters (since
UTF8 supports all unicode and then some) within the encoding was causing issues for tools
using Drill.
Using regex_replace to replace undesired characters resolved the issue since the characters
are not used and only caused issues.

Thanks
—Andries


On Mar 9, 2015, at 12:42 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> This is dangerous if the source file does not contain UTF-8 but you treat
> it as if it does.  The problem is that characters with high bit set will be
> treated as leading characters of a multi-byte UTF-8 character encoding.
> 
> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> The answer for my problem was to use regexp_replace.
>> 
>> In this case just using an underscore to replace any characters outside of
>> the expected range using the UTF8 character set table.
>> 
>> Instead of just selecting the column used this
>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>> 
>> The ^ in the front is to select the inverse (or any characters not
>> specified in the range of the square brackets []
>> 
>> (Make a note of this as I’m sure many will run into this issues in the
>> future with various JSON data).
>> 
>> —Andries
>> 
>> 
>> 
>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <aengelbrecht@maprtech.com>
>> wrote:
>> 
>>> How can I convert JSON data with various characters in a text field to a
>> more usable UTF8 encoding? (even if characters not in UTF8 is dropped)
>>> Or if needs be filter out the records with characters not in UTF8?
>>> 
>>> CONVERT_TO is for binary and cast as varchar still passes these
>> characters through.
>>> 
>>> Results in sqlline works, but when using tools through ODBC various
>> failures occur (especially tools not capable of handling characters outside
>> of UTF8).
>>> 
>>> 
>>> Thanks
>>> —Andries
>> 
>> 


Mime
View raw message