drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: UTF coding in JSON docs
Date Fri, 13 Mar 2015 17:59:25 GMT
Was having some issues with the hex codes on large data sets.

Using the following gives better results for most western languages.

regexp_replace(<column>, '[^ -~¡-ÿ]', '°')
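
(A minimal sketch of this pattern inside a full Drill query; the dfs path, table alias, and column name are illustrative placeholders, not from the thread, and the underscore replacement matches the earlier messages quoted below:)

SELECT regexp_replace(t.`comment_text`, '[^ -~¡-ÿ]', '_') AS clean_text
FROM dfs.`/data/sample.json` t;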

—Andries


On Mar 12, 2015, at 8:24 AM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:

> Just an update on this topic.
> 
> The regexp_replace function uses a single escape \ and then the hex code to function
> properly.
> The ^ is for inverse, so basically any characters that do not fall within the specified
> range are replaced.
> 
> I.e.
> regexp_replace(<column_name>, '[^\x00-\xad]', '°')
> 
> 
> I used this UTF-8 table for hex codes.
> http://www.utf8-chartable.de/unicode-utf8-table.pl
> 
> —Andries 
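
(A minimal sketch of the single-escape form above in a full Drill query; the dfs path and column name are placeholders, not from the thread, with an underscore as the replacement character:)

SELECT regexp_replace(t.`comment_text`, '[^\x00-\xad]', '_') AS clean_text
FROM dfs.`/data/sample.json` t;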
> 
> 
> 
> On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
> 
>> I should have clarified more. It turns out the encoding is UTF-8, but various characters
>> (since UTF-8 supports all Unicode and then some) within the encoding were causing issues
>> for tools using Drill.
>> Using regexp_replace to replace undesired characters resolved the issue since the
>> characters are not used and only caused issues.
>> 
>> Thanks
>> —Andries
>> 
>> 
>> On Mar 9, 2015, at 12:42 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> 
>>> This is dangerous if the source file does not contain UTF-8 but you treat
>>> it as if it does.  The problem is that characters with high bit set will be
>>> treated as leading characters of a multi-byte UTF-8 character encoding.
>>> 
>>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
>>> aengelbrecht@maprtech.com> wrote:
>>> 
>>>> The answer for my problem was to use regexp_replace.
>>>> 
>>>> In this case I just used an underscore to replace any characters outside of
>>>> the expected range, using the UTF-8 character set table.
>>>> 
>>>> Instead of just selecting the column, I used this:
>>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>>> 
>>>> The ^ in the front is to select the inverse (or any characters not
>>>> specified in the range of the square brackets []).
>>>> 
>>>> (Make a note of this as I’m sure many will run into this issue in the
>>>> future with various JSON data).
>>>> 
>>>> —Andries
>>>> 
>>>> 
>>>> 
>>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <aengelbrecht@maprtech.com>
>>>> wrote:
>>>> 
>>>>> How can I convert JSON data with various characters in a text field to a
>>>>> more usable UTF-8 encoding? (even if characters not in UTF-8 are dropped)
>>>>> Or if need be, filter out the records with characters not in UTF-8?
>>>>> 
>>>>> CONVERT_TO is for binary and cast as varchar still passes these
>>>>> characters through.
>>>>> 
>>>>> Results in sqlline work, but when using tools through ODBC various
>>>>> failures occur (especially with tools not capable of handling characters
>>>>> outside of UTF-8).
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>> 
>>>> 
>> 
> 

