drill-dev mailing list archives

From Daniel Barclay <dbarc...@maprtech.com>
Subject Re: [DISCUSS] Processing non-printable characters in Drill
Date Thu, 22 Oct 2015 19:51:13 GMT
Khurram Faraaz wrote:
> ... It looks like Drill processes
> non-printable characters in both cases, with and without the new text
> reader (exec.storage.enable_new_text_reader)
> Should we throw an error since these are non-printable characters ?
No, I don't think so.  Does there seem to be any need to reject non-printable characters?

> ...
> Content from the csv file used in test
> 1,^A
> 2,^B
> 3,^C
> 4,^D
> 5,^E
> 6,^F
> 0: jdbc:drill:schema=dfs.tmp> select * from `nonPrintables.csv`;
> +-----------------+
> |     columns     |
> +-----------------+
> | ["1","\u0001"]  |
> | ["2","\u0002"]  |
> | ["3","\u0003"]  |
> | ["4","\u0004"]  |
> | ["5","\u0005"]  |
> | ["6","\u0006"]  |
> +-----------------+
> 6 rows selected (0.521 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select columns[1] from `nonPrintables.csv`;
> +---------+
> | EXPR$0  |
> +---------+
> |        |
> |        |
> |        |
> |        |
> |        |
> |        |
> +---------+
> 6 rows selected (0.382 seconds)
Note what's going on there (re the difference between those two outputs):

In the first case, the strings with unprintable characters go through Drill's conversion of
a value of a complex type (e.g., VARCHAR ARRAY) to a JSON string (in order to have a string
to return through the JDBC API).  That conversion encodes string (VARCHAR) values as JSON
string tokens, using JSON's escape sequences for the unprintable characters.  Finally, the
resultant JSON string (the whole string of JSON, not the JSON string token) is displayed by
SQLLine or the web UI or whatever.  (And don't forget the step of your copying and pasting
into your message.)
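
As a rough illustration of that encoding step, here is a minimal sketch in Java.  It is not Drill's actual conversion code; the class and method names (JsonEscapeSketch, toJsonStringToken) are made up for this example.  It only shows how a control character such as Ctrl-A (U+0001) ends up as a six-character escape sequence inside a JSON string token.

```java
// Hypothetical helper, NOT Drill's actual code: encodes a Java String as a
// JSON string token, escaping the characters that JSON requires be escaped.
class JsonEscapeSketch {
    static String toJsonStringToken(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters get the generic
                        // backslash-u escape with four hex digits.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Mirrors the first row of the quoted CSV: "1" and Ctrl-A (U+0001).
        String ctrlA = String.valueOf((char) 0x01);
        System.out.println("[" + toJsonStringToken("1") + ","
                + toJsonStringToken(ctrlA) + "]");
    }
}
```

The point is that the escaping happens before the client ever sees the value, so the client receives only printable characters in this case.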

In the second case, the core part of Drill is directly returning the character strings from
the data through the JDBC API.  Then, SQLLine or the web UI or whatever is deciding how to
display those strings--including how to handle any special, e.g., unprintable, characters.  Evidently,
SQLLine doesn't render unprintable characters into some visible form.  It probably just writes
them to your terminal's output stream.  Since your terminal doesn't render them specially
either, the characters still aren't visible, and when you copied and pasted to compose your
e-mail message, there was nothing from those special characters to copy.

(Actually, the non-printable characters are slightly visible--note how the six lines with
visually blank values have terminating vertical-bar characters that don't line up with the
other terminating "+" or "|" characters.)

From the point of view of the core part of Drill, it's up to the client of the JDBC API to
decide how to display values, including character strings with unprintable characters.  (The
JDBC API returns the Java representations (String objects) of the VARCHAR values.)

However, from the point of view of users, SQLLine (and Drill's web UI too) should render all
values visibly, including character strings with unprintable characters.
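
To make that concrete, here is a minimal sketch of what client-side rendering could look like.  This is not SQLLine's or the web UI's code; the names (VisibleRenderSketch, renderVisibly) are hypothetical, and the choice of a backslash-u notation for the visible form is just one assumption among several reasonable ones.

```java
// Hypothetical display helper, NOT part of SQLLine or Drill: replaces each
// ISO control character in a value with a visible escape before printing.
class VisibleRenderSketch {
    static String renderVisibly(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isISOControl(c)) {
                // Show the code point instead of sending the raw control
                // character to the terminal.
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A value like the second column of the quoted CSV (Ctrl-B, U+0002).
        String ctrlB = String.valueOf((char) 0x02);
        System.out.println("| " + renderVisibly(ctrlB) + " |");
    }
}
```

Note that this sketch does not escape a literal backslash in the data, so its output is for human reading only and cannot be unambiguously converted back to the original value.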

(They should also render byte strings competently, e.g., rendering in hex the bytes themselves
rather than displaying in hex the hash code of the Java byte array object that contains (a
specific copy of) the bytes of the byte string(!).)
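
For anyone unfamiliar with the byte-array problem, this small Java sketch shows the difference.  The class and method names (ByteStringRenderSketch, toHex) are made up for illustration, and this is only an assumption about how such output arises: calling toString() on a Java byte array yields the type tag "[B@" plus the object's hash code in hex, not the bytes.

```java
// Illustration only: contrasts the default toString() of a byte array with
// a hex rendering of the bytes themselves.
class ByteStringRenderSketch {
    static String toHex(byte[] bytes) {
        StringBuilder out = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x02, (byte) 0xFF};
        // Prints "[B@" followed by an identity hash code that varies per run.
        System.out.println(data.toString());
        // Prints the actual content: 0102FF
        System.out.println(toHex(data));
    }
}
```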


Daniel Barclay
MapR Technologies
