drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robin Moffatt <robin.moff...@rittmanmead.com>
Subject Re: Querying Delimited Sequence file
Date Wed, 31 Aug 2016 08:50:08 GMT
Thanks, SPLIT_PART looks useful.

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
split_part(version,'.',1),split_part(version,'.',2),split_part(version,'.',3)
from sys.version;
+---------+---------+---------+
| EXPR$0  | EXPR$1  | EXPR$2  |
+---------+---------+---------+
| 1       | 7       | 0       |
+---------+---------+---------+
1 row selected (0.351 seconds)

But used with my actual data (sequence file), I get an error. I've
successfully SPLIT it using CHR(1) for the \x01 delimiter:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
split(CONVERT_FROM(binary_value, 'UTF8'),chr(1)) from
 `/user/oracle/seq/pdb.soe.logon` limit 1;
+--------+
| EXPR$0 |
+--------+
| ["\u0000\u0000\u0000|I","PDB.SOE.LOGON","2016-08-30
10:34:01.000145","2016-08-30T11:34:07.934000","00000000000001558898","","338328","13645","2016-08-30:11:34:01"]
|
+--------+

But if I now try to access one of those elements, it errors:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),1) from
 `/user/oracle/seq/pdb.soe.logon` limit 5;
Error: SYSTEM ERROR: IllegalArgumentException: length: -123 (expected: >= 0)

Fragment 0:0

[Error Id: beba85c3-8c5b-4c05-9ae7-d12263811af4 on
cdh57-01-node-02.moffatt.me:31010] (state=,code=0)
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),2) from
 `/user/oracle/seq/pdb.soe.logon` limit 5;
Error: SYSTEM ERROR: IllegalArgumentException: length: -6 (expected: >= 0)

Fragment 0:0

[Error Id: b4f18223-2999-4388-9450-dc9683c543ec on
cdh57-01-node-02.moffatt.me:31010] (state=,code=0)


Should this work?

thanks.


On 30 August 2016 at 19:06, rahul challapalli <challapallirahul@gmail.com>
wrote:

> You should be able to use split_part function (I haven't tried it
> myself...but it is supported). With this function you can extract
> individual columns. Unfortunately I couldn't find the documentation for
> this function as well. But it should be similar to how other databases
> implement this function.
>
> Also as you have observed, split does not support delimiters with more than
> one character. You can raise a jira and mark it as documentation related.
>
> Rahul
>
>
>
> On Tue, Aug 30, 2016 at 8:58 AM, Robin Moffatt <
> robin.moffatt@rittmanmead.com> wrote:
>
> > Hi,
> >
> > Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
> > want to PIVOT, so instead of :
> >
> > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> > flatten(split(version,'.')) from sys.version;
> > +---------+
> > | EXPR$0  |
> > +---------+
> > | 1       |
> > | 7       |
> > | 0       |
> > +---------+
> >
> > I'd get something like:
> >
> > +---------+---------+---------+
> > | EXPR$0  | EXPR$1  | EXPR$2  |
> > +---------+---------+---------+
> > | 1       | 7       | 0       |
> > +---------+---------+---------+
> >
> > I'm guessing this isn't possible in Drill yet?
> >
> > Also, what would be be the syntax to enter the \x01 character in the
> SPLIT
> > function? Entered literally I get an error:
> >
> > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> > split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
> >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > Error: SYSTEM ERROR: IllegalArgumentException: Only single character
> > delimiters are supported for split()
> >
> > BTW I didn't realise SPLIT was supported, and it's not listed in
> > https://drill.apache.org/docs/string-manipulation/ or
> > https://drill.apache.org/search/?q=split -- is there somewhere I should
> > log
> > this kind of documentation issue?
> >
> > thanks, Robin.
> >
> >
> > On 30 August 2016 at 16:07, Zelaine Fong <zfong@maprtech.com> wrote:
> >
> > > If the column is delimited by some character, you can use the SPLIT()
> > > function to separate the value into an array of values.  You can then
> use
> > > the FLATTEN() function to separate the array of values into individual
> > > records.
> > >
> > > E.g., if your column has the value "a:b", where your delimiter is ":",
> > you
> > > would run the following query:
> > >
> > > 0: jdbc:drill:zk=local> select flatten(split(columns[0],':')) from
> > > `/tmp/foo.csv`;
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | a       |
> > > | b       |
> > > +---------+
> > > 2 rows selected (0.319 seconds)
> > >
> > > Is that what you had in mind?
> > >
> > > -- Zelaine
> > >
> > > On Tue, Aug 30, 2016 at 7:17 AM, Robin Moffatt <
> > > robin.moffatt@rittmanmead.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm trying to read a sequence file, in which the key is null and the
> > > value
> > > > holds multiple columns [1], delimited by \x01. In Hive I simply
> define
> > it
> > > > as :
> > > >
> > > > CREATE EXTERNAL TABLE foo (col1 string, col2 string, col3 timestamp)
> > > > ROW FORMAT DELIMITED
> > > > STORED as sequencefile
> > > > LOCATION '/user/oracle/foo/bar';
> > > >
> > > > In Drill I've got as far as
> > > >
> > > > SELECT CONVERT_FROM(binary_value, 'UTF8') from
> `/user/oracle/foo/bar`
> > > >
> > > > which yields the data but as a single column. I can cast it to
> > individual
> > > > columns but this is no use if the field positions change
> > > >
> > > > SELECT substr(CONVERT_FROM(binary_value, 'UTF8'),5,1) as
> > > > col0,substr(CONVERT_FROM(binary_value, 'UTF8'),7,13) as
> > > > col1,substr(CONVERT_FROM(binary_value, 'UTF8'),20,20) as col3 from
> > > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > > > +-------+----------------+-----------------------+
> > > > | col0  |      col1      |         col3          |
> > > > +-------+----------------+-----------------------+
> > > > | I     | PDB.SOE.LOGON  | 2016-07-29 13:36:40  |
> > > >
> > > >
> > > > Is there a way to treat a column as delimited and burst it out into
> > > > multiple columns? Presumably I could somehow dump the string contents
> > to
> > > > CSV and then re-read it - but I'm interested here in using Drill the
> > > query
> > > > existing data; wrangling it to suit Drill isn't really what I'm
> looking
> > > for
> > > > (and maybe Drill just isn't the right tool here?).
> > > >
> > > >
> > > > thanks,
> > > >
> > > > Robin.
> > > >
> > > > [1]
> > > > https://docs.oracle.com/goldengate/bd1221/gg-bd/GADBD/
> > > > GUID-85A82B2E-CD51-463A-8674-3D686C3C0EC0.htm#GADBD-GUID-
> > > > 4CAFC347-0F7D-49AB-B293-EFBCE95B66D6
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message