drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From rahul challapalli <challapallira...@gmail.com>
Subject Re: Querying Delimited Sequence file
Date Wed, 31 Aug 2016 17:40:57 GMT
I will try the split_part function myself to see if I can reproduce your
issue. And I couldn't see the query which references the output of split
function. Something like the below should work

SELECT
  d.columns_arr[0],
  d.columns_arr[1]FROM (SELECT
  split(CONVERT_FROM(binary_value, 'UTF8'), chr(1)) columns_arrFROM data) d;


On Wed, Aug 31, 2016 at 1:50 AM, Robin Moffatt <
robin.moffatt@rittmanmead.com> wrote:

> Thanks, SPLIT_PART looks useful.
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> split_part(version,'.',1),split_part(version,'.',2),
> split_part(version,'.',3)
> from sys.version;
> +---------+---------+---------+
> | EXPR$0  | EXPR$1  | EXPR$2  |
> +---------+---------+---------+
> | 1       | 7       | 0       |
> +---------+---------+---------+
> 1 row selected (0.351 seconds)
>
> But used with my actual data (sequence file), I get an error. I've
> successfully SPLIT it using CHR(1) for the \x01 delimiter:
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split(CONVERT_FROM(binary_value, 'UTF8'),chr(1)) from
>  `/user/oracle/seq/pdb.soe.logon` limit 1;
> +--------+
> | EXPR$0 |
> +--------+
> | ["\u0000\u0000\u0000|I","PDB.SOE.LOGON","2016-08-30
> 10:34:01.000145","2016-08-30T11:34:07.934000","00000000000001558898","","
> 338328","13645","2016-08-30:11:34:01"]
> |
> +--------+
>
> But if I now try to access one of those elements, it errors:
>
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),1) from
>  `/user/oracle/seq/pdb.soe.logon` limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: length: -123 (expected: >=
> 0)
>
> Fragment 0:0
>
> [Error Id: beba85c3-8c5b-4c05-9ae7-d12263811af4 on
> cdh57-01-node-02.moffatt.me:31010] (state=,code=0)
> 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> split_part(CONVERT_FROM(binary_value, 'UTF8'),chr(1),2) from
>  `/user/oracle/seq/pdb.soe.logon` limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: length: -6 (expected: >= 0)
>
> Fragment 0:0
>
> [Error Id: b4f18223-2999-4388-9450-dc9683c543ec on
> cdh57-01-node-02.moffatt.me:31010] (state=,code=0)
>
>
> Should this work?
>
> thanks.
>
>
> On 30 August 2016 at 19:06, rahul challapalli <challapallirahul@gmail.com>
> wrote:
>
> > You should be able to use split_part function (I haven't tried it
> > myself...but it is supported). With this function you can extract
> > individual columns. Unfortunately I couldn't find the documentation for
> > this function as well. But it should be similar to how other databases
> > implement this function.
> >
> > Also as you have observed, split does not support delimiters with more
> than
> > one character. You can raise a jira and mark it as documentation related.
> >
> > Rahul
> >
> >
> >
> > On Tue, Aug 30, 2016 at 8:58 AM, Robin Moffatt <
> > robin.moffatt@rittmanmead.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
> > > want to PIVOT, so instead of :
> > >
> > > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
> > > flatten(split(version,'.')) from sys.version;
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 1       |
> > > | 7       |
> > > | 0       |
> > > +---------+
> > >
> > > I'd get something like:
> > >
> > > +---------+---------+---------+
> > > | EXPR$0  | EXPR$1  | EXPR$2  |
> > > +---------+---------+---------+
> > > | 1       | 7       | 0       |
> > > +---------+---------+---------+
> > >
> > > I'm guessing this isn't possible in Drill yet?
> > >
> > > Also, what would be be the syntax to enter the \x01 character in the
> > SPLIT
> > > function? Entered literally I get an error:
> > >
> > > 0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
> > > split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
> > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Only single character
> > > delimiters are supported for split()
> > >
> > > BTW I didn't realise SPLIT was supported, and it's not listed in
> > > https://drill.apache.org/docs/string-manipulation/ or
> > > https://drill.apache.org/search/?q=split -- is there somewhere I
> should
> > > log
> > > this kind of documentation issue?
> > >
> > > thanks, Robin.
> > >
> > >
> > > On 30 August 2016 at 16:07, Zelaine Fong <zfong@maprtech.com> wrote:
> > >
> > > > If the column is delimited by some character, you can use the SPLIT()
> > > > function to separate the value into an array of values.  You can then
> > use
> > > > the FLATTEN() function to separate the array of values into
> individual
> > > > records.
> > > >
> > > > E.g., if your column has the value "a:b", where your delimiter is
> ":",
> > > you
> > > > would run the following query:
> > > >
> > > > 0: jdbc:drill:zk=local> select flatten(split(columns[0],':')) from
> > > > `/tmp/foo.csv`;
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | a       |
> > > > | b       |
> > > > +---------+
> > > > 2 rows selected (0.319 seconds)
> > > >
> > > > Is that what you had in mind?
> > > >
> > > > -- Zelaine
> > > >
> > > > On Tue, Aug 30, 2016 at 7:17 AM, Robin Moffatt <
> > > > robin.moffatt@rittmanmead.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm trying to read a sequence file, in which the key is null and
> the
> > > > value
> > > > > holds multiple columns [1], delimited by \x01. In Hive I simply
> > define
> > > it
> > > > > as :
> > > > >
> > > > > CREATE EXTERNAL TABLE foo (col1 string, col2 string, col3
> timestamp)
> > > > > ROW FORMAT DELIMITED
> > > > > STORED as sequencefile
> > > > > LOCATION '/user/oracle/foo/bar';
> > > > >
> > > > > In Drill I've got as far as
> > > > >
> > > > > SELECT CONVERT_FROM(binary_value, 'UTF8') from
> > `/user/oracle/foo/bar`
> > > > >
> > > > > which yields the data but as a single column. I can cast it to
> > > individual
> > > > > columns but this is no use if the field positions change
> > > > >
> > > > > SELECT substr(CONVERT_FROM(binary_value, 'UTF8'),5,1) as
> > > > > col0,substr(CONVERT_FROM(binary_value, 'UTF8'),7,13) as
> > > > > col1,substr(CONVERT_FROM(binary_value, 'UTF8'),20,20) as col3 from
> > > > >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > > > > +-------+----------------+-----------------------+
> > > > > | col0  |      col1      |         col3          |
> > > > > +-------+----------------+-----------------------+
> > > > > | I     | PDB.SOE.LOGON  | 2016-07-29 13:36:40  |
> > > > >
> > > > >
> > > > > Is there a way to treat a column as delimited and burst it out into
> > > > > multiple columns? Presumably I could somehow dump the string
> contents
> > > to
> > > > > CSV and then re-read it - but I'm interested here in using Drill
> the
> > > > query
> > > > > existing data; wrangling it to suit Drill isn't really what I'm
> > looking
> > > > for
> > > > > (and maybe Drill just isn't the right tool here?).
> > > > >
> > > > >
> > > > > thanks,
> > > > >
> > > > > Robin.
> > > > >
> > > > > [1]
> > > > > https://docs.oracle.com/goldengate/bd1221/gg-bd/GADBD/
> > > > > GUID-85A82B2E-CD51-463A-8674-3D686C3C0EC0.htm#GADBD-GUID-
> > > > > 4CAFC347-0F7D-49AB-B293-EFBCE95B66D6
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message