drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Array in Drill
Date Wed, 09 Jan 2019 19:30:23 GMT
Hi Ben,

Drill is a SQL relational engine that was enhanced to support extended JSON types, including
arrays. The need to use only SQL to work with array imposes some major constraints.

As you noted, there is no way to project specific array items while leaving them in an array.
When you project array elements, as in your example, you create new, top-level scalar columns.

SELECT columns[0] AS `name`, columns[1] AS `address` ...

Is a common use case: it not only projects certain array entries, it projects them as new,
named, top-level columns. There is no way to say: "give me a new array with just elements
0 and 1 from the old array." Note that this problem also occurs for maps. 

Now, is this a problem? Not really. The primary clients for Drill are JDBC and OBDC, neither
of which can handle arrays. (The native Drill client, which almost no one uses, can handle
arrays and maps.) So, somewhere in the query, maps and arrays must be "flattened" to scalar
columns. So, it is actually handy that the projection shown above both identifies the array
items you want, AND puts them into a form that ODBC/JDBC can use.

Your example, by the way, is specific to CSV (really, any delimited text file) that does not
have headers. It is natural to want to project those columns into something named. SQL is
much easier to read with "`name` = 'Bob'" than "columns[0] = 'Bob'". As you noted, Parquet
also supports arrays as does JSON.

As Charles pointed out, there are a number of array-aware functions available. If you can't
find what you need, recall that Drill is open: you can write your own UDF. Then, you can do
as Charles and others have done: contribute them to Drill to enrich the set of operations
available to others. We Drill core developers are good at the internals, but you, Charles
and others better understand the kind of operations you want to apply to your data. We need
your help to build out the set of functions to solve your needs.

All that said, what are you trying to accomplish? Perhaps we can help you solve your specific

- Paul


    On Wednesday, January 9, 2019, 7:07:05 AM PST, benj.dev@laposte.net.INVALID <benj.dev@laposte.net.INVALID>

Although it's possible to work with ARRAY in DRILL :
SELECT columns[0], columns[1]... FROM ...;

I don't find any array function to know length or dimension of array or any other useful fct

I don't find any possibility to "build" my array with different data like :
SELECT ARRAY[data1, data2, data3] FROM ...;

Aggregate function that build array doesn't exists
Example : SELECT mycolumn1, array_agg(mycolumn2) FROM mytable GROUP BY mycolomn1;

Or would I miss something ?

I know that array type doesn't exists for Parquet files but these capacities will be very
and the limitation of Parquet should not limit the possibilities of DRILL.

Appreciate any info or return on this subject ?

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message