drill-user mailing list archives

From François Méthot <fmetho...@gmail.com>
Subject Re: Reading Parquet files with array or list columns
Date Fri, 30 Jun 2017 17:06:44 GMT
Hi,

Have you tried:
   select column['list'][0]['element'] from ...
       which should return "My first value".

or try:
    select flatten(column['list'])['element'] from ...
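
To make the difference between the two concrete, here is a rough model in plain Python (not Drill SQL) of what each expression is meant to compute on the sample row from the quoted message below. The variable names are made up for illustration, and this only mimics the shape of the result, not how Drill evaluates the query.

```python
# One row of the column as shown in the quoted message (sample data only).
row = {"list": [{"element": "My first value"}, {"element": "My second value"}]}

# Analogue of column['list'][0]['element']: pick a single value by position.
first = row["list"][0]["element"]

# Analogue of flatten(column['list'])['element']: one output row per list entry.
flattened = [entry["element"] for entry in row["list"]]

print(first)
print(flattened)
```

The first form gives one value per input row; the second gives one row per list entry, which is why it gets expensive on long lists.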

Hope it helps. In our data we have a column that looks like this:
[{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
"DATA":"thedata2"},.....]

We ended up writing a custom function to do the lookup instead of using
the costly flatten technique.
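
The logic of such a lookup is roughly the following. This is a minimal Python sketch of the idea only, not the function we actually wrote: Drill custom functions are written in Java, and the name lookup_data and its arguments are hypothetical.

```python
def lookup_data(entries, name):
    """Return the DATA of the first entry whose "NAME:" key matches, else None.

    Hypothetical helper: scans the list once instead of flattening it
    into one row per entry.
    """
    for entry in entries:
        if entry.get("NAME:") == name:
            return entry.get("DATA")
    return None


# Sample column value, using the same keys as the example above.
column = [
    {"NAME:": "Aname", "DATA": "thedata"},
    {"NAME:": "Aname2", "DATA": "thedata2"},
]

print(lookup_data(column, "Aname2"))
```

Each input row stays a single row, so nothing downstream has to regroup the flattened output.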

Francois



On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid <kincaid.dave@gmail.com>
wrote:

> I'm having a problem querying Parquet files that were created from Spark
> and have columns that are array or list types. When I do a SELECT on these
> columns they show up like this:
>
> {"list": [{"element": "My first value"}, {"element": "My second value"}]}
>
> which Drill does not recognize as a REPEATED column and is not really
> workable to hack around like I did in DRILL-5183 (
> https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
> using something like t.columnName.`list`.`element` but that's not really
> feasible to use in a query.
>
> The little I could find on this by Googling around led me to this document
> on the Parquet format Github page -
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. This
> seems to say that Spark is writing these files correctly, but Drill is not
> interpreting them properly.
>
> Is there a workaround that anyone can help me to turn these columns into
> values that Drill understands as repeated values? This is a fairly urgent
> issue for us.
>
> Thanks,
>
> Dave
>
