drill-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Reading Parquet files with array or list columns
Date Sun, 18 Jun 2017 02:04:51 GMT
I'm having a problem querying Parquet files that were written by Spark
and have columns of array or list types. When I SELECT these columns,
the values show up like this:

{"list": [{"element": "My first value"}, {"element": "My second value"}]}

Drill does not recognize this as a REPEATED column, and it is not really
workable to hack around it the way I did in DRILL-5183 (
https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
using something like t.columnName.`list`.`element`, but that's not really
feasible to use in a query.
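
To make the shape of the problem concrete, here is a minimal Python sketch
of the unwrapping I would like to happen. It is not Drill code and says
nothing about how Drill works internally; the function name
unwrap_parquet_list is made up for illustration, and the input is just the
JSON rendering shown above.

```python
import json
from typing import Any, List


def unwrap_parquet_list(value: dict) -> List[Any]:
    """Flatten {"list": [{"element": x}, ...]} into [x, ...].

    Hypothetical helper: it only mirrors the wrapper structure seen in
    the query output above, nothing more.
    """
    return [entry["element"] for entry in value["list"]]


# The value exactly as it appears in the SELECT output.
raw = '{"list": [{"element": "My first value"}, {"element": "My second value"}]}'
flat = unwrap_parquet_list(json.loads(raw))
print(flat)  # ['My first value', 'My second value']
```

The flat list on the last line is what I would expect a REPEATED column to
look like.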

The little I could find on this by Googling around led me to this document
in the Parquet format GitHub repository:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. It
seems to say that Spark is writing these files correctly, but Drill is not
interpreting them properly.

Can anyone suggest a workaround for turning these columns into values that
Drill understands as repeated values? This is a fairly urgent issue for
us.


