drill-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Re: Reading Parquet files with array or list columns
Date Fri, 30 Jun 2017 17:41:07 GMT
As far as I was able to discern, it is not possible to actually use this
column as an array in Drill at all. It just does not read the Parquet
correctly. I filed a very similar defect in Jira back in January, and it
has had no attention at all, so we are moving on to other tools. I
understand Drill is free and no one developing it owes me anything. It's
just not going to work for us without proper support for nested objects in
Parquet format.

Thanks for the reply though. It's much appreciated to have some
acknowledgment that I raised a valid issue.

- Dave

On Fri, Jun 30, 2017 at 12:06 PM, François Méthot <fmethot78@gmail.com>
wrote:

> Hi,
>
> Have you tried:
>    select column['list'][0]['element'] from ...
>        should return "My first value".
>
> or try:
>     select flatten(column['list'])['element'] from ...
>
> Hope it helps, in our data we have a column that looks like this:
> [{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
> "DATA":"thedata2"},.....]
>
> We ended up writing a custom function to do the lookup instead of using
> the costly flatten technique.
>
> Francois
>
>
>
> On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid <kincaid.dave@gmail.com>
> wrote:
>
> > I'm having a problem querying Parquet files that were created from Spark
> > and have columns that are array or list types. When I do a SELECT on these
> > columns they show up like this:
> >
> > {"list": [{"element": "My first value"}, {"element": "My second value"}]}
> >
> > which Drill does not recognize as a REPEATED column and is not really
> > workable to hack around like I did in DRILL-5183
> > (https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
> > using something like t.columnName.`list`.`element` but that's not really
> > feasible to use in a query.
> >
> > The little I could find on this by Googling around led me to this document
> > on the Parquet format Github page -
> > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. This
> > seems to say that Spark is writing these files correctly, but Drill is not
> > interpreting them properly.
> >
> > Is there a workaround that anyone can help me to turn these columns into
> > values that Drill understands as repeated values? This is a fairly urgent
> > issue for us.
> >
> > Thanks,
> >
> > Dave
> >
>

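For readers following the thread, the nested shape under discussion can be sketched outside Drill. The snippet below is plain Python, not Drill code, and it does not reproduce anything Drill does internally. It only shows how a value displayed as {"list": [{"element": ...}]} corresponds to the flat list of values the original poster wanted. The function name unwrap_list_column and the sample record are hypothetical, chosen for illustration.

```python
# Hypothetical record, shaped like the value shown in the original question.
record = {
    "list": [
        {"element": "My first value"},
        {"element": "My second value"},
    ]
}


def unwrap_list_column(value):
    """Return the plain list of elements from a {"list": [{"element": x}, ...]} wrapper.

    A missing column (None) is passed through as None; a wrapper with no
    "list" key is treated as an empty list.
    """
    if value is None:
        return None
    return [item.get("element") for item in value.get("list", [])]


flat = unwrap_list_column(record)
print(flat)
```

This only illustrates the wrapper layout that the LogicalTypes.md document describes for lists; whether a given Drill version can expose the column as a repeated value is the open question in the thread.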