drill-user mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Reading Parquet files with array or list columns
Date Fri, 30 Jun 2017 18:50:53 GMT
Hmm... I don't see a simple workaround for the second case either. Can you
also file a Jira for the CTAS case? Drill could have been running short on
heap memory.

- Rahul

On Fri, Jun 30, 2017 at 11:46 AM, David Kincaid <kincaid.dave@gmail.com>
wrote:

> The view only works for the first example in the Jira I created. That was
> the workaround we have been using since January.
>
> Recently we've had a use case where we are running a Spark script to
> pre-join some data before we try to use it in Drill. That was the subject
> of the initial e-mail in this thread and the topic of the comment I made in
> the Jira on 6/17. As far as I've been able to tell there isn't a similar
> workaround for this case that will make the column appear as an array.
>
> Note, I tried to use Drill to do that pre-join of the Parquet data using
> CTAS, but it ran for about 4 hours then crashed. The Spark script to do it
> runs in 14 minutes successfully.
>
> - Dave
>
> On Fri, Jun 30, 2017 at 1:38 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Like I suggested in the comment for DRILL-5183, can you try using a view as
> > a workaround until the issue gets resolved?
> >
> > On Fri, Jun 30, 2017 at 10:41 AM, David Kincaid <kincaid.dave@gmail.com>
> > wrote:
> >
> > > As far as I was able to discern it is not possible to actually use this
> > > column as an array in Drill at all. It just does not correctly read the
> > > Parquet. I created a very similar defect in Jira back in January that
> > > has had no attention at all. So we are moving on to other tools. I
> > > understand Drill is free and no one developing it owes me anything. It's
> > > just not going to work for us without proper support for nested objects in
> > > Parquet format.
> > >
> > > Thanks for the reply though. It's much appreciated to have some
> > > acknowledgment that I raised a valid issue.
> > >
> > > - Dave
> > >
> > > On Fri, Jun 30, 2017 at 12:06 PM, François Méthot <fmethot78@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Have you tried:
> > > >    select column['list'][0]['element'] from ...
> > > >        should return "My first value".
> > > >
> > > > or try:
> > > >     select flatten(column['list'])['element'] from ...
> > > >
> > > > Hope it helps, in our data we have a column that looks like this:
> > > > [{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
> > > > "DATA":"thedata2"},.....]
> > > >
> > > > We ended up writing a custom function to do the lookup instead of using
> > > > the costly flatten technique.
> > > >
> > > > Francois
> > > >
> > > >
> > > >
> > > > On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid <kincaid.dave@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm having a problem querying Parquet files that were created from Spark
> > > > > and have columns that are array or list types. When I do a SELECT on these
> > > > > columns they show up like this:
> > > > >
> > > > > {"list": [{"element": "My first value"}, {"element": "My second value"}]}
> > > > >
> > > > > which Drill does not recognize as a REPEATED column and is not really
> > > > > workable to hack around like I did in DRILL-5183
> > > > > (https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
> > > > > using something like t.columnName.`list`.`element` but that's not really
> > > > > feasible to use in a query.
> > > > >
> > > > > The little I could find on this by Googling around led me to this document
> > > > > on the Parquet format GitHub page:
> > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> > > > > This seems to say that Spark is writing these files correctly, but Drill is
> > > > > not interpreting them properly.
> > > > >
> > > > > Is there a workaround that anyone can help me to turn these columns into
> > > > > values that Drill understands as repeated values? This is a fairly urgent
> > > > > issue for us.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dave
> > > > >
> > > >
> > >
> >
>
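
For readers following the thread: the value shown in the original message, {"list": [{"element": ...}, ...]}, has the shape of the three-level LIST layout described in the Parquet LogicalTypes document (an outer group holding a repeated "list" group, each with one "element" field). The sketch below is plain Python, not Drill or Spark code, and only illustrates what treating such a value as a simple array would mean. The function name unwrap_list_column is hypothetical, and the sample row is copied from the message above.

```python
from typing import Any, Dict, List, Optional


def unwrap_list_column(value: Optional[Dict[str, Any]]) -> Optional[List[Any]]:
    """Turn {"list": [{"element": v}, ...]} into [v, ...].

    Hypothetical helper for illustration only; it is not part of Drill.
    A null column value is passed through as None.
    """
    if value is None:
        return None
    # Each entry of the repeated "list" group wraps one value under "element".
    return [entry["element"] for entry in value["list"]]


# Sample value as reported in the thread.
row = {"list": [{"element": "My first value"}, {"element": "My second value"}]}

# Index 0 corresponds to column['list'][0]['element'] in the suggested query.
values = unwrap_list_column(row)
print(values[0])
```

This is the same navigation that the column['list'][0]['element'] suggestion performs for a single index; the helper just applies it to every entry.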
