parquet-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Column index testing break down
Date Thu, 07 Mar 2019 16:03:57 GMT
It makes me very sad that Impala has this bespoke Parquet
implementation. I never really understood the benefit of doing the
work there rather than in Apache Parquet. I never found the arguments
"why" I've heard over the years (that the implementation needed to be
tightly coupled to the Impala runtime) to be persuasive. At this point
it is probably too late to do anything about it.

Thanks

On Thu, Mar 7, 2019 at 9:58 AM Anna Szonyi <szonyi@cloudera.com.invalid> wrote:
>
> Hi Wes,
>
> Zoltan has created a C++ implementation for Impala. We would be happy to
> contribute it to Parquet cpp when we have time or if someone is keen on
> getting it in sooner and wants to take it over, we would be happy to review
> it.
> Feel free to check it out and chime in to the review for the Impala
> implementation: https://gerrit.cloudera.org/#/c/12065/.
>
> Best,
> Anna
>
> On Wed, Mar 6, 2019 at 4:17 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > Is there anyone who might be able to take on the project of
> > implementing this in C++? We're having an increasing number of C++
> > Parquet users nowadays.
> >
> > On Tue, Mar 5, 2019 at 9:54 AM Anna Szonyi <szonyi@cloudera.com.invalid> wrote:
> > >
> > > Hi dev@ community,
> > >
> > > This week I would like to ask for some feedback on the testing we've been
> > > sending out.
> > > We've been sharing the most important test cases we've created for the
> > > write path of the parquet column index feature; now we would like to hear
> > > from you!
> > >
> > > Is there anything else you feel is missing or would like to get clarity on?
> > >
> > > Thanks,
> > > Anna
> > >
> > > On Mon, Feb 25, 2019 at 6:26 PM Anna Szonyi <szonyi@cloudera.com> wrote:
> > >
> > > > Hi dev@,
> > > >
> > > > After a week off, this week we have an excerpt from our internal data
> > > > interoperability testing, which tests compatibility between Hive, Spark and
> > > > Impala over Avro and Parquet. This test case is tailor-made to test
> > > > specific layouts so that files written using parquet-mr can be read by any
> > > > of the above mentioned components. We have also checked fault injection
> > > > cases.
> > > >
> > > > The test suite is private currently, however we have made the test classes
> > > > corresponding to the following document public:
> > > > https://docs.google.com/document/d/1mHYQGXE4oM1zgg83MMc4ho1gmoJMeZcq9MWG99WgL3A
> > > >
> > > > Please find the test cases and their results here:
> > > > https://github.com/zivanfi/column-indexes-data-interop-tests-excerpts
> > > >
> > > > Best,
> > > > Anna
> > > >
> > > >
> > > >
> > > > On Mon, Feb 11, 2019 at 4:57 PM Anna Szonyi <szonyi@cloudera.com> wrote:
> > > >
> > > >> Hi dev@,
> > > >>
> > > >> Last week we had a twofer: e2e tool and integration test validating the
> > > >> contract of column indexes/indices (if all values are between min and max
> > > >> and if set whether the boundary order is correct). There are some takeaways
> > > >> and corrections to be made to the former (like the max->min typo) - thanks
> > > >> for the feedback on that!
> > > >>
> > > >> The next installment is also an integration test that tests the filtering
> > > >> logic on files including simple and special cases (user defined function,
> > > >> complex filtering, no filtering, etc.).
> > > >>
> > > >> https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnIndexFiltering.java
> > > >>
> > > >> Please let me know if you have any questions/comments.
> > > >>
> > > >> Best,
> > > >> Anna
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> >
