parquet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Armstrong <tarmstr...@cloudera.com.INVALID>
Subject Re: Column index testing break down
Date Thu, 07 Mar 2019 17:51:52 GMT
I think you and I have different priors on this, Wes. It's definitely not
clear-cut. I think it's an interesting point to discuss and it's
unfortunate that you feel that way.

Partially the current state of things is due to path-dependence, but there
are some parts of the Impala runtime that it's important for us to
integrate with, e.g. I/O, codegen, memory management, error handling, query
cancellation, etc. It's hard to integrate with those if we have an external
library driving the Parquet scanning.

I agree if we think parquet-cpp is completely mutable and we can make
whatever tweaks and add whatever hooks or interfaces to it that we need to
get the behaviour we want, then we could get most of the advantages of
tight coupling. But that might end up just being tight coupling with a
library boundary in the middle, which is worse in many way. I'm also not
sure that parquet-cpp is that mutable - at some point I think there would
be resistance to adding hooks that nobody else wants or refactoring things
in a way that worked for us. We may also prefer different trade-offs in
complexity vs performance to the rest of the community.

There's also the question of whether the hoop-jumping required to do
optimisations across library and project boundaries is lower effort than
implementing things in the simplest way possible in one place but paying
the cost of duplication. I can see why someone might come down on the other
side on this point.

For most open source libraries we use, we'd accept the trade-off of less
control for the advantages of having a community around the library. But
for stuff that's performance-critical, we're generally willing to invest
more time into maintaining it.

Interestingly we have a bit of a natural experiment going with the ORC
scanner implementation which took the opposite approach. There's a huge gap
in performance and the memory management is pretty bad (the Parquet
implementation can do scans with a fixed memory budget, etc). Some of that
is just an effort thing but once you dig into it there are some deeper
issues that require the kind of hooks that I was talking about. I did a
presentation recently that enumerated some of the challenges, might as well
share:
https://docs.google.com/presentation/d/1o-vAm933Q1WvApzSnYK0bpIhmJ0w1-9xdaKF2wu-7FQ/edit?usp=sharing

Anyway, sorry for the diatribe, but I do find it an interesting discussion
and am quite opinionated about it.

- Tim

On Thu, Mar 7, 2019 at 8:04 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> It makes me very sad that Impala has this bespoke Parquet
> implementation. I never really understood the benefit of doing the
> work there rather than in Apache Parquet. I never found the arguments
> "why" I've heard over the years (that the implementation needed to be
> tightly coupled to the Impala runtime) to be persuasive. At this point
> it is probably too late to do anything about it
>
> Thanks
>
> On Thu, Mar 7, 2019 at 9:58 AM Anna Szonyi <szonyi@cloudera.com.invalid>
> wrote:
> >
> > Hi Wes,
> >
> > Zoltan has created a C++ implementation for Impala. We would be happy to
> > contribute it to Parquet cpp when we have time or if someone is keen on
> > getting it in sooner and wants to take it over, we would be happy to
> review
> > it.
> > Feel free to check it out and chime in to the review for the Impala
> > implementation: https://gerrit.cloudera.org/#/c/12065/.
> >
> > Best,
> > Anna
> >
> > On Wed, Mar 6, 2019 at 4:17 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> >
> > > Is there anyone who might be able to take on the project of
> > > implementing this in C++? We're having an increasing number of C++
> > > Parquet users nowadays.
> > >
> > > On Tue, Mar 5, 2019 at 9:54 AM Anna Szonyi <szonyi@cloudera.com.invalid
> >
> > > wrote:
> > > >
> > > > Hi dev@ community,
> > > >
> > > > This week I would like to ask for some feedback on the testing we've
> been
> > > > sending out.
> > > > We've been sharing the most important test cases we've created for
> the
> > > > write path of the parquet column index feature, now we would like to
> hear
> > > > from you!
> > > >
> > > > Is there anything else you feel is missing or would like to get
> clarity
> > > on?
> > > >
> > > > Thanks,
> > > > Anna
> > > >
> > > > On Mon, Feb 25, 2019 at 6:26 PM Anna Szonyi <szonyi@cloudera.com>
> wrote:
> > > >
> > > > > Hi dev@,
> > > > >
> > > > > After a week off, this week we have an excerpt from our internal
> data
> > > > > interoperability testing, which tests compatibility between Hive,
> > > Spark and
> > > > > Impala over Avro and Parquet. This test case is tailor-made to test
> > > > > specific layouts so that files written using parquet-mr can be
> read by
> > > any
> > > > > of the above mentioned components. We have also checked fault
> injection
> > > > > cases.
> > > > >
> > > > > The test suite is private currently, however we have made the test
> > > classes
> > > > > corresponding to the following document public:
> > > > >
> > >
> https://docs.google.com/document/d/1mHYQGXE4oM1zgg83MMc4ho1gmoJMeZcq9MWG99WgL3A
> > > > >
> > > > > Please find the test cases and their results here:
> > > > >
> https://github.com/zivanfi/column-indexes-data-interop-tests-excerpts
> > > > >
> > > > > Best,
> > > > > Anna
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Feb 11, 2019 at 4:57 PM Anna Szonyi <szonyi@cloudera.com>
> > > wrote:
> > > > >
> > > > >> Hi dev@,
> > > > >>
> > > > >> Last week we had a twofer: e2e tool and integration test
> validating
> > > the
> > > > >> contract of column indexes/indices (if all values are between
min
> and
> > > max
> > > > >> and if set whether the boundary order is correct). There are
some
> > > takeaways
> > > > >> and corrections to be made to the former (like the max->min
typo)
> -
> > > thanks
> > > > >> for the feedback on that!
> > > > >>
> > > > >> The next installment is also an integration test that tests the
> > > filtering
> > > > >> logic on files including simple and special cases (user defined
> > > function,
> > > > >> complex filtering, no filtering, etc.).
> > > > >>
> > > > >>
> > > > >>
> > >
> https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnIndexFiltering.java
> > > > >>
> > > > >> Please let me know if you have any questions/comments.
> > > > >>
> > > > >> Best,
> > > > >> Anna
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message