hawq-dev mailing list archives

From "Ting(Goden) Yao" <t...@pivotal.io>
Subject Re: Support orc format
Date Wed, 22 Jun 2016 22:17:19 GMT
1) The framework was not designed by the HAWQ community; it came from Postgres.
2) The JIRA itself is titled "ORC as native format", which has nothing to
do with this framework.

We should not try to lump multiple features and ideas into one JIRA.


On Wed, Jun 22, 2016 at 12:28 AM Lei Chang <lei_chang@apache.org> wrote:

> On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <godenyao@apache.org> wrote:
>
> > This is not comparable as native vs. external.
> > The design doc attached in HAWQ-786
> > <https://issues.apache.org/jira/browse/HAWQ-786>, as some community
> > responses in the JIRA point out, mixes up an External Table data access
> > framework with file format support.
> >
> > If the JIRA is merely about using ORC as a native file format, as we see
> > its popularity in the Hadoop community and potentially want to replace
> > Parquet with ORC as the default for its benefits and advantages, this JIRA
> > should focus on the native file format part and how to integrate with the
> > C library from the Apache ORC project.
> >
>
>
> As described in the JIRA, the framework is designed as a general
> framework.
>
> It can also potentially be used for external data; there is an example
> showing the usage.
>
>
> >
> > To answer Roman's questions, I think we first need to understand the user
> > scenario with external tables (with ORC format), in which users:
> > 1) already have ORC files landed in HDFS (or stored as Hive tables)
> > 2) want to query from HAWQ, so they may get a performance gain from the
> > MPP architecture provided by HAWQ, instead of MR jobs.
> > 3) want to avoid data duplication, which means they don't want to load
> > data into HAWQ native format (so it doesn't matter what native format
> > HAWQ uses to store the table)
> >
> > Given that, I think it's worth a further discussion in the theme of
> > improving external data source access/query performance.
> >
> > Thanks
> > -Goden
> >
> >
> >
> > On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <lei_chang@apache.org> wrote:
> >
> > > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <roman@shaposhnik.org> wrote:
> > >
> > > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <mli@pivotal.io> wrote:
> > > > > Hi Guys,
> > > > >
> > > > > ORC (Optimized Row Columnar) is a very popular open source format
> > > > > adopted in some major components of the Hadoop ecosystem. It is also
> > > > > used by a lot of users. The advantages of supporting ORC storage in
> > > > > HAWQ are twofold: firstly, it makes HAWQ more Hadoop native, so it
> > > > > interacts with other components more easily; secondly, ORC stores
> > > > > some meta info for query optimization, so it might potentially
> > > > > outperform the two native formats (i.e., AO, Parquet) if it is
> > > > > available.
> > > > >
> > > > > Since there are lots of popular formats available in the HDFS
> > > > > community, and more advanced formats are emerging frequently, it is
> > > > > a good option for HAWQ to design a general framework that supports
> > > > > pluggable C/C++ formats such as ORC, as well as native formats such
> > > > > as AO and Parquet. In designing this framework, we also need to
> > > > > support data stored in different file systems: HDFS, local disk,
> > > > > Amazon S3, etc. Thus, it is better to offer a framework that
> > > > > supports both pluggable formats and pluggable file systems.
> > > > >
> > > > > We are proposing ORC support in JIRA (
> > > > > https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > > > design spec in the JIRA.
> > > > >
> > > > > Your comments are appreciated!
> > > >
> > > > This sounds reasonable, but I'd like to understand the trade-offs
> > > > between supporting something like ORC in PXF vs. implementing it
> > > > natively in C/C++.
> > > >
> > > > Is there any hard performance/etc. data that you could share to
> > > > illuminate the trade-offs between these two approaches?
> > > >
> > >
> > > Implementing it natively in C/C++ will get at least comparable
> > > performance to the current native AO and Parquet formats.
> > >
> > > And we know that AO and Parquet are faster than PXF, so we are expecting
> > > better performance here.
> > >
> > > Cheers
> > > Lei
> > >
> >
>
