hawq-dev mailing list archives

From Lei Chang <lei_ch...@apache.org>
Subject Re: Support orc format
Date Wed, 22 Jun 2016 07:28:43 GMT
On Wed, Jun 22, 2016 at 1:39 AM, Goden Yao <godenyao@apache.org> wrote:

> This is not comparable as native vs. external.
> The design doc attached to HAWQ-786
> <https://issues.apache.org/jira/browse/HAWQ-786>, as some community
> responses in the JIRA point out, mixes up an External Table data access
> framework with file format support.
>
> If the JIRA is merely about using ORC as a native file format, given its
> popularity in the Hadoop community and the potential to replace Parquet
> as the default for its benefits and advantages, then this JIRA should
> focus on the native file format part and on how to integrate the C/C++
> library from the Apache ORC project.
>


As described in the JIRA, the framework is designed as a general
framework.

It can also potentially be used for external data; there is an example in
the design doc showing that usage.
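
To make that concrete, here is a minimal sketch of what such a pluggable
interface could look like in C++. The class and method names below are
hypothetical, for illustration only; the actual interfaces are in the
design spec attached to HAWQ-786.

    // Hypothetical sketch -- not the actual classes from the design doc.
    #include <memory>
    #include <string>

    class InputStream;   // abstract byte stream, defined elsewhere
    struct TupleBatch;   // a batch of rows handed to the executor

    // Pluggable file system: the same reader code can run against
    // HDFS, local disk, Amazon S3, etc.
    class FileSystem {
    public:
        virtual ~FileSystem() = default;
        virtual std::unique_ptr<InputStream> Open(const std::string& path) = 0;
    };

    // Pluggable format: one implementation per on-disk layout, e.g.
    // ORC (via the Apache ORC C++ library), AO, Parquet.
    class FormatReader {
    public:
        virtual ~FormatReader() = default;
        virtual void BeginScan(FileSystem& fs, const std::string& path) = 0;
        virtual bool NextBatch(TupleBatch* out) = 0;  // false at end of data
        virtual void EndScan() = 0;
    };

In a design like this, a native table and an external ORC file differ only
in which FileSystem implementation and path the scan is given, which is
the sense in which the framework is general.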


>
> To answer Roman's questions, I think we first need to understand the
> user scenario with external tables (with ORC format), in which users:
> 1) already have ORC files landed in HDFS (or stored as Hive tables);
> 2) want to query them from HAWQ, so they may get a performance gain from
> HAWQ's MPP architecture instead of MR jobs;
> 3) want to avoid data duplication, meaning they don't want to load the
> data into a HAWQ native format (so it doesn't matter which native format
> HAWQ uses to store the table).
>
> Given that, I think it's worth further discussion under the theme of
> improving external data source access/query performance.
>
> Thanks
> -Goden
>
>
>
> On Mon, Jun 20, 2016 at 5:55 PM Lei Chang <lei_chang@apache.org> wrote:
>
> > On Tue, Jun 21, 2016 at 8:38 AM, Roman Shaposhnik <roman@shaposhnik.org>
> > wrote:
> >
> > > On Fri, Jun 17, 2016 at 3:02 AM, Ming Li <mli@pivotal.io> wrote:
> > > > Hi Guys,
> > > >
> > > > ORC (Optimized Row Columnar) is a very popular open source format
> > > > adopted by some major components in the Hadoop ecosystem, and it
> > > > is used by a lot of users. The advantages of supporting ORC
> > > > storage in HAWQ are twofold: first, it makes HAWQ more
> > > > Hadoop-native, so it interacts with other components more easily;
> > > > second, ORC stores metadata useful for query optimization, so,
> > > > once available, it could potentially outperform the two current
> > > > native formats (i.e., AO and Parquet).
> > > >
> > > > Since there are many popular formats in the HDFS community, and
> > > > more advanced formats are emerging frequently, it is a good option
> > > > for HAWQ to design a general framework that supports pluggable
> > > > C/C++ formats such as ORC alongside the native formats such as AO
> > > > and Parquet. In designing this framework, we also need to support
> > > > data stored in different file systems: HDFS, local disk, Amazon
> > > > S3, etc. Thus, it is better to offer a framework that supports
> > > > both pluggable formats and pluggable file systems.
> > > >
> > > > We are proposing ORC support in JIRA (
> > > > https://issues.apache.org/jira/browse/HAWQ-786). Please see the
> > > > design spec in the JIRA.
> > > >
> > > > Your comments are appreciated!
> > >
> > > This sounds reasonable, but I'd like to understand the trade-offs
> > > between supporting something like ORC in PXF vs. implementing it
> > > natively in C/C++.
> > >
> > > Is there any hard performance (or similar) data that you could
> > > share to illuminate the trade-offs between these two approaches?
> > >
> >
> > Implementing it natively in C/C++ should give at least comparable
> > performance to the current native AO and Parquet formats.
> >
> > And we know that AO and Parquet are faster than PXF, so we are
> > expecting better performance here.
> >
> > Cheers
> > Lei
> >
>
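
For reference, on Roman's question about the native C/C++ route: reading
an ORC file through the Apache ORC C++ library looks roughly like the
sketch below (using the library's current API; the file path is a
placeholder and error handling is omitted). A real integration would plug
an HDFS or S3 stream into orc::createReader instead of orc::readLocalFile.

    #include <orc/OrcFile.hh>
    #include <memory>

    int main() {
        // Open a local ORC file; readLocalFile returns an orc::InputStream.
        orc::ReaderOptions readerOpts;
        std::unique_ptr<orc::Reader> reader = orc::createReader(
            orc::readLocalFile("/tmp/example.orc"), readerOpts);

        // Iterate over the rows in vectorized batches of 1024.
        orc::RowReaderOptions rowReaderOpts;
        std::unique_ptr<orc::RowReader> rowReader =
            reader->createRowReader(rowReaderOpts);
        std::unique_ptr<orc::ColumnVectorBatch> batch =
            rowReader->createRowBatch(1024);
        while (rowReader->next(*batch)) {
            // batch->numElements rows are available here; a format reader
            // would convert them into the executor's tuple representation.
        }
        return 0;
    }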
