pig-user mailing list archives

From Mario Ferreira <lioux...@gmail.com>
Subject Re: Avro vs Parquet performance on Pig
Date Fri, 15 Feb 2019 23:06:48 GMT
I was under the impression that ORC files with Snappy compression would
prove to be better unless your processing was columnar in nature.

Isn't that the case?

On Thu, Feb 7, 2019, 21:54 Russell Jurney <russell.jurney@gmail.com> wrote:

> Sorry if this isn't helpful, but the other obvious thing is to store
> intermediate data in Parquet whenever you repeat code/data that can be
> shared between jobs, if tests indicate it is faster. Before Parquet this
> wasn't necessarily advantageous, as IO from disk is slower than IO through
> RAM, which is what the computation might use. Parquet opens opportunities
> here by competing better with repeated computation. You could compare the
> two to figure out how to optimize your scripts. Again, you're probably
> doing this :)
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Thu, Feb 7, 2019 at 3:29 PM Michael Doo <michael.doo@verve.com> wrote:
>
> > Indeed. When loading Parquet using
> > org.apache.parquet.pig.ParquetLoader(), we're specifying the schema for
> > which columns we want to load.
> >
> > On 2/7/19, 5:14 PM, "Russell Jurney" <russell.jurney@gmail.com> wrote:
> >
> >     Well, the obvious thing is to load only those columns you need. Just
> >     in case you’re not doing this.
> >
> >     On Thu, Feb 7, 2019 at 2:04 PM Michael Doo <michael.doo@verve.com> wrote:
> >
> >     > Hey all,
> >     > I’ve been migrating some processes over from ingesting Avro to
> >     > ingesting Parquet. In Spark, we’re seeing 2x-8x performance gains
> >     > when using Parquet over Avro. In Pig, similar processes are about
> >     > the same runtime between the two formats (and sometimes even higher
> >     > using Parquet). We’ve enabled dictionary filtering as well as
> >     > predicate filter/pushdown. Wondering if there are other settings /
> >     > strategies we might be missing to take advantage of Parquet.
> >     >
> >     > Thanks,
> >     > Michael
> >     >
> >     --
> >     Russell Jurney @rjurney <http://twitter.com/rjurney>
> >     russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> >     <http://facebook.com/jurney> datasyndrome.com
> >
> >
> >
>
