pig-user mailing list archives

From Rohini Palaniswamy <roh...@apache.org>
Subject Re: Avro vs Parquet performance on Pig
Date Mon, 11 Feb 2019 17:14:03 GMT
You might need https://issues.apache.org/jira/browse/PIG-4092

On Thu, Feb 7, 2019 at 3:54 PM Russell Jurney <russell.jurney@gmail.com> wrote:

> Sorry if this isn't helpful, but the other obvious thing is to store
> intermediate data in Parquet whenever code or data is repeated and can be
> shared between jobs, if tests indicate it is faster. Before Parquet this
> wasn't necessarily advantageous, since IO from disk is slower than IO
> through RAM, which recomputation might use. Parquet opens opportunities
> here by competing better with repeated computation. You could compare the
> two to figure out how to optimize your scripts. Again, you're probably
> doing this :)
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Thu, Feb 7, 2019 at 3:29 PM Michael Doo <michael.doo@verve.com> wrote:
>
> > Indeed. When loading Parquet using org.apache.parquet.pig.ParquetLoader(),
> > we're specifying the schema for which columns we want to load.
> >
> > On 2/7/19, 5:14 PM, "Russell Jurney" <russell.jurney@gmail.com> wrote:
> >
> >     Well, the obvious thing is to load only those columns you need. Just in
> >     case you’re not doing this.
> >
> >     On Thu, Feb 7, 2019 at 2:04 PM Michael Doo <michael.doo@verve.com> wrote:
> >
> >     > Hey all,
> >     > I’ve been migrating some processes over from ingesting Avro to ingesting
> >     > Parquet. In Spark, we’re seeing 2x-8x performance gains when using Parquet
> >     > over Avro. In Pig, similar processes are about the same runtime between the
> >     > two formats (and sometimes even higher using Parquet). We’ve enabled
> >     > dictionary filtering as well as predicate filter/pushdown. Wondering if
> >     > there are other settings / strategies we might be missing to take advantage
> >     > of Parquet.
> >     >
> >     > Thanks,
> >     > Michael
> >     >
> >     --
> >     Russell Jurney @rjurney <http://twitter.com/rjurney>
> >     russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> >     <http://facebook.com/jurney> datasyndrome.com
> >
> >
> >
>
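
The two suggestions in this thread (load only the columns you need, and store
shared intermediate data as Parquet) can be sketched in Pig Latin roughly as
follows. This is a minimal sketch, not the posters' actual script: the paths,
relation names, and column names are hypothetical, and it assumes the
parquet-pig classes are already on Pig's classpath (for example via a REGISTER
of the parquet-pig bundle jar). Whether the stored intermediate is faster than
recomputing it is something to measure, as suggested above.

```pig
-- Hypothetical input path and columns; replace with the real dataset.
-- The schema string asks ParquetLoader for just these columns,
-- so the remaining columns in the files are not read.
events = LOAD '/data/events_parquet'
         USING org.apache.parquet.pig.ParquetLoader(
             'user_id:chararray, event_type:chararray, ts:long');

-- Example of a computation that several downstream jobs might share.
clicks = FILTER events BY event_type == 'click';

-- Persist the shared intermediate result as Parquet (hypothetical output path)
-- so later scripts can load it instead of repeating the filter.
STORE clicks INTO '/tmp/shared/clicks_parquet'
      USING org.apache.parquet.pig.ParquetStorer();
```

A later script would then load '/tmp/shared/clicks_parquet' with ParquetLoader,
again naming only the columns it needs, and its runtime can be compared against
the version that recomputes the filter from the raw input.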
