spark-user mailing list archives

From Christopher Nguyen <...@adatao.com>
Subject Re: Please Help: Amplab Benchmark Performance
Date Thu, 30 Jan 2014 07:22:33 GMT
Hi Chen, it's certainly correct to say it is hard to make an apples-to-apples
comparison, in the sense that you can't assume there is an
implementation-equivalent, in "Spark only", of any given Shark query.

That said, I think the results of your comparisons could still be a
valuable reference. There are scenarios where someone wants to consider
the trade-offs between implementing some ETL operation with Shark or with
Spark only, and some sense of the performance/cost difference would be
helpful in making that decision.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Wed, Jan 29, 2014 at 11:10 PM, Chen Jin <karen.cj@gmail.com> wrote:

> Hi Christopher,
>
> Thanks a lot for taking the time to explain some details under Shark's
> hood. It is probably very hard to make an apples-to-apples comparison
> between Shark and Spark since they might be suitable for different
> types of tasks. From what you have explained, is it OK to think that
> Shark is better suited for SQL-like tasks, while Spark is more for
> iterative machine learning algorithms?
>
> Cheers,
>
> -chen
>
> On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen <ctn@adatao.com>
> wrote:
> > Chen, interesting comparisons you're trying to make. It would be great to
> > share this somewhere when you're done.
> >
> > Some suggestions of non-obvious things to consider:
> >
> > In general there are any number of differences between Shark and some
> > "equivalent" Spark implementation of the same query.
> >
> > Shark isn't necessarily what we might think of as "let's see which lines of
> > code accomplish the same thing in Spark". Its current implementation is
> > based on Hive, which has its own query planning, optimization, and execution.
> > Shark's code also has some tricks of its own. You can use "EXPLAIN" to see
> > Shark's execution plan and compare it to your Spark approach.
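
To make that comparison concrete, something like the following could be
run from the Shark shell -- a rough sketch only, assuming a SharkContext
`sc` with a `sql` method that returns the result lines (the exact method
name, and the benchmark's table and column names, may differ in your
setup):

    // Ask Shark (via Hive) for the execution plan of the aggregation query.
    // The query shape follows the AMPLab benchmark's aggregation query.
    sc.sql("""EXPLAIN
      SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
      FROM uservisits
      GROUP BY SUBSTR(sourceIP, 1, 8)""").foreach(println)

Comparing that plan against the stages of your hand-written Spark job can
show where the two approaches diverge.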
> >
> > Further, Shark has its own in-memory storage format, e.g., a
> > typed-column-oriented RDD[TablePartition], that can make it more
> > memory-efficient and help execute many column-aggregation queries a lot
> > faster than the row-oriented RDD[Array[String]] you may be using.
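
As a toy illustration of why the columnar layout matters (this is not
Shark's actual TablePartition code, just a sketch of the idea in
spark-shell):

    // Row-oriented: each record is an Array[String], so even a single-column
    // SUM pays the object/GC cost of every column in every row.
    val rows = sc.parallelize(Seq(
      Array("10.0.0.1", "a.com", "0.5"),
      Array("10.0.0.2", "b.com", "1.5"),
      Array("10.0.1.1", "c.com", "2.0")), 2)
    val sumRowOriented = rows.map(_(2).toDouble).sum()

    // Column-oriented: pack each partition's adRevenue values into a single
    // primitive Array[Double] (roughly what an RDD[TablePartition] layout
    // achieves), then aggregate over the packed arrays.
    case class Columns(adRevenue: Array[Double])
    val columnar = rows.mapPartitions(it =>
      Iterator(Columns(it.map(_(2).toDouble).toArray)))
    columnar.cache()
    val sumColumnar = columnar.map(_.adRevenue.sum).sum()

On a cached dataset of realistic size, the columnar form is both smaller
in memory and faster to scan for column aggregations.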
> >
> > In short, Shark does a number of things that are smarter and more optimized
> > for SQL queries than a straightforward Spark RDD implementation of the same.
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao
> > linkedin.com/in/ctnguyen
> >
> >
> >
> > On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <karen.cj@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> https://amplab.cs.berkeley.edu/benchmark/ provides a nice benchmark
> >> report. I am trying to reproduce the same set of queries in the
> >> spark-shell so that we can understand more about Shark and Spark and
> >> their performance on EC2.
> >>
> >> As for the Aggregation Query when X=8, Shark-disk takes 210 seconds
> >> and Shark-mem takes 111 seconds. However, when I materialize the
> >> results to disk, spark-shell takes more than 5 minutes (reduceByKey
> >> is used in the shell for aggregation). Further, if I cache the
> >> uservisits RDD, the performance deteriorates quite a lot because the
> >> dataset is way too big.
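
For reference, the plain-Spark version of that aggregation presumably
looks something like the sketch below (paths, the field delimiter, and
column positions are placeholders -- adjust them to the actual uservisits
layout):

    // Benchmark-style aggregation with reduceByKey, materialized to disk.
    val uservisits = sc.textFile("hdfs:///path/to/uservisits")  // hypothetical path
    val aggregated = uservisits
      .map { line =>
        val f = line.split(",")            // sourceIP assumed at f(0),
        (f(0).take(8), f(3).toDouble)      // adRevenue assumed at f(3)
      }
      .reduceByKey(_ + _)                  // SUM(adRevenue) GROUP BY SUBSTR(sourceIP, 1, 8)
    aggregated.saveAsTextFile("hdfs:///path/to/output")  // hypothetical path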
> >>
> >> Can anybody shed some light on why there is a more than 2x difference
> >> between shark-disk and spark-shell-disk, and on how to cache data in
> >> Spark correctly so that we can achieve performance comparable to
> >> shark-mem?
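
On the caching side, one knob that may be worth trying (a sketch only,
under the assumption that the raw rows don't fit in memory deserialized):
project down to just the columns the query needs before caching, and use a
serialized storage level instead of plain .cache():

    import org.apache.spark.storage.StorageLevel

    // Cache only the (prefix, adRevenue) pairs, serialized, rather than the
    // full text of every uservisits row; field positions are placeholders.
    val pairs = sc.textFile("hdfs:///path/to/uservisits")   // hypothetical path
      .map { line =>
        val f = line.split(",")
        (f(0).take(8), f(3).toDouble)
      }
      .persist(StorageLevel.MEMORY_ONLY_SER)

    val result = pairs.reduceByKey(_ + _)

Whether that closes the gap with shark-mem depends on available memory and
serializer settings (e.g. Kryo), so treat it as a starting point rather
than a definitive answer.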
> >>
> >> Thank you very much,
> >>
> >> -chen
> >
> >
>
