spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Please Help: Amplab Benchmark Performance
Date Thu, 30 Jan 2014 07:28:41 GMT
The biggest difference I see is that Shark stores data in column-oriented
form, a la C-Store and Vertica, whereas a plain Spark RDD keeps things in
row-oriented form.  Chris pointed this out in the RDD[TablePartition] vs
RDD[Array[String]] comparison.
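
To make the row-vs-column distinction concrete, here is a minimal sketch in plain Scala (no Spark needed). The ColumnarPartition class, the sample values, and the two column names are made up for illustration; this is not Shark's actual TablePartition, only a picture of why a typed column can be cheaper to aggregate than an Array[String] row.

```scala
// Hypothetical layouts for illustration only; not Shark's real TablePartition.
object LayoutSketch {
  // Row-oriented: each record is an Array[String], so a numeric field
  // has to be parsed from text every time it is scanned.
  val rows: Array[Array[String]] = Array(
    Array("10.0.0.1", "0.50"),
    Array("10.0.0.2", "1.25"),
    Array("10.0.0.3", "2.00")
  )

  // Column-oriented: one typed, contiguous array per column.
  final case class ColumnarPartition(sourceIP: Array[String], adRevenue: Array[Double])

  // Build the columnar form once; the parsing cost is paid here, not per query.
  val columns: ColumnarPartition =
    ColumnarPartition(rows.map(_(0)), rows.map(_(1).toDouble))

  // Aggregating over rows touches every record and re-parses the field.
  def sumRowOriented(data: Array[Array[String]]): Double =
    data.map(r => r(1).toDouble).sum

  // Aggregating over a column reads only the one typed array it needs.
  def sumColumnOriented(p: ColumnarPartition): Double =
    p.adRevenue.sum
}
```

Both functions return the same sum; the difference is how much data is touched and how much parsing is repeated per query.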

I'd be interested in hearing how TablePartition compares to the Parquet
format, which has been getting a lot of attention recently.
https://github.com/Parquet/parquet-format

Personally, as far as performance goes, I remember once being surprised that
a Shark row-counting query completed much faster than the equivalent Spark
job, even after I had both datasets sitting in memory.  This was a
select count(*) from TABLE on a cached table in Shark vs a
val rdd = sc.textFile(...).cache; rdd.count; in Spark.  I attributed it to
the column-oriented format at the time but didn't dig any deeper.
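
One plausible way a columnar cache could win on a plain count, sketched in plain Scala under a loud assumption: that each columnar partition records its row count when it is built. I have not verified that this is what Shark does; ColumnarPartition and numRows below are hypothetical names.

```scala
// Hypothetical sketch; the numRows field is an assumption, not confirmed Shark behavior.
object CountSketch {
  // Row-oriented cache: counting has to walk every cached element.
  def countRows(partition: Iterator[String]): Long = {
    var n = 0L
    while (partition.hasNext) {
      partition.next()
      n += 1
    }
    n
  }

  // Assumed columnar partition that stores its row count up front.
  final case class ColumnarPartition(numRows: Long, columns: Array[Array[Double]])

  // Counting then only sums one number per partition, without reading column data.
  def countColumnar(partitions: Seq[ColumnarPartition]): Long =
    partitions.map(_.numRows).sum
}
```

If the assumption holds, the count cost scales with the number of partitions rather than the number of rows, which would be consistent with the gap I saw.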


On Wed, Jan 29, 2014 at 11:22 PM, Christopher Nguyen <ctn@adatao.com> wrote:

> Hi Chen, it's certainly correct to say it is hard to make an
> apple-to-apple comparison in terms of being able to assume that there is an
> implementation-equivalent for any given Shark query, in "Spark only".
>
> That said, I think the results of your comparisons could still be a
> valuable reference. There are scenarios where perhaps someone wants to
> consider the trade-offs between implementing some ETL operation with Shark
> or with only Spark. Some sense of performance/cost difference would be
> helpful in making that decision.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Wed, Jan 29, 2014 at 11:10 PM, Chen Jin <karen.cj@gmail.com> wrote:
>
>> Hi Christopher,
>>
>> Thanks a lot for taking time to explain some details under Shark's
>> hood. It is probably very hard to make an apple-to-apple comparison
>> between Shark and Spark since they might be suitable for different
>> types of tasks. From what you have explained, is it OK to think Shark
>> is better off for SQL-like tasks, while Spark is more for iterative
>> machine learning algorithms?
>>
>> Cheers,
>>
>> -chen
>>
>> On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen <ctn@adatao.com> wrote:
>> > Chen, interesting comparisons you're trying to make. It would be great to
>> > share this somewhere when you're done.
>> >
>> > Some suggestions of non-obvious things to consider:
>> >
>> > In general there are any number of differences between Shark and some
>> > "equivalent" Spark implementation of the same query.
>> >
>> > Shark isn't necessarily what we may think of as "let's see which lines of
>> > code accomplish the same thing in Spark". Its current implementation is
>> > based on Hive which has its own query planning, optimization, and execution.
>> > Shark's code has some of its own tricks. You can use "EXPLAIN" to see
>> > Shark's execution plan, and compare to your Spark approach.
>> >
>> > Further Shark has its own memory storage format, e.g., typed-column-oriented
>> > RDD[TablePartition], that can make it more memory-efficient, and help
>> > execute many column aggregation queries a lot faster than the row-oriented
>> > RDD[Array[String]] you may be using.
>> >
>> > In short, Shark does a number of things that are smarter and more optimized
>> > for SQL queries than a straightforward Spark RDD implementation of the same.
>> > --
>> > Christopher T. Nguyen
>> > Co-founder & CEO, Adatao
>> > linkedin.com/in/ctnguyen
>> >
>> >
>> >
>> > On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <karen.cj@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> https://amplab.cs.berkeley.edu/benchmark/ has given a nice benchmark
>> >> report. I am trying to reproduce the same set of queries in the
>> >> spark-shell so that we can understand more about shark and spark and
>> >> their performance on EC2.
>> >>
>> >> As for the Aggregation Query when X=8,  Shark-disk takes 210 seconds
>> >> and Shark-mem takes 111 seconds. However, when I materialize the
>> >> results to the disk, spark-shell takes more than 5 minutes
>> >> (reduceByKey is used in the shell for aggregation). Further, if I
>> >> cache uservisits RDD, since the dataset is way too big, the
>> >> performance deteriorates quite a lot.
>> >>
>> >> Can anybody shed some light on why there is a more than 2x difference
>> >> between shark-disk and spark-shell-disk and how to cache data in spark
>> >> correctly such that we can achieve comparable performance as
>> >> shark-mem?
>> >>
>> >> Thank you very much,
>> >>
>> >> -chen
>> >
>> >
>>
>
>
