The biggest difference I see is that Shark stores data in a column-oriented form a la C-Store and Vertica, whereas Spark keeps things in row-oriented form. Chris pointed this out in the RDD[TablePartition] vs RDD[Array[String]] comparison.
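To make that concrete, here is a plain-Scala sketch of the two layouts (the class names are illustrative, not Shark's actual TablePartition code): with a row-oriented layout, aggregating one column still walks and parses every row, while a columnar layout keeps that column in one contiguous, primitively-typed array.

```scala
// Hypothetical sketch of row- vs column-oriented in-memory layouts.
case class RowTable(rows: Array[Array[String]]) {
  // Aggregating one column means touching (and parsing) every row object.
  def sumCol(i: Int): Long = rows.map(r => r(i).toLong).sum
}

case class ColumnTable(cols: Array[Array[Long]]) {
  // The column is already a contiguous primitive array: no parsing,
  // no per-row boxing, much better cache locality.
  def sumCol(i: Int): Long = cols(i).sum
}

val rowT = RowTable(Array(Array("1", "10"), Array("2", "20")))
val colT = ColumnTable(Array(Array(1L, 2L), Array(10L, 20L)))
// Both compute 30 for column 1, but the columnar version scans far less data.
```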

I'd be interested in hearing how TablePartition compares to the Parquet format, which has been getting a lot of attention recently.

Personally, as far as performance goes, I remember once being surprised that a Shark row-counting query completed much faster than the equivalent Spark job, even after I had both datasets sitting in memory. This was a select count(*) from TABLE on a cached table in Shark vs. a val rdd = sc.textFile(...).cache; rdd.count in Spark. I attributed it to the column-oriented format at the time but didn't dig any deeper.
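One plausible explanation (a sketch of my assumption, not Shark's actual implementation) is that a columnar partition can carry its row count as precomputed metadata, so count(*) over a cached table never has to iterate rows at all, while counting an RDD[Array[String]] still walks every row:

```scala
// Illustrative only: a columnar partition that precomputes its row count once.
case class ColumnarPartition(cols: Array[Array[Long]]) {
  val numRows: Int = if (cols.isEmpty) 0 else cols(0).length
}

val partitions = Seq(
  ColumnarPartition(Array(Array(1L, 2L, 3L))),
  ColumnarPartition(Array(Array(4L, 5L)))
)

// "select count(*)": just sum cached metadata, no row iteration.
val fastCount = partitions.map(_.numRows.toLong).sum

// Row-oriented equivalent: iterate every row object to count it.
val rows: Seq[Array[String]] =
  Seq(Array("a"), Array("b"), Array("c"), Array("d"), Array("e"))
val slowCount = rows.iterator.map(_ => 1L).sum
```

Both counts come out the same, but the columnar path does O(partitions) work instead of O(rows).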

On Wed, Jan 29, 2014 at 11:22 PM, Christopher Nguyen <> wrote:
Hi Chen, it's certainly correct to say it is hard to make an apples-to-apples comparison, in the sense that you can't assume there is an implementation-equivalent in "Spark only" for any given Shark query.

That said, I think the results of your comparisons could still be a valuable reference. There are scenarios where perhaps someone wants to consider the trade-offs between implementing some ETL operation with Shark or with only Spark. Some sense of performance/cost difference would be helpful in making that decision.

Christopher T. Nguyen
Co-founder & CEO, Adatao

On Wed, Jan 29, 2014 at 11:10 PM, Chen Jin <> wrote:
Hi Christopher,

Thanks a lot for taking the time to explain some of the details under
Shark's hood. It is probably very hard to make an apples-to-apples
comparison between Shark and Spark since they might be suited to
different types of tasks. From what you have explained, is it fair to
say that Shark is better suited to SQL-like tasks, while Spark is more
for iterative machine learning algorithms?



On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen <> wrote:
> Chen, interesting comparisons you're trying to make. It would be great to
> share this somewhere when you're done.
> Some suggestions of non-obvious things to consider:
> In general there are any number of differences between Shark and some
> "equivalent" Spark implementation of the same query.
> Shark isn't necessarily "let's see which lines of code accomplish the
> same thing in Spark". Its current implementation is based on Hive, which
> has its own query planning, optimization, and execution.
> Shark's code has some of its own tricks. You can use "EXPLAIN" to see
> Shark's execution plan, and compare to your Spark approach.
> Further, Shark has its own memory storage format, e.g., the typed,
> column-oriented RDD[TablePartition], which can make it more
> memory-efficient and help execute many column-aggregation queries a lot
> faster than the row-oriented RDD[Array[String]] you may be using.
> In short, Shark does a number of things that are smarter and more optimized
> for SQL queries than a straightforward Spark RDD implementation of the same.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao
> On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <> wrote:
>> Hi All,
>> has given a nice benchmark
>> report. I am trying to reproduce the same set of queries in the
>> spark-shell so that we can understand more about shark and spark and
>> their performance on EC2.
>> As for the Aggregation Query when X=8, Shark-disk takes 210 seconds
>> and Shark-mem takes 111 seconds. However, when I materialize the
>> results to disk, spark-shell takes more than 5 minutes
>> (reduceByKey is used in the shell for aggregation). Further, if I
>> cache the uservisits RDD, since the dataset is way too big, the
>> performance deteriorates quite a lot.
>> Can anybody shed some light on why there is a more than 2x difference
>> between shark-disk and spark-shell-disk, and how to cache data in
>> Spark correctly so that we can achieve performance comparable to
>> shark-mem?
>> Thank you very much,
>> -chen