spark-user mailing list archives

From Reynold Xin <>
Subject Shark 0.8.0 release
Date Sat, 19 Oct 2013 02:07:38 GMT
We are happy to announce Shark 0.8.0, a major release that brings
many new capabilities and performance improvements. You can download the
release here:

Shuffle Performance for Large Aggregations and Joins

We’ve implemented a new data serialization format that substantially
improves shuffle performance for large aggregations and joins.
The new format is more CPU-efficient, while also reducing the size of the
data sent across the network. This can improve performance by up to 3X for
queries that have large aggregations or joins.

In-memory Columnar Compression

Memory is precious. To let you fit more data into memory, Shark now
implements CPU-efficient compression algorithms, including dictionary
encoding and run-length encoding. In addition to using less space,
in-memory compression actually improves the response time of many queries,
because it reduces GC pressure and improves locality, leading to better CPU
cache performance. The compression ratio is workload-dependent, but we
have seen anywhere from 2X to 30X compression in real workloads.

There is also no need to worry about picking the best compression scheme.
When first loading the data into memory, Shark will automatically determine
the best scheme to apply for the given dataset.
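
As a rough illustration of the two encodings named above, here is a minimal
Python sketch of run-length and dictionary encoding, together with a naive
selector that picks a scheme per column. This is not Shark's code: the
function names and the selection heuristic (halving the element count) are
assumptions made purely for the example.

```python
from itertools import groupby


def rle_encode(column):
    """Run-length encode: collapse each run of equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


def dict_encode(column):
    """Dictionary encode: keep each distinct value once, plus an integer code per row."""
    dictionary, index, codes = [], {}, []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dict_decode(dictionary, codes):
    """Map integer codes back to their original values."""
    return [dictionary[code] for code in codes]


def pick_scheme(column):
    """Hypothetical heuristic, not Shark's: prefer RLE when runs are long,
    otherwise dictionary encoding when there are few distinct values."""
    n = len(column)
    if n == 0:
        return "none"
    if len(rle_encode(column)) * 2 <= n:
        return "rle"
    if len(set(column)) * 2 <= n:
        return "dictionary"
    return "none"
```

A column with long runs, such as ["US", "US", "US", "US", "UK", "UK", "UK"],
selects "rle" under this heuristic, while an alternating column such as
["a", "b", "a", "b", "a", "b"] selects "dictionary".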

Partition Pruning aka Data Skipping for In-memory Tables

A typical query looks at only a small subset of the overall data.
Partition pruning allows Shark to skip partitions that it knows for sure
do not contain any data satisfying the query predicates. For one
early user of Shark, this allowed query processing to skip examining 98% of
the data.

Unlike Hive's partitioning feature, partition pruning refers to
Shark's use of column statistics, collected during in-memory data
materialization, to automatically reduce the number of RDD partitions that
need to be scanned.
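
The idea can be sketched in a few lines of Python. The example assumes the
statistics are a per-partition minimum and maximum of one integer column;
which statistics Shark actually collects is not stated here, and the names
PartitionStats, collect_stats and prune are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    """Min/max of one column, recorded when a partition is loaded into memory."""
    partition_id: int
    min_value: int
    max_value: int


def collect_stats(partitions):
    """Build statistics for every non-empty partition (a list of column values)."""
    return [
        PartitionStats(i, min(rows), max(rows))
        for i, rows in enumerate(partitions)
        if rows
    ]


def prune(stats, low, high):
    """Return ids of partitions whose [min, max] range overlaps [low, high].

    Partitions outside the range cannot satisfy `low <= col <= high`,
    so they never need to be scanned.
    """
    return [
        s.partition_id
        for s in stats
        if s.max_value >= low and s.min_value <= high
    ]
```

With four partitions covering the ranges 1-9, 10-19, 20-29 and 30-39, a
predicate such as "col BETWEEN 11 AND 22" only needs to scan the second and
third partitions. Pruning of this kind works best when the data is already
clustered on the filtered column, for example by time.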

Spark 0.8.0 Support

First and foremost, through its Spark 0.8.0 support, this new version of
Shark supports a number of important features, including:

   - Web-based monitoring UI for cluster memory and job progress
   - Dropping a cached table releases the memory it occupied
   - Improved scheduling support (including fair scheduling, topology-aware

Fair Scheduling

Spark’s internal job scheduler has been refactored and extended to include
more sophisticated scheduling policies such as fair scheduling. The fair
scheduler allows multiple users to share an instance of Spark, which helps
users running shorter jobs achieve good performance even when
longer-running jobs are running in parallel.
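
To see why this helps short jobs, consider the toy Python model below. It
compares strict submission-order (FIFO) execution with a simple round-robin
across jobs on a single execution slot. This is a deliberately simplified
stand-in, not Spark's scheduler, and all function names are hypothetical.

```python
from collections import deque


def fifo_order(jobs):
    """Run jobs strictly in submission order; each job is (name, task_count)."""
    return [name for name, tasks in jobs for _ in range(tasks)]


def fair_order(jobs):
    """Round-robin one task at a time across jobs that still have work left."""
    queue = deque((name, tasks) for name, tasks in jobs if tasks > 0)
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)
        if remaining > 1:
            queue.append((name, remaining - 1))
    return order


def finish_time(order, name):
    """1-based slot in which the last task of job `name` runs."""
    return max(i for i, n in enumerate(order, start=1) if n == name)
```

For jobs = [("long", 6), ("short", 2)], the short job finishes in slot 8
under fifo_order but in slot 4 under fair_order, while the long job finishes
in slot 8 either way.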

Shark users can also take advantage of this new capability by setting the
configuration variable spark.scheduler.cluster.fair.pool to a specific
scheduling pool at runtime. For example:

set mapred.fairscheduler.pool=short_query_pool;
select count(*) from my_shark_in_memory_table;

Build and Development

A continuous integration script has been added that automatically
fetches all the Shark dependencies (Scala, Hive, Spark) and executes both the
Shark internal unit tests and the Hive compatibility unit tests. It is
already used in several places as part of Jenkins pipelines.

Users can now build Shark against specific versions of Hadoop without
modifying the build file. Simply specify the Hadoop version using the
SHARK_HADOOP_VERSION environment variable when running the build:

SHARK_HADOOP_VERSION=1.0.5 sbt/sbt package

Other Improvements

   - Reduced info level logging verbosity.
   - When connecting to a remote server, the Shark CLI no longer needs to
   launch a local SparkContext.
   - Various improvements to the experimental Tachyon support.
   - Stability improvement for map join.
   - Improved LIMIT performance for highly selective queries.

We would like to thank Konstantin Boudnik, Jason Dai, Harvey Feng, Sarah
Gerweck, Jason Giedymin, Cheng Hao, Mark Hamstra, Jon Hartlaub, Shane
Huang, Nandu Jayakumar, Andy Konwinski, Haoyuan Li, Harold Lim, Raymond
Liu, Antonio Lupher, Kay Ousterhout, Alexander Pivovarov, Sun Rui, Andre
Schumacher, Mingfei Shi, Amir Shimoni, Ram Sriharsha, Patrick Wendell,
Andrew Xia, Matei Zaharia, and Lin Zhao for their contributions to the
