spark-dev mailing list archives

From "欧阳晋(欧阳晋)" <jin....@alibaba-inc.com>
Subject Re: Breaking the previous large-scale sort record with Spark
Date Mon, 13 Oct 2014 13:10:30 GMT

Great News!

Still some questions about this sort benchmark:
1. How many map tasks ran in the sort? I think the 250,000 is the number of reduce tasks. (If the unsorted file is read from HDFS using the default 64 MB chunk size, the number of map tasks would be about 1 PB / 64 MB?)
2. How long does a single reduce task run? Because the number of reduce tasks is very large, I think a single reduce task will not last very long, so the amount of data each reduce task reads in will also be very small?
3. How much memory does a single reduce task use in one run?
------------------------------------------------------------------
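The map-task estimate in question 1 can be checked with quick arithmetic. A minimal sketch, assuming binary units (1 PB = 2^50 bytes) and the default 64 MB HDFS chunk size mentioned above; the actual job may have used different split sizes, so this is only a ballpark:

```python
# Rough map-task count: input size divided by HDFS block (split) size.
# Assumes binary units: 1 PB = 2**50 bytes, 64 MB = 64 * 2**20 bytes.
input_bytes = 2 ** 50          # 1 PB of unsorted input
block_bytes = 64 * 2 ** 20     # default 64 MB HDFS chunk size
map_tasks = input_bytes // block_bytes
print(map_tasks)               # 16777216, i.e. roughly 16.8 million map tasks
```

So one map task per 64 MB split would indeed mean millions of map tasks for a 1 PB input, which is why larger split sizes (or fewer, larger partitions) are typically used at this scale.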
From: Matei Zaharia <matei.zaharia@gmail.com>
Sent: Friday, October 10, 2014, 22:54
To: user <user@spark.apache.org>; dev <dev@spark.apache.org>
Subject: Breaking the previous large-scale sort record with Spark

Hi folks,

I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for
the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1
PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in
72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to
sort 1 PB in 234 minutes.
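The "3x faster on 10x fewer nodes" claim can be sanity-checked from the figures in the paragraph above. A back-of-the-envelope sketch, assuming decimal units (100 TB = 100e12 bytes) and comparing only wall-clock times and node counts:

```python
# Per-node sort throughput implied by the two 100 TB results.
data_bytes = 100e12  # 100 TB, decimal units assumed

# Hadoop MapReduce record: 72 minutes on 2100 nodes.
hadoop_per_node = data_bytes / (72 * 60) / 2100   # bytes/s per node

# Spark result: 23 minutes on 206 nodes.
spark_per_node = data_bytes / (23 * 60) / 206     # bytes/s per node

ratio = spark_per_node / hadoop_per_node
print(round(ratio, 1))  # roughly 32x higher per-node throughput
```

In other words, 3x faster on 10x fewer machines multiplies out to roughly a 30x improvement in per-node throughput, which is the figure the detailed writeup emphasizes.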

I want to thank Reynold Xin for leading this effort over the past few weeks, along with Parviz
Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like to thank
Amazon's EC2 team for providing the machines to make this possible. Finally, this result would
of course not be possible without the many, many other contributions, tests, and feature requests
from throughout the community.

For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming
and interactive queries is quite uncommon, and it's thanks to all of you folks that we are
able to make this happen.

Matei
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org