spark-user mailing list archives

From Henry Saputra
Subject Re: Breaking the previous large-scale sort record with Spark
Date Sat, 11 Oct 2014 06:04:07 GMT
Congrats to Reynold et al leading this effort!

- Henry

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia wrote:
> Hi folks,
> I interrupt your regularly scheduled user / dev list to bring you some pretty cool news
> for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and
> 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at
> Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in
> 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to
> sort 1 PB in 234 minutes.
> I want to thank Reynold Xin for leading this effort over the past few weeks, along with
> Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like
> to thank Amazon's EC2 team for providing the machines to make this possible. Finally, this
> result would of course not be possible without the many many other contributions, testing
> and feature requests from throughout the community.
> For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond
> streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that
> we are able to make this happen.
> Matei
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail: