spark-user mailing list archives

From Ilya Ganelin <>
Subject Re: Breaking the previous large-scale sort record with Spark
Date Sat, 11 Oct 2014 05:09:32 GMT
Hi Matei - I read your post with great interest. Could you possibly comment
in more depth on some of the issues you guys saw when scaling up Spark and
how you resolved them? I am interested specifically in Spark-related
problems. I'm working on scaling up Spark to very large datasets and have
been running into a variety of issues. Thanks in advance!
On Oct 10, 2014 10:54 AM, "Matei Zaharia" <> wrote:

> Hi folks,
> I interrupt your regularly scheduled user / dev list to bring you some
> pretty cool news for the project, which is that we've been able to use
> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
> faster on 10x fewer nodes. There's a detailed writeup at
> Summary: while Hadoop MapReduce held last year's 100 TB world record by
> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
> I want to thank Reynold Xin for leading this effort over the past few
> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
> providing the machines to make this possible. Finally, this result would of
> course not be possible without the many, many other contributions, testing
> and feature requests from throughout the community.
> For an engine to scale from these multi-hour petabyte batch jobs down to
> 100-millisecond streaming and interactive queries is quite uncommon, and
> it's thanks to all of you folks that we are able to make this happen.
> Matei
