spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <so...@cloudera.com>
Subject Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
Date Sat, 28 Mar 2015 05:34:30 GMT
(I bet the Spark implementation could be improved. I bet GraphX could
be optimized.)

Not sure about this one, but "in core" benchmarks often start by
assuming that the data is local. In the real world, data is unlikely
to be. The benchmark has to include the cost of bringing all the data
to the local computation too, since the point of distributed
computation is bringing work to the data.

Specialist implementations for a special problem should always win
over generalist, and Spark is a generalist. Likewise you can factor
matrices way faster in a GPU than in Spark. These aren't entirely
either/or propositions; you can use Rust or GPU in a larger
distributed program.

Typically a real-world problem involves more than core computation:
ETL, security, monitoring. Generalists are more likely to have an
answer to hand for these.

Specialist implementations do just one thing, and they typically have
to be custom built. Compare the cost of highly skilled developer time
to generalist computing resources; $1m buys several dev years but also
rents a small data center.

Speed is an important issue but by no means everything in the real
world, and these are rarely mutually exclusive options in the OSS
world. This is a great piece of work, but I don't think it's some kind
of argument against distributed computing.


On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan <ehrann.mehdan@gmail.com> wrote:
> Remember that article that went viral on HN? (Where a guy showed how GraphX
> / Giraph / GraphLab / Spark have worse performance on a 128 cluster than on
> a 1 thread machine? if not here is the article
> -http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
>
>
> Well as you may recall, this stirred up a lot of commotion in the big data
> community (and Spark/GraphX in particular)
>
> People (justly I guess) blamed him for not really having “big data”, as all
> of his data set fits in memory, so it doesn't really count.
>
>
> So he took the challenge and came with a pretty hard to argue counter
> benchmark, now with a huge data set (1TB of data, encoded using Hilbert
> curves to 154GB, but still large).
> see at -
> http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
>
> He provided the source here https://github.com/frankmcsherry/COST as an
> example
>
> His benchmark shows how on a 128 billion edges graph, he got X2 to X10
> faster results on a single threaded Rust based implementation
>
> So, what is the counter argument? it pretty much seems like a blow in the
> face of Spark / GraphX etc, (which I like and use on a daily basis)
>
> Before I dive into re-validating his benchmarks with my own use cases. What
> is your opinion on this? If this is the case, then what IS the use case for
> using Spark/GraphX at all?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message