spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
Date Fri, 27 Mar 2015 20:25:23 GMT
Hello,

Well, every problem you want to solve with technology needs a good
justification for choosing a particular technology. So the first thing to ask
is which technology fits your current and future problems. This is also what
the article says. Unfortunately, it gives only a vague answer as to why
there is this performance gap. Is it a Spark architecture issue? Is it a
configuration issue? Is it a design issue in the Spark version of the
algorithms? Is it an Amazon issue? Why did he use a laptop and not a single
Amazon machine for the comparison? Why did he not run multiple threads on a
single machine (for some problems a single thread might be the fastest
solution anyway)?

In my experience, a single machine can already be quite useful for
graph algorithms. There are also different graph systems, each for different
purposes. Spark GraphX is more general (it can be used in combination with the
whole Spark platform!) and probably less performant than highly specialized
graph systems leveraging GPUs etc. Those specialized systems have the
disadvantage that they are not generally suitable for, or integrated with,
other types of processing, such as streaming, MapReduce, RDDs, etc.

I am always curious, for any technology, about why and where one loses
performance. That is why one does proofs of concept and evaluates technology
depending on the business case. Maybe the article is right, but it is
unclear whether it can be generalized or whether it really has an impact on
your business case for Spark/GraphX. His algorithms can only do graph
processing for a very special case and are not suitable for a general
all-purpose big data infrastructure.

Best regards
On 27 March 2015 at 19:33, "Eran Medan" <ehrann.mehdan@gmail.com> wrote:

> Remember that article that went viral on HN, where a guy showed how
> GraphX / Giraph / GraphLab / Spark perform worse on a 128-node cluster
> than on a single-threaded machine? If not, here is the article:
> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
>
>
> Well, as you may recall, this stirred up a lot of commotion in the big data
> community (and around Spark/GraphX in particular).
>
> People (justly I guess) blamed him for not really having “big data”, as
> all of his data set fits in memory, so it doesn't really count.
>
>
> So he took up the challenge and came back with a counter-benchmark that is
> pretty hard to argue with, now with a huge data set (1TB of data, encoded
> using Hilbert curves to 154GB, but still large).
> See:
> http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
>
> He provided the source here https://github.com/frankmcsherry/COST as an
> example
>
> His benchmark shows how, on a graph with 128 billion edges, he got 2x to 10x
> faster results with a single-threaded, Rust-based implementation.
>
> So, what is the counterargument? It pretty much seems like a slap in the
> face for Spark / GraphX etc. (which I like and use on a daily basis).
>
> Before I dive into re-validating his benchmarks with my own use cases,
> what is your opinion on this? If this is the case, then what IS the use
> case for using Spark/GraphX at all?
>
