spark-user mailing list archives

From jay vyas <jayunit100.apa...@gmail.com>
Subject Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
Date Mon, 30 Mar 2015 12:27:47 GMT
Just as Spark disrupted the Hadoop ecosystem by changing the
assumption that "you can't rely on memory in distributed analytics"... now
maybe we are challenging the assumption that "big data analytics need to be
distributed"?

I've been asking the same question lately, and have similarly seen that Spark
performs quite reliably and well on a local single-node system, even for a
streaming app which I ran for ten days in a row... I
almost felt guilty that I never put it on a cluster....!
On Mar 30, 2015 5:51 AM, "Steve Loughran" <stevel@hortonworks.com> wrote:

>
>  Note that even the Facebook "four degrees of separation" paper went down
> to a single machine running WebGraph (http://webgraph.di.unimi.it/) for
> the final steps, after running jobs in their Hadoop cluster to build the
> dataset for that final operation.
>
>  "The computations were performed on a 24-core machine with 72 GiB of
> memory and 1 TiB of disk space. The first task was to import the Facebook
> graph(s) into a compressed form for WebGraph [4], so that the multiple
> scans required by HyperANF’s diffusive process could be carried out
> relatively quickly."
>
>  Some toolkits/libraries are optimised for that single dedicated use, yet
> are downstream of the raw data; there, memory reads and L1-L3 cache locality
> become the main performance problem, and synchronisation techniques
> like BSP aren't necessarily needed.
>
>
>
>
>  On 29 Mar 2015, at 23:18, Eran Medan <ehrann.mehdan@gmail.com> wrote:
>
>  Hi Sean,
> I think your point about the ETL costs is the winning argument here, but I
> would like to see more research on the topic.
>
> What I would like to see researched is the ability to run a specialized set
> of common algorithms in a "fast-local-mode". Just as a compiler optimizer
> can decide to inline some methods, or rewrite a recursive function as a for
> loop if the call is in tail position, the future of GraphX could be
> that if an algorithm is a well-known one (e.g. shortest paths) and
> can be run faster locally than on a distributed set (taking into account
> the cost of bringing all the data local), then it will do so.
>
>  Thanks!
>
> On Sat, Mar 28, 2015 at 1:34 AM, Sean Owen <sowen@cloudera.com> wrote:
>
>> (I bet the Spark implementation could be improved. I bet GraphX could
>> be optimized.)
>>
>> Not sure about this one, but "in core" benchmarks often start by
>> assuming that the data is local. In the real world, data is unlikely
>> to be. The benchmark has to include the cost of bringing all the data
>> to the local computation too, since the point of distributed
>> computation is bringing work to the data.
>>
>> Specialist implementations for a special problem should always win
>> over generalist, and Spark is a generalist. Likewise you can factor
>> matrices way faster on a GPU than in Spark. These aren't entirely
>> either/or propositions; you can use Rust or GPU in a larger
>> distributed program.
>>
>> Typically a real-world problem involves more than core computation:
>> ETL, security, monitoring. Generalists are more likely to have an
>> answer to hand for these.
>>
>> Specialist implementations do just one thing, and they typically have
>> to be custom built. Compare the cost of highly skilled developer time
>> to generalist computing resources; $1m buys several dev years but also
>> rents a small data center.
>>
>> Speed is an important issue but by no means everything in the real
>> world, and these are rarely mutually exclusive options in the OSS
>> world. This is a great piece of work, but I don't think it's some kind
>> of argument against distributed computing.
>>
>>
>> On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan <ehrann.mehdan@gmail.com>
>> wrote:
>> > Remember that article that went viral on HN, where a guy showed how GraphX
>> > / Giraph / GraphLab / Spark have worse performance on a 128 node cluster than on
>> > a single-threaded machine? If not, here is the article:
>> > http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
>> >
>> >
>> > Well, as you may recall, this stirred up a lot of commotion in the big data
>> > community (and Spark/GraphX in particular).
>> >
>> > People (justly I guess) blamed him for not really having “big data”, as all
>> > of his data set fits in memory, so it doesn't really count.
>> >
>> >
>> > So he took the challenge and came back with a pretty hard-to-argue-with counter
>> > benchmark, now with a huge data set (1TB of data, encoded using Hilbert
>> > curves to 154GB, but still large). See:
>> > http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
>> >
>> > He provided the source here https://github.com/frankmcsherry/COST as an
>> > example
>> >
>> > His benchmark shows how, on a graph with 128 billion edges, he got 2x to 10x
>> > faster results with a single-threaded Rust-based implementation.
>> >
>> > So, what is the counter-argument? It pretty much seems like a blow to the
>> > face of Spark / GraphX etc. (which I like and use on a daily basis).
>> >
>> > Before I dive into re-validating his benchmarks with my own use cases: what
>> > is your opinion on this? If this is the case, then what IS the use case for
>> > using Spark/GraphX at all?
>>
>
>
>
