spark-user mailing list archives

From Manish Malhotra <>
Subject Re: What benefits do we really get out of colocation?
Date Sat, 03 Dec 2016 09:16:29 GMT
Thanks for sharing the numbers as well!

Nowadays the network can have very high throughput and might even
outperform the disk, but as Sean mentioned, data read over the network has
other dependencies such as network hops: if it crosses racks, for example,
there can be a switch in between.

But yes, people are discussing Mesos + a high-performance network and are
not worried about colocation for various use cases.

AWS ephemeral storage is not good as a reliable storage file system; EBS is
the expensive alternative :)

On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <> wrote:

> Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
> throughput) on my Spark worker machine when I run `sudo iftop -B`.
> The problem with instance stores on AWS is that they are all ephemeral, so
> placing Cassandra on top doesn't make a lot of sense. In short, AWS
> doesn't seem to be the right place for colocating, in theory. I would still
> give you the benefit of the doubt and colocate :) but the numbers are not
> showing significant performance gains on AWS.
> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <> wrote:
>> I'm sure he meant that this is a downside to not colocating.
>> You are asking the right question. While networking is traditionally much
>> slower than disk, that changes a bit in the cloud, where attached storage
>> is remote too.
>> The disk throughput here is mostly achievable in normal workloads.
>> However I think you'll find it's going to be much harder to get 1Gbps out
>> of network transfers. That's just the speed of the local interface, and of
>> course the transfer speed depends on hops across the network beyond that.
>> Network latency is going to be higher than disk too, though that's not as
>> much an issue in this context.
>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali <> wrote:
>>> Wait, how is that a benefit? Isn't that a bad thing if you are saying
>>> colocating leads to more latency and overall execution time is longer?
>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <> wrote:
>>> You get more latency on reads so overall execution time is longer
>>> On Dec 3, 2016 7:39 AM, "kant kodali" <> wrote:
>>> I wonder what benefits I really get if I colocate my Spark worker
>>> process and Cassandra server process on each node?
>>> I understand the concept of moving compute towards the data instead of
>>> moving data towards computation, but it sounds more like one is trying to
>>> optimize for network latency.
>>> The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per
>>> second) network throughput,
>>> and the disk throughput for m4.xlarge is 93.75 MB/s (link below),
>>> so in this case I don't see how colocation can help, even if there is a
>>> one-to-one mapping from each Spark worker node to a colocated Cassandra node
>>> where, say, we are doing a table scan of a billion rows?
>>> Thanks!
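
The throughput figures quoted in this thread can be checked with a little
arithmetic. The sketch below is only a back-of-the-envelope model, not a
benchmark. It assumes decimal units (1 Gbps = 1000 Mbps), treats a remote
read as limited by the slower of disk and network, and uses a hypothetical
100 GB scan size that does not come from the thread. The function names are
made up for illustration.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def scan_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to stream size_mb at a sustained throughput, ignoring latency."""
    return size_mb / throughput_mb_per_s


# Figures quoted in the thread for an m4.xlarge node.
network_mb_s = gbps_to_mb_per_s(1.0)   # 1 Gbps interface -> 125.0 MB/s
disk_mb_s = 93.75                      # disk throughput quoted in the thread
observed_rx_mb_s = 95.0                # RX rate reported from `sudo iftop -B`

# Assumption: a non-colocated read is capped by the slower of the two links.
remote_read_mb_s = min(disk_mb_s, network_mb_s)

# Hypothetical scan size, chosen only to make the comparison concrete.
scan_size_mb = 100_000

for label, rate in [
    ("local disk (colocated)", disk_mb_s),
    ("remote read, ideal network", remote_read_mb_s),
    ("network at observed RX", min(disk_mb_s, observed_rx_mb_s)),
]:
    print(f"{label}: {rate:.2f} MB/s, {scan_seconds(scan_size_mb, rate):.0f} s")
```

Under these assumptions the disk is the bottleneck in every case, which is
the point of the original question. The model leaves out what Sean raises
above: network hops, latency, and the difficulty of actually sustaining the
full 1 Gbps.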
