spark-user mailing list archives

From kant kodali <>
Subject Re: What benefits do we really get out of colocation?
Date Sat, 03 Dec 2016 09:12:57 GMT
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
throughput) on my Spark worker machine when I run `sudo iftop -B`.

The problem with instance store on AWS is that it is all ephemeral, so
placing Cassandra on top of it doesn't make a lot of sense. In short, AWS
doesn't seem to be the right place for colocating, at least in theory. I
would still give you the benefit of the doubt and colocate :) but the
numbers are not showing a significant performance gain on AWS.
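
To put the figures in this thread side by side, here is a minimal sketch in
Python. It only uses the numbers already quoted (1 Gbps nominal network,
93.75 MB/s disk, 95 MB/s observed RX); the function and variable names are
made up for illustration, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


# Figures quoted in this thread.
NOMINAL_NETWORK_MB_S = gbps_to_mb_per_s(1.0)  # 1 Gbps interface
DISK_MB_S = 93.75                             # quoted m4.xlarge disk throughput
MEASURED_RX_MB_S = 95.0                       # observed with `sudo iftop -B`

# How much of the nominal link the observed traffic uses,
# and how the observed network rate compares with the quoted disk rate.
network_utilization = MEASURED_RX_MB_S / NOMINAL_NETWORK_MB_S
rx_vs_disk = MEASURED_RX_MB_S / DISK_MB_S

print(f"nominal network: {NOMINAL_NETWORK_MB_S:.2f} MB/s")
print(f"observed RX is {network_utilization:.0%} of nominal network")
print(f"observed RX is {rx_vs_disk:.3f}x the quoted disk throughput")
```

On these numbers the observed network rate and the quoted disk rate are
within a couple of percent of each other, which is why the margin looks
small to me.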

On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <> wrote:

> I'm sure he meant that this is a downside of not colocating.
> You are asking the right question. While networking is traditionally much
> slower than disk, that changes a bit in the cloud, where attached storage
> is remote too.
> The disk throughput here is mostly achievable in normal workloads. However
> I think you'll find it's going to be much harder to get 1Gbps out of
> network transfers. That's just the speed of the local interface, and of
> course the transfer speed depends on hops across the network beyond that.
> Network latency is going to be higher than disk too, though that's not as
> much an issue in this context.
> On Sat, Dec 3, 2016 at 8:42 AM kant kodali <> wrote:
>> Wait, how is that a benefit? Isn't that a bad thing if you are saying
>> colocating leads to more latency and overall execution time is longer?
>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <> wrote:
>> You get more latency on reads, so overall execution time is longer.
>> On Dec 3, 2016 7:39 AM, "kant kodali" <> wrote:
>> I wonder what benefits I really get if I colocate my Spark worker
>> process and Cassandra server process on each node.
>> I understand the concept of moving compute towards the data instead of
>> moving data towards computation, but it sounds more like one is trying to
>> optimize for network latency.
>> The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes
>> per second) network throughput,
>> and the disk throughput for m4.xlarge is 93.75 MB/s (link below).
>> So in this case I don't see how colocation can help, even if there is a
>> one-to-one mapping from Spark worker node to a colocated Cassandra node
>> and, say, we are doing a table scan of a billion rows.
>> Thanks!
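
For the billion-row table scan mentioned above, a rough back-of-the-envelope
estimate can be sketched as follows. The row size of 100 bytes is purely a
hypothetical assumption (the thread does not give one), the helper name is
made up, and the model ignores latency, compression, and any parallelism
across nodes; it only divides bytes by sustained throughput.

```python
def scan_seconds(rows: int, bytes_per_row: int, throughput_mb_s: float) -> float:
    """Time to move rows * bytes_per_row bytes at a sustained rate in MB/s."""
    total_bytes = rows * bytes_per_row
    return total_bytes / (throughput_mb_s * 1_000_000)


ROWS = 1_000_000_000
ASSUMED_BYTES_PER_ROW = 100  # hypothetical; not stated in the thread

# Throughput figures quoted in the thread, in MB/s.
rates = {
    "disk (93.75 MB/s)": 93.75,
    "network, nominal (125 MB/s)": 125.0,
}

for label, mb_s in rates.items():
    seconds = scan_seconds(ROWS, ASSUMED_BYTES_PER_ROW, mb_s)
    print(f"{label}: about {seconds / 60:.1f} minutes per node")
```

Under that assumed row size the two paths differ by only a few minutes per
node, so with these figures throughput alone does not separate them.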
