spark-user mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: What benefits do we really get out of colocation?
Date Sat, 03 Dec 2016 09:22:20 GMT
GCE seems to have better options. Has anyone had any experience with GCE?

On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
manish.malhotra.work@gmail.com> wrote:

> Thanks for sharing the numbers as well!
>
> Nowadays even the network can have very high throughput and might
> outperform the disk, but as Sean mentioned, data read over the network has
> other dependencies such as network hops; for example, if it crosses racks,
> there can be a switch in between.
>
> But yes, people are discussing Mesos + high-performance networking and
> are not worried about colocation for various use cases.
>
> AWS ephemeral storage is not good as a reliable storage file system, and
> EBS is the expensive alternative :)
>
> On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth909@gmail.com> wrote:
>
>> Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX
>> (receive throughput) on my Spark worker machine when I run `sudo iftop -B`.
>>
>> The problem with instance store on AWS is that it is all ephemeral, so
>> placing Cassandra on top doesn't make a lot of sense. So in short, AWS
>> doesn't seem to be the right place for colocating in theory. I would still
>> give you the benefit of the doubt and colocate :) but the numbers just
>> don't reflect a significant performance gain on AWS.
>>
>>
>> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> I'm sure he meant that this is a downside of not colocating.
>>> You are asking the right question. While networking is traditionally
>>> much slower than disk, that changes a bit in the cloud, where attached
>>> storage is remote too.
>>> The disk throughput here is mostly achievable in normal workloads.
>>> However I think you'll find it's going to be much harder to get 1Gbps out
>>> of network transfers. That's just the speed of the local interface, and of
>>> course the transfer speed depends on hops across the network beyond that.
>>> Network latency is going to be higher than disk too, though that's not as
>>> much an issue in this context.
>>>
>>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali <kanth909@gmail.com> wrote:
>>>
>>>> Wait, how is that a benefit? Isn't that a bad thing if you are saying
>>>> colocating leads to more latency and overall execution time is longer?
>>>>
>>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
>>>> vincent.gromakowski@gmail.com> wrote:
>>>>
>>>> You get more latency on reads so overall execution time is longer
>>>>
>>>> On Dec 3, 2016 7:39 AM, "kant kodali" <kanth909@gmail.com> wrote:
>>>>
>>>>
>>>> I wonder what benefits I really get if I colocate my Spark worker
>>>> process and Cassandra server process on each node.
>>>>
>>>> I understand the concept of moving compute towards the data instead of
>>>> moving data towards computation, but it sounds more like one is trying to
>>>> optimize for network latency.
>>>>
>>>> The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes
>>>> per second) of network throughput,
>>>>
>>>> and the disk throughput for m4.xlarge is 93.75 MB/s (link below):
>>>>
>>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>>>
>>>> So in this case I don't see how colocation can help, even if there is a
>>>> one-to-one mapping from Spark worker node to a colocated Cassandra node
>>>> where, say, we are doing a table scan of a billion rows.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>
>
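
The throughput figures quoted in the thread can be sanity-checked with a quick unit conversion. The sketch below is only a back-of-the-envelope illustration, not anything posted by the participants: the helper names are hypothetical, the 750 Mbps figure is an assumption inferred from 93.75 MB/s x 8, and the 100-byte row size is made up purely to put a scale on "a table scan of a billion rows".

```python
# Back-of-the-envelope check of the numbers quoted in the thread.
# Helper names and the row size are hypothetical, for illustration only.

def mbps_to_mb_per_s(megabits_per_second: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_second / 8.0


def scan_seconds(total_bytes: float, throughput_mb_per_s: float) -> float:
    """Ideal time to move total_bytes at a sustained rate (1 MB = 10**6 bytes)."""
    return total_bytes / (throughput_mb_per_s * 1_000_000)


network_nominal = mbps_to_mb_per_s(1000)  # 1 Gbps interface -> 125.0 MB/s
ebs_nominal = mbps_to_mb_per_s(750)       # assumed 750 Mbps -> 93.75 MB/s
observed_rx = 95.0                        # MB/s, as reported from iftop

# Assumption: one billion rows at 100 bytes each (row size is invented).
table_bytes = 1_000_000_000 * 100

for label, rate in [
    ("network (nominal)", network_nominal),
    ("EBS disk (nominal)", ebs_nominal),
    ("network (observed RX)", observed_rx),
]:
    print(f"{label}: {rate:.2f} MB/s, ideal scan time {scan_seconds(table_bytes, rate):.0f} s")
```

Under these assumptions the three rates fall within roughly 25% of each other, which is consistent with the point made in the thread that raw throughput alone does not show a large margin on this instance type; hops, latency, and contention are not modeled here.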
