spark-user mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: What benefits do we really get out of colocation?
Date Sat, 03 Dec 2016 10:05:56 GMT
Hmm, GCE seems to follow pretty much the same model as AWS.

On Sat, Dec 3, 2016 at 1:22 AM, kant kodali <kanth909@gmail.com> wrote:

> GCE seems to have better options. Has anyone had any experience with GCE?
>
> On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
>> Thanks for sharing the numbers as well!
>>
>> Nowadays even the network can have very high throughput and might
>> outperform the disk, but as Sean mentioned, data on the network has other
>> dependencies such as network hops; for example, traffic across racks may
>> pass through a switch in between.
>>
>> But yes, people are discussing Mesos plus a high-performance network and
>> are not worried about colocation for various use cases.
>>
>> AWS ephemeral storage is not good for a reliable storage file system, and
>> EBS is the expensive alternative :)
>>
>> On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth909@gmail.com> wrote:
>>
>>> Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX
>>> (receive throughput) on my Spark worker machine when I run `sudo iftop -B`.
>>>
>>> The problem with instance stores on AWS is that they are all ephemeral, so
>>> placing Cassandra on top of them doesn't make a lot of sense. In short, AWS
>>> doesn't seem to be the right place for colocating, in theory. I would still
>>> give you the benefit of the doubt and colocate :) but the numbers are not
>>> showing significant performance gains on AWS.
>>>
>>>
>>> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>
>>>> I'm sure he meant that this is a downside to not colocating.
>>>> You are asking the right question. While networking is traditionally
>>>> much slower than disk, that changes a bit in the cloud, where attached
>>>> storage is remote too.
>>>> The disk throughput here is mostly achievable in normal workloads.
>>>> However I think you'll find it's going to be much harder to get 1Gbps out
>>>> of network transfers. That's just the speed of the local interface, and of
>>>> course the transfer speed depends on hops across the network beyond that.
>>>> Network latency is going to be higher than disk too, though that's not as
>>>> much an issue in this context.
>>>>
>>>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali <kanth909@gmail.com> wrote:
>>>>
>>>>> Wait, how is that a benefit? Isn't that a bad thing if you are saying
>>>>> colocating leads to more latency and overall execution time is longer?
>>>>>
>>>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
>>>>> vincent.gromakowski@gmail.com> wrote:
>>>>>
>>>>> You get more latency on reads, so overall execution time is longer.
>>>>>
>>>>> On Dec 3, 2016 7:39 AM, "kant kodali" <kanth909@gmail.com> wrote:
>>>>>
>>>>>
>>>>> I wonder what benefits I really get if I colocate my Spark worker
>>>>> process and Cassandra server process on each node?
>>>>>
>>>>> I understand the concept of moving compute towards the data instead of
>>>>> moving data towards computation, but it sounds more like one is trying to
>>>>> optimize for network latency.
>>>>>
>>>>> The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per
>>>>> second) of network throughput,
>>>>>
>>>>> and the disk throughput for m4.xlarge is 93.75 MB/s (link below):
>>>>>
>>>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>>>>
>>>>> So in this case I don't see how colocation can help, even if there is a
>>>>> one-to-one mapping from each Spark worker node to a colocated Cassandra
>>>>> node, where, say, we are doing a table scan of a billion rows?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>
>>
>
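
The throughput comparison in the original question can be checked with a short back-of-envelope calculation. The Python sketch below is illustrative only: the 100-byte average row size is a hypothetical value that does not come from the thread, the function names are made up for this example, and it assumes each path sustains its quoted rate for the whole scan, which Sean suggests is unlikely for the network.

```python
# Back-of-envelope comparison using the figures quoted in the thread.
# Assumptions (NOT from the thread): decimal units (1 MB = 10**6 bytes),
# a hypothetical 100-byte average row, and sustained throughput on each path.

def gbps_to_mb_per_s(gbps):
    """Convert gigabits per second to megabytes per second."""
    return gbps * 1000 / 8


def mb_per_s_to_mbps(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def scan_seconds(rows, bytes_per_row, mb_per_s):
    """Time to move rows * bytes_per_row bytes at a given MB/s rate."""
    return rows * bytes_per_row / (mb_per_s * 1_000_000)


ROWS = 1_000_000_000   # "table scan of a billion rows" from the thread
BYTES_PER_ROW = 100    # hypothetical row size, chosen only for illustration

# Rates quoted in the thread, all in MB/s.
rates = {
    "network line rate (1 Gbps)": gbps_to_mb_per_s(1.0),
    "m4.xlarge EBS throughput": 93.75,
    "measured RX via iftop": 95.0,
}

for label, rate in rates.items():
    seconds = scan_seconds(ROWS, BYTES_PER_ROW, rate)
    print(f"{label}: {rate:.2f} MB/s "
          f"({mb_per_s_to_mbps(rate):.0f} Mbps), scan takes {seconds:.0f} s")
```

Under these assumed inputs, a scan limited by the 93.75 MB/s disk figure takes about a third longer than one running at the full 1 Gbps line rate, and the measured 95 MB/s sits close to the disk figure. A real workload would also depend on network hops, contention, and how the rows are actually laid out.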
