spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
Date Thu, 01 Jan 2015 00:21:39 GMT
Whats your spark-submit commands in both cases? Is it Spark Standalone or
YARN (both support client and cluster)? Accordingly what is the number of
executors/cores requested?

TD

On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji <eshioji@gmail.com> wrote:

> Also the job was deployed from the master machine in the cluster.
> ᐧ
>
> On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <eshioji@gmail.com> wrote:
>
>> Oh sorry that was a edit mistake. The code is essentially:
>>
>>      val msgStream = kafkaStream
>>        .map { case (k, v) => v}
>>        .map(DatatypeConverter.printBase64Binary)
>>        .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>
>> I.e. there is essentially no original code (I was calling saveAsTextFile
>> in a "save" function but that was just a remnant from previous debugging).
>>
>>
>> ᐧ
>>
>> On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> -dev, +user
>>>
>>> A decent guess: Does your 'save' function entail collecting data back
>>> to the driver? and are you running this from a machine that's not in
>>> your Spark cluster? Then in client mode you're shipping data back to a
>>> less-nearby machine, compared to with cluster mode. That could explain
>>> the bottleneck.
>>>
>>> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshioji@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I have a very, very simple streaming job. When I deploy this on the
>>> exact
>>> > same cluster, with the exact same parameters, I see big (40%)
>>> performance
>>> > difference between "client" and "cluster" deployment mode. This seems
>>> a bit
>>> > surprising.. Is this expected?
>>> >
>>> > The streaming job is:
>>> >
>>> >     val msgStream = kafkaStream
>>> >       .map { case (k, v) => v}
>>> >       .map(DatatypeConverter.printBase64Binary)
>>> >       .foreachRDD(save)
>>> >       .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>> >
>>> > I tried several times, but the job deployed with "client" mode can only
>>> > write at 60% throughput of the job deployed with "cluster" mode and
>>> this
>>> > happens consistently. I'm logging at INFO level, but my application
>>> code
>>> > doesn't log anything so it's only Spark logs. The logs I see in
>>> "client"
>>> > mode doesn't seem like a crazy amount.
>>> >
>>> > The setup is:
>>> > spark-ec2 [...] \
>>> >   --copy-aws-credentials \
>>> >   --instance-type=m3.2xlarge \
>>> >   -s 2 launch test_cluster
>>> >
>>> > And all the deployment was done from the master machine.
>>> >
>>> > ᐧ
>>>
>>
>>
>

Mime
View raw message