spark-user mailing list archives

From N B <nb.nos...@gmail.com>
Subject Re: Weird worker usage
Date Sat, 26 Sep 2015 18:37:16 GMT
Hello,

Does anyone have an insight into what could be the issue here?

Thanks
Nikunj


On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nospam@gmail.com> wrote:

> Hi Akhil,
>
> I do have 25 partitions being created. I have set
> the spark.default.parallelism property to 25. Batch size is 30 seconds and
> block interval is 1200 ms which also gives us roughly 25 partitions from
> the input stream. I can see 25 partitions being created and used in the
> Spark UI as well. It's just that those tasks wait for cores on N1 to free
> up before being scheduled, while N2 sits idle.
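(For reference, the "roughly 25 partitions" figure follows directly from the batch and block intervals: receiver-based Spark Streaming cuts incoming data into one block per block interval, and each block becomes one partition of the batch's RDD. A quick sketch of that arithmetic:)

```python
# One block (hence one partition) is created per block interval within
# each batch. All intervals below are in milliseconds.
batch_interval_ms = 30_000   # 30-second batches
block_interval_ms = 1_200    # spark.streaming.blockInterval

partitions_per_batch = batch_interval_ms // block_interval_ms
print(partitions_per_batch)  # 25
```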
>
> The cluster configuration is:
>
> N1: 2 workers, 6 cores and 16 GB memory each => 12 cores on the first node.
> N2: 2 workers, 8 cores and 16 GB memory each => 16 cores on the second node.
>
> for a grand total of 28 cores. Yet it still does most of the processing on
> N1 (divided among the 2 workers running there), almost completely
> disregarding N2 until the final stage, where the data is written out to
> Elasticsearch. I don't understand why it doesn't distribute more
> partitions to N2 from the start and use it effectively.
> Since there are only 12 cores on N1 and 25 total partitions, shouldn't it
> send some of those partitions to N2 as well?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <akhil@sigmoidanalytics.com>
> wrote:
>
>> The number of parallel tasks depends entirely on the number of partitions
>> you have. If you are not getting enough partitions (you want partitions >
>> total # of cores), then try a .repartition.
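(In Spark this is a one-liner on the stream, e.g. `dstream.repartition(28)` to match the 28 total cores. Conceptually, repartitioning shuffles the records round-robin into n buckets so that every core gets work; a toy sketch in plain Python, with illustrative names only, not Spark's actual implementation:)

```python
def repartition(records, num_partitions):
    """Toy model of a round-robin repartition: spread records evenly
    across num_partitions buckets so no executor is left idle."""
    buckets = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        buckets[i % num_partitions].append(rec)
    return buckets

# 100 records spread over 28 partitions (one per core across N1 + N2):
# every bucket gets at least 3 records, so no core should sit idle.
parts = repartition(range(100), 28)
print(all(len(p) > 0 for p in parts))  # True
```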
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nospam@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I have a Spark streaming application that reads from a Flume Stream,
>>> does quite a few maps/filters in addition to a few reduceByKeyAndWindow and
>>> join operations before writing the analyzed output to Elasticsearch inside
>>> a foreachRDD()...
>>>
>>> I recently started to run this on a 2 node cluster (Standalone) with the
>>> driver program directly submitting to Spark master on the same host. The
>>> way I have divided the resources is as follows:
>>>
>>> N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores each
>>> worker)
>>> N2: 2 spark workers (16 gb + 8 cores each worker).
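(A standalone layout like this is typically expressed in each node's spark-env.sh; the sketch below is an assumption about how the setup above would be configured, using the standard Spark standalone variables, not a copy of the actual files:)

```shell
# spark-env.sh on N1 (N2 would use SPARK_WORKER_CORES=8)
SPARK_WORKER_INSTANCES=2   # two workers per node
SPARK_WORKER_CORES=6       # cores per worker
SPARK_WORKER_MEMORY=16g    # memory per worker
```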
>>>
>>> The application works just fine, but it almost completely underuses N2.
>>> It seems to use only N1 (note that both executors on N1 get used) for all
>>> the analytics, but when it comes to writing to Elasticsearch, it does
>>> distribute the data across all 4 executors, which then write to ES on a
>>> separate host.
>>>
>>> I am puzzled as to why the data is not distributed evenly across all 4
>>> executors from the get-go, and why it only spreads out in the final step
>>> of the pipeline, which seems counterproductive.
>>>
>>> CPU usage on N1 is near the peak while on N2 is < 10% of overall
>>> capacity.
>>>
>>> Any help in getting the resources more evenly utilized on N1 and N2 is
>>> welcome.
>>>
>>> Thanks in advance,
>>> Nikunj
>>>
>>>
>>
>
