spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Weird worker usage
Date Sat, 26 Sep 2015 19:20:51 GMT
That means only

Thanks
Best Regards

On Sun, Sep 27, 2015 at 12:07 AM, N B <nb.nospam@gmail.com> wrote:

> Hello,
>
> Does anyone have an insight into what could be the issue here?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nospam@gmail.com> wrote:
>
>> Hi Akhil,
>>
>> I do have 25 partitions being created. I have set
>> the spark.default.parallelism property to 25. Batch size is 30 seconds and
>> block interval is 1200 ms which also gives us roughly 25 partitions from
>> the input stream. I can see 25 partitions being created and used in the
>> Spark UI also. Its just that those tasks are waiting for cores on N1 to get
>> free before being scheduled while N2 is sitting idle.
>>
>> The cluster configuration is:
>>
>> N1: 2 workers, 6 cores and 16gb memory => 12 cores on the first node.
>> N2: 2 workers, 8 cores and 16 gb memory => 16 cores on the second node.
>>
>> for a grand total of 28 cores. But it still does most of the processing
>> on N1 (divided among the 2 workers running) but almost completely
>> disregarding N2 until its the final stage where data is being written out
>> to Elasticsearch. I am not sure I understand the reason behind it not
>> distributing more partitions to N2 to begin with and use it effectively.
>> Since there are only 12 cores on N1 and 25 total partitions, shouldn't it
>> send some of those partitions to N2 as well?
>>
>> Thanks
>> Nikunj
>>
>>
>> On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <akhil@sigmoidanalytics.com>
>> wrote:
>>
>>> Parallel tasks totally depends on the # of partitions that you are
>>> having, if you are not receiving sufficient partitions (partitions > total
>>> # cores) then try to do a .repartition.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nospam@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I have a Spark streaming application that reads from a Flume Stream,
>>>> does quite a few maps/filters in addition to a few reduceByKeyAndWindow and
>>>> join operations before writing the analyzed output to ElasticSearch inside
>>>> a foreachRDD()...
>>>>
>>>> I recently started to run this on a 2 node cluster (Standalone) with
>>>> the driver program directly submitting to Spark master on the same host.
>>>> The way I have divided the resources is as follows:
>>>>
>>>> N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores
>>>> each worker)
>>>> N2: 2 spark workers (16 gb + 8 cores each worker).
>>>>
>>>> The application works just fine but it is underusing N2 completely. It
>>>> seems to use N1 (note that both executors on N1 get used) for all the
>>>> analytics but when it comes to writing to Elasticsearch, it does divide the
>>>> data around into all 4 executors which then write to ES on a separate host.
>>>>
>>>> I am puzzled as to why the data is not being distributed evenly from
>>>> the get go into all 4 executors and why would it only do so in the final
>>>> step of the pipeline which seems counterproductive as well?
>>>>
>>>> CPU usage on N1 is near the peak while on N2 is < 10% of overall
>>>> capacity.
>>>>
>>>> Any help in getting the resources more evenly utilized on N1 and N2 is
>>>> welcome.
>>>>
>>>> Thanks in advance,
>>>> Nikunj
>>>>
>>>>
>>>
>>
>

Mime
View raw message