spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Weird worker usage
Date Sat, 26 Sep 2015 19:22:16 GMT
That means only a single receiver is doing all the work, so the data is
local to your N1 machine and all tasks are executed there. To get the data
onto N2, either do a .repartition or set the StorageLevel to one of the *_2
variants (e.g. MEMORY_ONLY_2), where the _2 suffix enables data
replication. I think that will solve your problem.
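To make the advice above concrete, here is a toy model (not Spark's actual scheduler) of locality-based task placement. The premise is the one stated in the reply: a task runs on a node holding a replica of its input block, so with a single un-replicated receiver on N1 every task queues on N1, while a `_2` storage level puts a copy on a second node and lets tasks spread. The node names and core counts are taken from the thread; the greedy scheduler itself is an illustrative assumption.

```python
# Toy model of locality-based scheduling (NOT Spark's real scheduler).
# A task may only run on a node that holds a replica of its block.

def schedule(tasks, replicas, cores):
    """Greedily assign each task to the replica-holding node with the
    most free cores. `replicas` maps task -> list of candidate nodes."""
    free = dict(cores)
    placement = {}
    for t in tasks:
        node = max(replicas[t], key=lambda n: free[n])
        placement[t] = node
        free[node] -= 1
    return placement

tasks = list(range(25))            # 25 partitions, as in the thread
cores = {"N1": 12, "N2": 16}       # cores per node, as in the thread

# Receiver ran on N1 and blocks were not replicated: all tasks pile up on N1.
only_n1 = schedule(tasks, {t: ["N1"] for t in tasks}, cores)

# With a *_2 storage level, each block also has a copy on N2.
replicated = schedule(tasks, {t: ["N1", "N2"] for t in tasks}, cores)

print(set(only_n1.values()))             # {'N1'}
print(sorted(set(replicated.values())))  # ['N1', 'N2']
```

With replication (or a `.repartition`), the scheduler has a choice of nodes and both machines get work, which is the behavior the original poster was expecting.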

Thanks
Best Regards

On Sun, Sep 27, 2015 at 12:50 AM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> That means only
>
> Thanks
> Best Regards
>
> On Sun, Sep 27, 2015 at 12:07 AM, N B <nb.nospam@gmail.com> wrote:
>
>> Hello,
>>
>> Does anyone have an insight into what could be the issue here?
>>
>> Thanks
>> Nikunj
>>
>>
>> On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nospam@gmail.com> wrote:
>>
>>> Hi Akhil,
>>>
>>> I do have 25 partitions being created. I have set
>>> the spark.default.parallelism property to 25. Batch size is 30 seconds and
>>> block interval is 1200 ms which also gives us roughly 25 partitions from
>>> the input stream. I can see 25 partitions being created and used in the
>>> Spark UI as well. It's just that those tasks are waiting for cores on N1
>>> to free up before being scheduled, while N2 sits idle.
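As a sanity check on the numbers above: in receiver-based Spark Streaming, each block interval produces one block, and hence one input partition, per batch. The figures below are the ones stated in the message (30-second batches, 1200 ms block interval).

```python
# Receiver-based Spark Streaming creates one block (one input
# partition) per block interval within each batch.
batch_interval_ms = 30_000   # 30-second batch duration
block_interval_ms = 1_200    # spark.streaming.blockInterval

partitions_per_batch = batch_interval_ms // block_interval_ms
print(partitions_per_batch)  # 25
```

So the 25 partitions the poster sees in the Spark UI match the configuration exactly; the problem is where those partitions are placed, not how many there are.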
>>>
>>> The cluster configuration is:
>>>
>>> N1: 2 workers, 6 cores and 16gb memory => 12 cores on the first node.
>>> N2: 2 workers, 8 cores and 16 gb memory => 16 cores on the second node.
>>>
>>> for a grand total of 28 cores. But it still does most of the processing
>>> on N1 (divided among the 2 workers running there), almost completely
>>> disregarding N2 until the final stage, where data is written out to
>>> Elasticsearch. I don't understand why it isn't distributing more
>>> partitions to N2 from the start and using it effectively. Since there
>>> are only 12 cores on N1 and 25 total partitions, shouldn't it send some
>>> of those partitions to N2 as well?
>>>
>>> Thanks
>>> Nikunj
>>>
>>>
>>> On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <akhil@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> The number of parallel tasks depends entirely on the number of
>>>> partitions you have. If you don't have sufficient partitions
>>>> (partitions > total # of cores), then try a .repartition.
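The rule of thumb in the reply above can be written down directly: to keep every core busy, the partition count should be at least the total core count. The helper function below is a hypothetical illustration, not a Spark API; the core counts are the ones given later in the thread.

```python
# Rule of thumb from the advice above: repartition when there are
# fewer partitions than total cores in the cluster.
def needs_repartition(num_partitions: int, total_cores: int) -> bool:
    return num_partitions < total_cores

total_cores = 12 + 16  # N1 + N2, per the cluster described in this thread
print(needs_repartition(25, total_cores))  # True: 25 partitions < 28 cores
```

Note that by this rule the poster's 25 partitions are already slightly short of the 28 available cores, though as the thread shows, the bigger issue turned out to be data locality rather than partition count.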
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nospam@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I have a Spark streaming application that reads from a Flume Stream,
>>>>> does quite a few maps/filters in addition to a few
>>>>> reduceByKeyAndWindow and join operations before writing the analyzed
>>>>> output to Elasticsearch inside a foreachRDD()...
>>>>>
>>>>> I recently started to run this on a 2 node cluster (Standalone) with
>>>>> the driver program directly submitting to Spark master on the same host.
>>>>> The way I have divided the resources is as follows:
>>>>>
>>>>> N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores
>>>>> each worker)
>>>>> N2: 2 spark workers (16 gb + 8 cores each worker).
>>>>>
>>>>> The application works just fine, but it is completely underusing N2.
>>>>> It seems to use N1 (note that both executors on N1 get used) for all
>>>>> the analytics, but when it comes to writing to Elasticsearch, it does
>>>>> divide the data among all 4 executors, which then write to ES on a
>>>>> separate host.
>>>>>
>>>>> I am puzzled as to why the data is not being distributed evenly
>>>>> across all 4 executors from the get-go, and why it only does so in
>>>>> the final step of the pipeline, which seems counterproductive as well.
>>>>>
>>>>> CPU usage on N1 is near the peak while on N2 is < 10% of overall
>>>>> capacity.
>>>>>
>>>>> Any help in getting the resources more evenly utilized on N1 and N2 is
>>>>> welcome.
>>>>>
>>>>> Thanks in advance,
>>>>> Nikunj
>>>>>
>>>>>
>>>>
>>>
>>
>
