Does anyone have an insight into what could be the issue here?


On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nospam@gmail.com> wrote:
Hi Akhil,

I do have 25 partitions being created. I have set the spark.default.parallelism property to 25. Batch size is 30 seconds and block interval is 1200 ms which also gives us roughly 25 partitions from the input stream. I can see 25 partitions being created and used in the Spark UI also. Its just that those tasks are waiting for cores on N1 to get free before being scheduled while N2 is sitting idle.

The cluster configuration is:

N1: 2 workers, 6 cores and 16gb memory => 12 cores on the first node.
N2: 2 workers, 8 cores and 16 gb memory => 16 cores on the second node.

for a grand total of 28 cores. But it still does most of the processing on N1 (divided among the 2 workers running) but almost completely disregarding N2 until its the final stage where data is being written out to Elasticsearch. I am not sure I understand the reason behind it not distributing more partitions to N2 to begin with and use it effectively. Since there are only 12 cores on N1 and 25 total partitions, shouldn't it send some of those partitions to N2 as well?


On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
Parallel tasks totally depends on the # of partitions that you are having, if you are not receiving sufficient partitions (partitions > total # cores) then try to do a .repartition.

Best Regards

On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nospam@gmail.com> wrote:
Hello all,

I have a Spark streaming application that reads from a Flume Stream, does quite a few maps/filters in addition to a few reduceByKeyAndWindow and join operations before writing the analyzed output to ElasticSearch inside a foreachRDD()...

I recently started to run this on a 2 node cluster (Standalone) with the driver program directly submitting to Spark master on the same host. The way I have divided the resources is as follows:

N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores each worker)
N2: 2 spark workers (16 gb + 8 cores each worker).

The application works just fine but it is underusing N2 completely. It seems to use N1 (note that both executors on N1 get used) for all the analytics but when it comes to writing to Elasticsearch, it does divide the data around into all 4 executors which then write to ES on a separate host.

I am puzzled as to why the data is not being distributed evenly from the get go into all 4 executors and why would it only do so in the final step of the pipeline which seems counterproductive as well?

CPU usage on N1 is near the peak while on N2 is < 10% of overall capacity.

Any help in getting the resources more evenly utilized on N1 and N2 is welcome.

Thanks in advance,