spark-user mailing list archives

From Hrishikesh Mishra <sd.hri...@gmail.com>
Subject Re: What is the best way to consume parallely from multiple topics in Spark Stream with Kafka
Date Wed, 18 Mar 2020 07:24:16 GMT
Hi Gerard,

First of all, apologies for late reply.

You are right, tasks are distributed to the cluster and parallelism is
achieved through Kafka partitions. But my use case is different: in one
streaming context, I am consuming events from 6 different topics, and for
each topic 6 different actions are performed.

So in total, *Spark jobs = 6 streams X 6 actions = 36 jobs* (plus some Kafka
commits, which happen on the driver) for *a batch.*
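
To make that arithmetic concrete, here is a minimal sketch of the setup
(topic names, broker address, batch interval, and the per-action logic are
placeholders, not our actual code):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object MultiTopicStream {
      // Hypothetical per-action processing logic.
      def process(action: Int, record: ConsumerRecord[String, String]): Unit = ()

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("multi-topic-stream")
        val ssc  = new StreamingContext(conf, Seconds(5))

        // One direct stream per topic: 6 streams in a single StreamingContext.
        val topics  = (1 to 6).map(i => s"topic$i")
        val streams = topics.map { topic =>
          val kafkaParams = Map[String, Object](
            "bootstrap.servers"  -> "broker:9092",
            "key.deserializer"   -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            // Separate group per stream, as the kafka010 integration recommends.
            "group.id"           -> s"my-group-$topic",
            "enable.auto.commit" -> (false: java.lang.Boolean)
          )
          KafkaUtils.createDirectStream[String, String](
            ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
        }

        // Each foreachRDD is a separate output operation, so every batch
        // schedules 6 streams x 6 actions = 36 Spark jobs.
        for (stream <- streams; action <- 1 to 6) {
          stream.foreachRDD { rdd =>
            rdd.foreach(record => process(action, record))
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }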

These 36 jobs execute sequentially, because at any point in time only one job
is active (see the image below), and a delay in any one job delays the
complete batch.

[image: image.png]


Within a single job (which corresponds to one topic and one action), work is
executed in parallel, based on the number of partitions that topic has.

[image: image.png]


What is my requirement:

   - I want to run these jobs in parallel in some controlled manner: jobs of
   different topics should run in parallel, but the jobs within a topic should
   run sequentially. We tried *spark.scheduler.mode: FAIR* and submitted the
   jobs to different pools, but didn't get any benefit (a sketch of that
   attempt follows this list).
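
Continuing the sketch above, the FAIR attempt looked roughly like this (pool
names and the allocation file path are illustrative). My understanding is
that FAIR mode only arbitrates among jobs that have already been *submitted*
concurrently, and with the default spark.streaming.concurrentJobs = 1 the
streaming JobScheduler still submits one job at a time, which would explain
why the pools showed no benefit:

    val conf = new SparkConf()
      .setAppName("fair-scheduling-attempt")
      .set("spark.scheduler.mode", "FAIR")
      // Pools are defined in an allocation file; path is illustrative.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    // Route each topic's jobs to its own pool (pool names must match
    // entries in the allocation file).
    topics.zip(streams).foreach { case (topic, stream) =>
      stream.foreachRDD { rdd =>
        rdd.sparkContext.setLocalProperty("spark.scheduler.pool", s"pool-$topic")
        rdd.foreach(record => MultiTopicStream.process(1, record))
      }
    }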

But when I tried *spark.streaming.concurrentJobs = 4*, four jobs were
actively running from different batches (batch time 19:15:55 and batch time
19:16:00), which could be a problem for committing offsets.

[image: image.png]
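
For reference, here is why the concurrent batches worry me. We commit
offsets manually with the standard commitAsync pattern from the kafka010
integration, which assumes batches finish in order (the config value and
processing call are placeholders; spark.streaming.concurrentJobs is an
undocumented setting):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // Undocumented knob: size of the job-execution thread pool in the
    // streaming JobScheduler (defaults to 1).
    conf.set("spark.streaming.concurrentJobs", "4")

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreach(record => MultiTopicStream.process(1, record))
      // With 4 batches in flight, a later batch can finish and enqueue its
      // commit before an earlier one, so committed offsets can run ahead of
      // fully processed data if a batch fails.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }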


Regards
Hrishi


On Thu, Mar 5, 2020 at 12:49 AM Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi Hrishi,
>
> When using the Direct Kafka stream approach, processing tasks will be
> distributed to the cluster.
> The level of parallelism is dependent on how many partitions the consumed
> topics have.
> Why do you think that the processing is not happening in parallel?
>
> I would advise you to get the base scenario working before looking into
> advanced features like `concurrentJobs` or a particular scheduler.
>
> kind regards, Gerard.
>
> On Wed, Mar 4, 2020 at 7:42 PM Hrishikesh Mishra <sd.hrishi@gmail.com>
> wrote:
>
>> Hi
>>
>> My Spark Streaming job consumes from multiple Kafka topics. How can I
>> process them in parallel? Should I try *spark.streaming.concurrentJobs*? It
>> has some adverse effects, as mentioned by its creator. Is that still the
>> case with Spark 2.4 and the Direct Kafka Stream? What about FAIR scheduling
>> mode, will it help in this scenario? I am not finding any reliable links
>> about this.
>>
>> Regards
>> Hrishi
>>
>>
