spark-user mailing list archives

From Gerard Maas <gerard.m...@gmail.com>
Subject Re: What is the best way to consume parallely from multiple topics in Spark Stream with Kafka
Date Wed, 18 Mar 2020 11:15:11 GMT
Hrishi,

Could you share a simplified version of the code you are running? A job is
made up of tasks.
While jobs are indeed executed sequentially, the tasks within a job run in parallel.
In the Spark UI, you can see that in the "Event Timeline" visualization.
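
To illustrate the distinction, here is a minimal sketch (where `stream` and
`process` stand in for your actual DStream and logic): each output operation
such as foreachRDD triggers one job per batch, and within that job Spark
schedules one task per partition, and those tasks run in parallel.

  stream.foreachRDD { rdd =>
    // one job per batch; within it, one task per partition,
    // and the tasks execute concurrently on the executors
    rdd.foreachPartition { records =>
      records.foreach(process)
    }
  }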

If you could share an example of your code that illustrates what you want
to achieve, I could have a look at it.

kr, Gerard.

On Wed, Mar 18, 2020 at 8:24 AM Hrishikesh Mishra <sd.hrishi@gmail.com>
wrote:

> Hi Gerard,
>
> First of all, apologies for the late reply.
>
> You are right, tasks are distributed to the cluster and parallelism is
> achieved through Kafka partitions. But my use case is different: in one
> streaming context, I am consuming events from 6 different topics, and for
> each topic 6 different actions are being performed.
>
> So in total, *Spark jobs = 6 streams X 6 actions = 36 jobs* (plus some
> Kafka commits, which happen on the driver) for *a batch.*
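>
> Roughly, the structure looks like this (a simplified sketch; the topic
> names, ssc, kafkaParams, and doAction are placeholders):
>
> import org.apache.spark.streaming.kafka010.KafkaUtils
> import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
> import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
>
> val topics = Seq("t1", "t2", "t3", "t4", "t5", "t6")
> topics.foreach { topic =>
>   // one direct stream per topic, all on the same StreamingContext
>   val stream = KafkaUtils.createDirectStream[String, String](
>     ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
>   // six output operations => six jobs per batch for this topic
>   (1 to 6).foreach { action =>
>     stream.foreachRDD { rdd => doAction(action, rdd) }
>   }
> }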
>
> These 36 jobs execute sequentially, because at any point in time only one
> job is active (see the image below). A delay in any one job delays the
> complete batch.
>
> [screenshot: Spark UI showing the jobs of one batch running sequentially]
>
>
> But a single job (which corresponds to one topic and one action) is
> executed in parallel, based on the number of partitions that topic has.
>
> [screenshot: Spark UI showing the tasks of a single job running in parallel]
>
>
> My requirement is the following:
>
>    - I want to run these jobs in parallel in some controlled manner: jobs
>    of different topics in parallel, but jobs within a topic sequentially.
>    We tried *spark.scheduler.mode: FAIR* and submitted jobs in different
>    pools (see the sketch below), but didn't get any benefit.
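>
> The FAIR attempt looked roughly like this (a sketch; the pool name and
> allocation file path are placeholders):
>
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .set("spark.scheduler.mode", "FAIR")
>   .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
> // fairscheduler.xml defines one pool per topic, e.g.:
> //   <allocations>
> //     <pool name="topic1">
> //       <schedulingMode>FAIR</schedulingMode>
> //       <weight>1</weight>
> //       <minShare>1</minShare>
> //     </pool>
> //   </allocations>
> // before submitting a topic's jobs, select its pool:
> sc.setLocalProperty("spark.scheduler.pool", "topic1")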
>
> But when I tried *spark.streaming.concurrentJobs = 4*, then 4 jobs run
> concurrently, but from different batches (batch time 19:15:55 and batch
> time 19:16:00), which could be a problem for committing offsets.
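>
> For reference, that was set like this (it is an undocumented property):
>
> sparkConf.set("spark.streaming.concurrentJobs", "4")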
>
> [screenshot: Spark UI showing 4 jobs running concurrently from two different batches]
>
>
> Regards
> Hrishi
>
>
> On Thu, Mar 5, 2020 at 12:49 AM Gerard Maas <gerard.maas@gmail.com> wrote:
>
>> Hi Hrishi,
>>
>> When using the Direct Kafka stream approach, processing tasks will be
>> distributed to the cluster.
>> The level of parallelism depends on how many partitions the consumed
>> topics have.
>> Why do you think that the processing is not happening in parallel?
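>>
>> For reference, a base scenario looks roughly like this (a minimal sketch;
>> the broker address, topic names, and group id are made up):
>>
>> import org.apache.kafka.common.serialization.StringDeserializer
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>> import org.apache.spark.streaming.kafka010.KafkaUtils
>> import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
>> import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
>>
>> val conf = new SparkConf().setAppName("multi-topic-direct")
>> val ssc = new StreamingContext(conf, Seconds(5))
>> val kafkaParams = Map[String, Object](
>>   "bootstrap.servers" -> "broker:9092",
>>   "key.deserializer" -> classOf[StringDeserializer],
>>   "value.deserializer" -> classOf[StringDeserializer],
>>   "group.id" -> "example-group")
>> // one direct stream over all topics: Spark creates one task per
>> // Kafka partition, so tasks already run in parallel within each job
>> val stream = KafkaUtils.createDirectStream[String, String](
>>   ssc, PreferConsistent,
>>   Subscribe[String, String](Seq("topicA", "topicB"), kafkaParams))
>> stream.foreachRDD(rdd => println(s"batch count: ${rdd.count()}"))
>> ssc.start()
>> ssc.awaitTermination()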
>>
>> I would advise you to get the base scenario working before looking into
>> advanced features like `concurrentJobs` or a particular scheduler.
>>
>> kind regards, Gerard.
>>
>> On Wed, Mar 4, 2020 at 7:42 PM Hrishikesh Mishra <sd.hrishi@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> My Spark Streaming job consumes from multiple Kafka topics. How can I
>>> process them in parallel? Should I try *spark.streaming.concurrentJobs*?
>>> It has some adverse effects, as mentioned by its creator. Is it still
>>> valid with Spark 2.4 and the Direct Kafka Stream? And will FAIR
>>> scheduling mode help in this scenario? I could not find any reliable
>>> references on this.
>>>
>>> Regards
>>> Hrishi
>>>
>>>
