spark-user mailing list archives

From Adrian Tanase <>
Subject Re: [SPARK STREAMING] Concurrent operations in spark streaming
Date Mon, 26 Oct 2015 10:42:50 GMT
Thinking more about it – it should only be 2 tasks, as A and B are most likely collapsed by
Spark into a single task.

Again – learn to use the Spark UI, as it's really informative. The combination of the DAG visualization
and the task count should answer most of your questions.


From: Adrian Tanase
Date: Monday, October 26, 2015 at 11:57 AM
To: Nipun Arora, Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

If I understand the order correctly, not really. First of all, the easiest way to make sure
it works as expected is to check out the visual DAG in the Spark UI.

It should map 1:1 to your code, and since I don't see any shuffles in the operations below,
it should execute all in one stage, forking after X.
That means that each executor core will process a partition completely in isolation –
most likely 3 tasks (A, B, X2), most likely in the order you define in code, although depending
on the data some tasks may get skipped or moved around.

I’m curious – why do you ask? Do you have a particular concern or use case that relies
on ordering between A, B and X2?


From: Nipun Arora
Date: Sunday, October 25, 2015 at 4:09 PM
To: Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

So essentially the driver/client program needs to explicitly have two threads to ensure concurrency?

What happens when the program is sequential, i.e. I execute function A and then function
B? Does this mean that each RDD first goes through function A, and then stream X is persisted,
but processed in function B only after the RDD has been processed by A?

On Sat, Oct 24, 2015 at 5:32 PM Andy Dang <<>> wrote:
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the
driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first
gets executed first – you can use a semaphore if you need to ensure the ordering of execution,
though I would assume that the ordering wouldn't matter.
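The two-threads-plus-semaphore idea above can be sketched in plain Python (no Spark involved – the `submit_action` function and the action names are stand-ins for calls like `foreachRDD` that a real driver would make):

```python
import threading

results = []            # records the order in which the "actions" are submitted
lock = threading.Lock()

def submit_action(name, gate=None, release=None):
    # `gate` and `release` are illustrative parameters, not a Spark API:
    # a semaphore passed as `gate` makes this thread wait its turn.
    if gate is not None:
        gate.acquire()
    with lock:
        results.append(name)   # stands in for submitting a job to Spark
    if release is not None:
        release.release()

# A semaphore starting at 0 forces action B to wait until A has been submitted.
order = threading.Semaphore(0)

t_a = threading.Thread(target=submit_action, args=("A",), kwargs={"release": order})
t_b = threading.Thread(target=submit_action, args=("B",), kwargs={"gate": order})
t_b.start()
t_a.start()
t_a.join()
t_b.join()

print(results)  # always ['A', 'B'], because B blocks on the semaphore until A releases it
```

Without the semaphore the two submissions could interleave either way, which is exactly the "whichever gets submitted first" behavior described above.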


On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora <<>> wrote:
I wanted to understand something about the internals of spark streaming executions.

If I have a stream X, and in my program I send stream X to function A and function B:

1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z to create stream
Z. Then I do a foreach operation on Z and print the output to a file.

2. Then in function B, I reduce stream X -> X2 (say the min value of each RDD), and print the
output to a file.

Are both functions being executed for each RDD in parallel? How does it work?
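The pipeline described in 1 and 2 can be sketched in plain Python (this is a logical simulation, not Spark code – the batches, the squaring transform, and the even-number filter are all made up for illustration; in a real job, A and B would be DStream operations on the same stream X):

```python
# Each inner list stands in for one micro-batch (RDD) of stream X.
stream_x = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]

def function_a(batch):
    # X -> Y: a transform (square each element) -- illustrative only
    y = [v * v for v in batch]
    # Y -> Z: a filter (keep even values) -- illustrative only
    z = [v for v in y if v % 2 == 0]
    return z  # in Spark, a foreach on Z would write this out to a file

def function_b(batch):
    # X -> X2: reduce each batch to its minimum value
    return min(batch)

# Both functions consume every batch of X independently of each other.
outputs_a = [function_a(batch) for batch in stream_x]
outputs_b = [function_b(batch) for batch in stream_x]

print(outputs_a)  # [[16], [], [4, 36]]
print(outputs_b)  # [1, 1, 2]
```

As the replies above explain, whether A and B actually run in parallel for each batch depends on whether their actions are submitted from separate threads in the driver.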

