spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: [SPARK STREAMING] Concurrent operations in spark streaming
Date Mon, 26 Oct 2015 09:57:49 GMT
If I understand the question correctly, not really. First of all, the easiest way to make sure
it works as expected is to check the visual DAG in the Spark UI.

It should map 1:1 to your code, and since I don’t see any shuffles in the operations below,
it should all execute in one stage, forking after X.
That means each executor core will process a partition completely in isolation,
most likely as 3 tasks (A, B, X2), probably in the order you define in code, although depending
on the data some tasks may be skipped or reordered.
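The single-stage pipelining described above can be illustrated with a plain-Python analogy (this is not the Spark API; the partition data and the transformations are hypothetical): narrow operations are composed into one pass, and each partition is processed end to end, in isolation from the others.

```python
# Plain-Python analogy of stage pipelining: narrow ops are composed
# and each partition is processed end to end, in isolation.

def pipeline(*ops):
    """Compose per-element operations the way Spark pipelines narrow
    transformations inside a single stage."""
    def run(partition):
        out = []
        for x in partition:
            for op in ops:
                x = op(x)
                if x is None:        # None models a filtered-out element
                    break
            else:
                out.append(x)
        return out
    return run

# Hypothetical transformations X -> Y -> Z
x_to_y = lambda v: v * 2
y_to_z = lambda v: v if v > 4 else None   # filter step

stage = pipeline(x_to_y, y_to_z)

partitions = [[1, 2, 3], [4, 5]]          # one "RDD" split into 2 partitions
results = [stage(p) for p in partitions]  # each partition runs independently
print(results)                            # [[6], [8, 10]]
```

No intermediate collection is materialized between the two operations, which is the sense in which the whole chain runs "in one stage" per partition.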

I’m curious – why do you ask? Do you have a particular concern or use case that relies
on ordering between A, B and X2?

-adrian

From: Nipun Arora
Date: Sunday, October 25, 2015 at 4:09 PM
To: Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

So essentially the driver/client program needs to explicitly use two threads to ensure concurrency?

What happens when the program is sequential, i.e. I execute function A and then function
B? Does this mean that each RDD first goes through function A, and then stream X is persisted,
but is processed by function B only after the RDD has been processed by A?

Thanks
Nipun
On Sat, Oct 24, 2015 at 5:32 PM Andy Dang <namd88@gmail.com> wrote:
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the
driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first
gets executed first; you can use a semaphore if you need to ensure the ordering of execution,
though I would assume that the ordering wouldn't matter.
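The two-thread pattern above can be sketched with the standard library (plain Python threading stands in for the actual Spark actions; `job_a` and `job_b` are hypothetical placeholders for the foreach and reduce calls): each thread submits one action, and a semaphore enforces the ordering when it matters.

```python
import threading

order = []
a_done = threading.Semaphore(0)   # released once "job A" has been submitted

def job_a():
    # stand-in for the foreach action in (1)
    order.append("A")
    a_done.release()

def job_b():
    # stand-in for the reduce action in (2); waits for A since ordering matters here
    a_done.acquire()
    order.append("B")

t1 = threading.Thread(target=job_a)
t2 = threading.Thread(target=job_b)
t2.start()   # started first, but still runs after A
t1.start()
t1.join(); t2.join()
print(order)   # ['A', 'B']
```

Without the semaphore, both threads would submit their jobs concurrently and the completion order would be up to the scheduler.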

-------
Regards,
Andy

On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora <nipunarora2512@gmail.com> wrote:
I wanted to understand something about the internals of Spark Streaming execution.

If I have a stream X, and in my program I send stream X to function A and function B:

1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z to create stream
Z. Then I do a forEach operation on Z and print the output to a file.

2. In function B, I reduce stream X -> X2 (say, the min value of each RDD), and print the
output to a file.

Are both functions being executed for each RDD in parallel? How does it work?

Thanks
Nipun
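The setup in the question can be sketched in plain Python (lists stand in for RDD micro-batches, and the transformations are hypothetical): stream X feeds both branch A (transform/filter to Z, then write each element) and branch B (reduce to a min).

```python
# Plain-Python sketch of the two branches; lists stand in for RDD batches.

def function_a(batch_x):
    # X -> Y -> Z: a transform and a filter, then "print" each element of Z
    batch_y = [v + 1 for v in batch_x]            # hypothetical transform
    batch_z = [v for v in batch_y if v % 2 == 0]  # hypothetical filter
    return batch_z                                # stand-in for the foreach/write

def function_b(batch_x):
    # X -> X2: reduce the batch to its min value, then "print" it
    return min(batch_x)

batch = [3, 4, 7, 8]       # one micro-batch of stream X
z = function_a(batch)      # branch A output: [4, 8]
x2 = function_b(batch)     # branch B output: 3
print(z, x2)
```

The question is whether the two branches run over each batch concurrently or one after the other, which the replies above address.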

