spark-user mailing list archives

From "Lalwani, Jayesh" <>
Subject Re: Understanding what happens when a job is submitted to a cluster
Date Thu, 13 May 2021 15:56:29 GMT
The specifics depend on what's going on underneath. At the 10,000 foot level, you probably
know that Spark builds a logical execution plan as you call transformations. It converts that
into a physical execution plan when you call an action. The execution plan has stages that are
run sequentially. Stages are broken up into tasks that are run in parallel.
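
To make the "plan first, run on action" idea concrete, here is a toy model in plain Python.
It is only a sketch of the concept and is not Spark's API: the class ToyDataset and its
methods are made up for illustration.

```python
# Toy model of lazy evaluation. This is NOT Spark's API; all names are hypothetical.
class ToyDataset:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data
        self._plan = plan   # recorded transformations (stand-in for a logical plan)

    def map(self, fn):
        # Transformation: record the step, do no work yet.
        return ToyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, do no work yet.
        return ToyDataset(self._rows, self._plan + (("filter", pred),))

    def plan_steps(self):
        # Inspect the recorded plan without running it.
        return [kind for kind, _ in self._plan]

    def collect(self):
        # Action: walk the recorded plan and actually compute the result.
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(r) for r in out]
            else:
                out = [r for r in out if fn(r)]
        return out


ds = ToyDataset(range(5)).map(lambda x: x * 2).filter(lambda x: x > 2)
print(ds.plan_steps())  # nothing has been computed at this point
print(ds.collect())     # the work happens here
```

Nothing is computed until collect() is called, which is the point of the sketch.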

So, there are 2 questions to be answered: How does Spark determine stages? How does Spark break
up stages into tasks?

How does Spark determine stages? At a 5,000 foot view, Spark will break up a job into stages
at points where a shuffle is required. This usually happens if you are doing a join, aggregation
or repartition. The basic idea is that Spark is trying to minimize data movement. It looks at
the logical plan and divides it into stages such that data movement is minimized. The specifics
depend on what the job is doing.
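
The stage-splitting idea can be sketched in plain Python. This is a simplified illustration
and not how Spark's scheduler is implemented: it assumes a linear list of operation names,
and the SHUFFLE_OPS set and split_into_stages function are hypothetical.

```python
# Hypothetical set of operations that force a shuffle in this toy model.
SHUFFLE_OPS = {"join", "groupBy", "repartition"}


def split_into_stages(ops):
    """Group a linear list of operation names into stages.

    A new stage starts at every operation that needs a shuffle;
    everything else stays in the current stage.
    """
    stages = [[]]
    for op in ops:
        # Start a new stage at a shuffle, unless the current stage is still empty.
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


job = ["read", "filter", "groupBy", "map", "join", "write"]
for i, stage in enumerate(split_into_stages(job)):
    print(f"stage {i}: {stage}")
```

A real plan is a graph rather than a list (a join has two inputs), so treat this only as a
picture of where the stage boundaries fall.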

How does Spark break up a stage into tasks? At a 5,000 foot view, it is trying to break up
the stage into bite sized pieces that can be processed independently. The specifics depend
on what is being done in the stage. For example, let's say a stage is reading a file with
5 million rows, transforming it and writing it out. Spark will check how much memory it needs
for one row by looking at the dataframe schema. Let's say each row is 10 KB. Then it looks
at how much memory it has per executor. Let's say 100 MB. From this it determines that one
executor can process 10K rows at a time. So, each task can have 10K rows. 5M/10K = 500. It will
create 500 tasks. Each task will read 10K rows, transform them and write them to the output.
Again, this is an example. How tasks get divided depends a lot on what's happening in the stage.
Spark also optimizes the code and tries to push down predicates, which complicates things for
us users.
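
The arithmetic in the example can be written out as a small Python function. This only
reproduces the numbers above and is not Spark's actual partitioning logic. The function name
and parameters are hypothetical, and sizes are assumed to be decimal (10 KB = 10,000 bytes,
100 MB = 100,000,000 bytes).

```python
def estimate_task_count(total_rows, row_bytes, memory_bytes):
    """Return how many tasks are needed if each task holds as many rows as fit in memory."""
    rows_per_task = memory_bytes // row_bytes
    if rows_per_task < 1:
        raise ValueError("a single row does not fit in the given memory")
    # Ceiling division, so a partial last batch still gets its own task.
    return (total_rows + rows_per_task - 1) // rows_per_task


# Numbers from the example: 5 million rows, 10 KB per row, 100 MB of memory.
print(estimate_task_count(5_000_000, 10_000, 100_000_000))
```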

On 5/13/21, 10:21 AM, "" <> wrote:

    What happens when a job is submitted to a cluster? I know the 10,000
    foot overview of the Spark architecture. But I need the minute details as to
    how Spark estimates the resources to ask YARN for, what YARN's response is,
    etc. I need the *step by step* understanding of the complete process. I
    searched through the net but I couldn't find any good material on this. Can
    anyone help me here?

