spark-user mailing list archives

From "Lalwani, Jayesh" <jlalw...@amazon.com.INVALID>
Subject Re: Understanding what happens when a job is submitted to a cluster
Date Thu, 13 May 2021 18:07:09 GMT
    1. How does spark know the data size is 5 million?
Depends on the source. Some sources (databases, Parquet) tell you directly; Parquet, for example, stores row counts in its file footer metadata. Other sources (CSV, JSON) carry no row-count metadata, so the size has to be guesstimated, typically from file sizes or a sample of the data.
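To make the "guesstimate" idea concrete, here is a toy sketch of how a planner can extrapolate a row count for a format with no metadata: sample the first few lines, compute their average byte length, and divide the total file size by that average. The helper `estimate_row_count` is made up for illustration; it is not a Spark API and not Spark's actual planner code.

```python
import os

def estimate_row_count(path, sample_rows=100):
    """Guess the number of rows in a CSV by sampling.

    Reads up to `sample_rows` lines, computes their average byte
    length, and divides the total file size by that average. This
    mirrors the general idea behind "guesstimated" sources: with no
    stored metadata, the planner extrapolates from a sample.
    """
    total_bytes = os.path.getsize(path)
    sampled = 0
    sampled_bytes = 0
    with open(path, "rb") as f:
        for line in f:
            sampled += 1
            sampled_bytes += len(line)
            if sampled >= sample_rows:
                break
    if sampled == 0:
        return 0
    avg_row_bytes = sampled_bytes / sampled
    return round(total_bytes / avg_row_bytes)
```

The estimate is exact when rows are uniform in length and degrades as row lengths vary, which is why formats that store real statistics (Parquet footers, database catalogs) give the optimizer much better numbers.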
    2. Are there any books or documentation that takes one simple job and goes
    deeper in terms of understanding what happens under the hood?
Jacek Laskowski has a good web book
(https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/). Most people who understand
what's going on under the hood have dug into the code.


On 5/13/21, 1:00 PM, "abhilash.kr" <abhilash.khokle@gmail.com> wrote:




    Thank you. This was helpful. I have follow up questions.

    1. How does spark know the data size is 5 million?
    2. Are there any books or documentation that takes one simple job and goes
    deeper in terms of understanding what happens under the hood?




    --
    Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

    ---------------------------------------------------------------------
    To unsubscribe e-mail: user-unsubscribe@spark.apache.org

