spark-dev mailing list archives

From "Lalwani, Jayesh" <Jayesh.Lalw...@capitalone.com>
Subject Re: eager execution and debuggability
Date Thu, 10 May 2018 13:32:20 GMT
If they are struggling to find bugs in their program because of the lazy execution model of Spark,
they are going to struggle to debug issues when the program runs into problems in production.
Learning how to debug Spark is part of learning Spark. It’s better that they run into issues
in the classroom, and spend the time and effort learning how to debug such issues, rather than deploy
critical code to production and not know how to resolve the issues.

I would say that if they are struggling to read and analyze a stack trace, then they are
missing a prerequisite. They need to be taught how to look at a stack trace critically before
they start on Spark. Learning how to analyze stack traces is part of learning Scala/Java/Python.
They need to drop Spark, and go back to learning core Scala/Java/Python.



From: Reynold Xin <rxin@databricks.com>
Date: Tuesday, May 8, 2018 at 6:45 PM
To: Marco Gaido <marcogaido91@gmail.com>
Cc: Ryan Blue <rblue@netflix.com>, Koert Kuipers <koert@tresata.com>, dev <dev@spark.apache.org>
Subject: Re: eager execution and debuggability

Marco,

There is understanding how Spark works, and there is finding bugs early in their own program.
One can perfectly understand how Spark works and still find it valuable to get feedback asap,
and that's why we built eager analysis in the first place.

Also I'm afraid you've significantly underestimated the level of technical sophistication
of users. In many cases they struggle to get anything to work, and performance optimization
of their programs is secondary to getting things working. As John Ousterhout says, "the greatest
performance improvement of all is when a system goes from not-working to working".

I really like Ryan's approach. Would be great if it is something more turn-key.

On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaido91@gmail.com> wrote:
I am not sure how this is useful. For students, it is important to understand how Spark works.
This can be critical in many decisions they have to make (whether and what to cache, for instance)
in order to have a performant Spark application. Creating an eager execution mode can probably help
them get something running more easily, but it also lets them use Spark while knowing less about
how it works. Thus they are likely to write worse applications and to have more trouble
debugging any kind of problem which may later (in production) occur (therefore affecting their
experience with the tool).

Moreover, as Ryan also mentioned, there are tools/ways to force the execution, which helps in
the debugging phase. So they can achieve the same result without much effort, but with a
big difference: they are aware of what is really happening, which may help them later.

Thanks,
Marco

2018-05-08 21:37 GMT+02:00 Ryan Blue <rblue@netflix.com.invalid>:

At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For anyone interested,
this mode of interaction is really easy to add in Jupyter and PySpark. You would just define
a different _repr_html_ or __repr__ method for Dataset that runs a take(10) or take(100) and formats
the result.

That way, the output of a cell or console execution always causes the dataframe to run and
get displayed for that immediate feedback. But, there is no change to Spark’s behavior because
the action is run by the REPL, and only when a dataframe is a result of an execution in order
to display it. Intermediate results wouldn’t be run, but that gives users a way to avoid
too many executions and would still support method chaining in the dataframe API (which would
be horrible with an aggressive execution model).
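
A minimal sketch of the display hook described above, not Netflix's actual implementation. It assumes only that the dataframe class exposes a `columns` list and a `take(n)` action, as PySpark's DataFrame does. The helper `install_eager_repr` and the `FakeFrame` stand-in are hypothetical names, introduced so the example runs without a Spark installation.

```python
import html


def install_eager_repr(df_class, n=10):
    """Attach a _repr_html_ method to df_class that runs take(n) and renders a table."""

    def _repr_html_(self):
        rows = self.take(n)  # the action: runs only when the notebook displays the object
        header = "".join(f"<th>{html.escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{html.escape(str(v))}</td>" for v in row) + "</tr>"
            for row in rows
        )
        return f"<table><tr>{header}</tr>{body}</table>"

    df_class._repr_html_ = _repr_html_
    return df_class


class FakeFrame:
    """Hypothetical stand-in for a dataframe; counts how often an action runs."""

    def __init__(self, columns, rows):
        self.columns = columns
        self._rows = rows
        self.actions = 0

    def take(self, n):
        self.actions += 1
        return self._rows[:n]


install_eager_repr(FakeFrame, n=2)
frame = FakeFrame(["id", "name"], [(1, "a"), (2, "b"), (3, "c")])
rendered = frame._repr_html_()  # what Jupyter would call to display a cell result
```

Because Jupyter calls `_repr_html_` only on the value a cell evaluates to, intermediate dataframes in a method chain never trigger `take`, which matches the behavior described above.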

There are ways to do this in JVM languages as well if you are using a Scala or Java interpreter
(see jvm-repr <https://github.com/jupyter/jvm-repr>).
This is actually what we do in our Spark-based SQL interpreter to display results.

rb

On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <koert@tresata.com> wrote:
yeah we run into this all the time with new hires. they will send emails explaining there
is an error in the .write operation and they are debugging the writing to disk, focusing on
that piece of code :)
unrelated, but another frequent cause for confusion is cascading errors. like the FetchFailedException.
they will be debugging the reducer task not realizing the error happened before that, and
the FetchFailedException is not the root cause.


On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <rxin@databricks.com> wrote:
Similar to the thread yesterday about improving ML/DL integration, I'm sending another email
on what I've learned recently from Spark users. I recently talked to some educators that have
been teaching Spark in their (top-tier) university classes. They are some of the most important
users for adoption because of the multiplicative effect they have on the future generation.

To my surprise, their single biggest ask is to enable an eager execution mode on all operations
for teaching and debuggability:

(1) Most of the students are relatively new to programming, and they need multiple iterations
to even get the most basic operation right. In these cases, in order to trigger an error,
they would need to explicitly add actions, which is non-intuitive.

(2) If they don't add explicit actions to every operation and there is a mistake, the error
pops up somewhere later where an action is triggered. This is in a different position from
the code that causes the problem, and it is difficult for students to correlate the two.
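
The effect in (1) and (2) can be illustrated with a small sketch in plain Python. This is not Spark code: `LazyPipeline` is a hypothetical stand-in that, like a lazy dataframe, records transformations and runs them only when an action is called.

```python
class LazyPipeline:
    """Hypothetical, minimal lazy pipeline: transformations are recorded, not run."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        # Transformation: only records fn; nothing executes here.
        return LazyPipeline(self.data, self.steps + (fn,))

    def collect(self):
        # Action: every recorded step runs now, for every item.
        out = []
        for item in self.data:
            for fn in self.steps:
                item = fn(item)
            out.append(item)
        return out


records = LazyPipeline(["1", "2", "oops"])
parsed = records.map(int)               # the bug is on this line, but no error is raised yet
doubled = parsed.map(lambda x: x * 2)   # still no error

try:
    doubled.collect()                   # the error surfaces only here, at the action
    error_site = None
except ValueError as exc:
    error_site = "collect"
    message = str(exc)
```

The faulty `map(int)` line completes without complaint, and the `ValueError` is raised two statements later from `collect()`, which is the mismatch between cause and symptom that students find hard to correlate.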

I suspect that in the real world a lot of Spark users also struggle in ways similar to these students.
While eager execution is really not practical in big data, in learning environments or in
development against small, sampled datasets it can be pretty helpful.


--
Ryan Blue
Software Engineer
Netflix
