spark-dev mailing list archives

From Marco Gaido <marcogaid...@gmail.com>
Subject Re: eager execution and debuggability
Date Tue, 08 May 2018 21:35:32 GMT
I am not sure how this is useful. For students, it is important to
understand how Spark works. This can be critical in many decisions they have
to make (whether and what to cache, for instance) in order to have a
performant Spark application. An eager execution mode can probably help
them get something running more easily, but it also lets them use Spark
while knowing less about how it works, so they are likely to write worse
applications and to have more trouble debugging any problem
that may occur later (in production), which in turn affects their experience
with the tool.

Moreover, as Ryan also mentioned, there are tools/ways to force the
execution, which helps in the debugging phase. So they can achieve the same
result without much effort, but with a big difference: they are aware of
what is really happening, which may help them later.

Thanks,
Marco

2018-05-08 21:37 GMT+02:00 Ryan Blue <rblue@netflix.com.invalid>:

> At Netflix, we use Jupyter notebooks and consoles for interactive
> sessions. For anyone interested, this mode of interaction is really easy to
> add in Jupyter and PySpark. You would just define a different _repr_html_
> or __repr__ method for Dataset that runs a take(10) or take(100) and
> formats the result.
>
> That way, the output of a cell or console execution always causes the
> dataframe to run and get displayed for that immediate feedback. But, there
> is no change to Spark’s behavior because the action is run by the REPL, and
> only when a dataframe is a result of an execution in order to display it.
> Intermediate results wouldn’t be run, but that gives users a way to avoid
> too many executions and would still support method chaining in the
> dataframe API (which would be horrible with an aggressive execution model).
>
> There are ways to do this in JVM languages as well if you are using a
> Scala or Java interpreter (see jvm-repr
> <https://github.com/jupyter/jvm-repr>). This is actually what we do in
> our Spark-based SQL interpreter to display results.
>
> rb
>
> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> yeah we run into this all the time with new hires. they will send emails
>> explaining there is an error in the .write operation and they are debugging
>> the writing to disk, focusing on that piece of code :)
>>
>> unrelated, but another frequent cause for confusion is cascading errors.
>> like the FetchFailedException. they will be debugging the reducer task not
>> realizing the error happened before that, and the FetchFailedException is
>> not the root cause.
>>
>>
>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>> sending another email on what I've learned recently from Spark users. I
>>> recently talked to some educators that have been teaching Spark in their
>>> (top-tier) university classes. They are some of the most important users
>>> for adoption because of the multiplicative effect they have on the future
>>> generation.
>>>
>>> To my surprise, their single biggest ask is to enable eager
>>> execution mode on all operations for teaching and debuggability:
>>>
>>> (1) Most of the students are relatively new to programming, and they
>>> need multiple iterations to even get the most basic operation right. In
>>> these cases, in order to trigger an error, they would need to explicitly
>>> add actions, which is non-intuitive.
>>>
>>> (2) If they don't add explicit actions to every operation and there is a
>>> mistake, the error pops up somewhere later where an action is triggered.
>>> This is in a different position from the code that causes the problem, and
>>> it is difficult for students to correlate the two.
>>>
>>> I suspect in the real world a lot of Spark users also struggle in
>>> similar ways as these students. While eager execution is really not
>>> practical in big data, in learning environments or in development against
>>> small, sampled datasets it can be pretty helpful.
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
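
The REPL-side display hook Ryan describes above could be sketched roughly
as follows. This is only an illustration, not the Netflix implementation:
preview_html and FakeFrame are hypothetical names, and FakeFrame is a
stand-in that assumes nothing more than a `columns` attribute and a
`take(n)` method, so the sketch runs without a Spark installation.

```python
import html


def preview_html(df, n=10):
    """Run take(n) on df and render the returned rows as an HTML table."""
    rows = df.take(n)  # the action: this is the only place execution is forced
    header = "".join(
        "<th>{}</th>".format(html.escape(str(c))) for c in df.columns
    )
    body = "".join(
        "<tr>{}</tr>".format(
            "".join("<td>{}</td>".format(html.escape(str(v))) for v in row)
        )
        for row in rows
    )
    return "<table><tr>{}</tr>{}</table>".format(header, body)


class FakeFrame:
    """Hypothetical stand-in for a DataFrame (assumption, not a Spark class)."""

    def __init__(self, columns, rows):
        self.columns = columns
        self._rows = rows
        self.actions_run = 0  # counts how many times an action was triggered

    def take(self, n):
        self.actions_run += 1
        return self._rows[:n]

    # Jupyter looks up _repr_html_ only on the value a cell evaluates to,
    # so intermediate results in a method chain never trigger take().
    _repr_html_ = preview_html
```

With PySpark installed, the same function could presumably be attached with
`pyspark.sql.DataFrame._repr_html_ = preview_html`; that assignment is an
assumption of this sketch and is not taken from the thread.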
