spark-dev mailing list archives

From Tim Hunter <>
Subject Re: eager execution and debuggability
Date Wed, 09 May 2018 12:39:34 GMT
The repr() trick is neat when working on a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentions, this
is very convenient when working on a lot of chained UDFs, and it is a
standard trick in lazy environments and languages.
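
A minimal sketch of that evaluate idea, written against a toy stand-in rather than Spark itself. The LazyFrame class and its map and collect methods are hypothetical names made up for illustration; only the evaluate(dataframe) -> DataFrame shape comes from the message above. With a real PySpark DataFrame, a comparable helper is often written as df.cache() followed by df.count(), but that is an assumption here, not something this thread specifies.

```python
class LazyFrame:
    """Toy stand-in for a lazy dataframe (hypothetical, not a Spark class)."""

    def __init__(self, rows, steps=()):
        self._rows = list(rows)      # input rows
        self._steps = tuple(steps)   # pending transformations, not yet run

    def map(self, fn):
        # Lazy: only records the function; fn is not called here.
        return LazyFrame(self._rows, self._steps + (fn,))

    def collect(self):
        # Action: runs every recorded step over every row.
        rows = self._rows
        for fn in self._steps:
            rows = [fn(row) for row in rows]
        return rows


def evaluate(frame):
    """Force materialization now and return a frame holding the results."""
    return LazyFrame(frame.collect())


# Building the plan raises nothing, even though "x" cannot be parsed.
parsed = LazyFrame(["1", "2", "x"]).map(int)

# Wrapping the step in evaluate() surfaces the failure at this line
# instead of at some later action further down the chain.
try:
    checked = evaluate(parsed)
except ValueError as err:
    print("parse step failed:", err)
```

Calling evaluate after each UDF in a chain narrows a failure to a single step, at the cost of running the pipeline once per call, so it is probably best kept to small or sampled inputs.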


On Wed, May 9, 2018 at 3:26 AM, Reynold Xin <> wrote:

> Yes would be great if possible but it’s non trivial (might be impossible
> to do in general; we already have stacktraces that point to line numbers
> when an error occurs in UDFs but clearly that’s not sufficient). Also in
> environments like REPL it’s still more useful to show error as soon as it
> occurs, rather than showing it potentially 30 lines later.
> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <> wrote:
>> This may be technically impractical, but it would be fantastic if we
>> could make it easier to debug Spark programs without needing to rely on
>> eager execution. Sprinkling .count() and .checkpoint() at various points
>> in my code is still a debugging technique I use, but it always makes me
>> wish Spark could point more directly to the offending transformation when
>> something goes wrong.
>> Is it somehow possible to have each individual operator (is that the
>> correct term?) in a DAG include metadata pointing back to the line of code
>> that generated the operator? That way when an action triggers an error, the
>> failing operation can point to the relevant line of code — even if it’s a
>> transformation — and not just the action on the tail end that triggered the
>> error.
>> I don’t know how feasible this is, but addressing it would directly solve
>> the issue of linking failures to the responsible transformation, as opposed
>> to leaving the user to break up a chain of transformations with several
>> debug actions. And this would benefit new and experienced users alike.
>> Nick
>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue <> wrote:
>>> I've opened SPARK-24215 to track this.
>>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <> wrote:
>>>> Yup. Sounds great. This is something simple Spark can do and provide
>>>> huge value to the end users.
>>>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <> wrote:
>>>>> Would be great if it is something more turn-key.
>>>>> We can easily add the __repr__ and _repr_html_ methods and behavior
>>>>> to PySpark classes. We could also add a configuration property to determine
>>>>> whether the dataset evaluation is eager or not. That would make it turn-key
>>>>> for anyone running PySpark in Jupyter.
>>>>> For JVM languages, we could also add a dependency on jvm-repr and do
>>>>> the same thing.
>>>>> rb
>>>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <>
>>>>> wrote:
>>>>>> s/underestimated/overestimated/
>>>>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <>
>>>>>> wrote:
>>>>>>> Marco,
>>>>>>> There is understanding how Spark works, and there is finding bugs
>>>>>>> early in their own program. One can perfectly understand how Spark
>>>>>>> works and still find it valuable to get feedback asap, and that's why
>>>>>>> we built eager analysis in the first place.
>>>>>>> Also I'm afraid you've significantly underestimated the level of
>>>>>>> technical sophistication of users. In many cases they struggle to get
>>>>>>> anything to work, and performance optimization of their programs is
>>>>>>> secondary to getting things working. As John Ousterhout says, "the
>>>>>>> greatest performance improvement of all is when a system goes from
>>>>>>> not-working to working".
>>>>>>> I really like Ryan's approach. Would be great if it is something
>>>>>>> more turn-key.
>>>>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <>
>>>>>>> wrote:
>>>>>>>> I am not sure how this is useful. For students, it is important to
>>>>>>>> understand how Spark works. This can be critical in many decisions
>>>>>>>> they have to make (whether and what to cache, for instance) in order
>>>>>>>> to have a performant Spark application. Creating an eager execution
>>>>>>>> mode can probably help them get something running more easily, but it
>>>>>>>> also lets them use Spark while knowing less about how it works, so
>>>>>>>> they are likely to write worse applications and to have more trouble
>>>>>>>> debugging any problem that may occur later (in production), which
>>>>>>>> affects their experience with the tool.
>>>>>>>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>>>>>>>> execution, helping in the debugging phase. So they can achieve the
>>>>>>>> same result without a big effort, but with a big difference: they are
>>>>>>>> aware of what is really happening, which may help them later.
>>>>>>>> Thanks,
>>>>>>>> Marco
>>>>>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <>:
>>>>>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>>>>>> sessions. For anyone interested, this mode of interaction is really
>>>>>>>>> easy to add in Jupyter and PySpark. You would just define a different
>>>>>>>>> _repr_html_ or __repr__ method for Dataset that runs a take(10) or
>>>>>>>>> take(100) and formats the result.
>>>>>>>>> That way, the output of a cell or console execution always causes
>>>>>>>>> the dataframe to run and get displayed for that immediate feedback.
>>>>>>>>> But, there is no change to Spark’s behavior because the action is run
>>>>>>>>> by the REPL, and only when a dataframe is a result of an execution in
>>>>>>>>> order to display it. Intermediate results wouldn’t be run, but that
>>>>>>>>> gives users a way to avoid too many executions and would still
>>>>>>>>> support method chaining in the dataframe API (which would be horrible
>>>>>>>>> with an aggressive execution model).
>>>>>>>>> There are ways to do this in JVM languages as well if you are using
>>>>>>>>> a Scala or Java interpreter (see jvm-repr). This is actually what we
>>>>>>>>> do in our Spark-based SQL interpreter to display results.
>>>>>>>>> rb
>>>>>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <>
>>>>>>>>> wrote:
>>>>>>>>>> yeah we run into this all the time with new hires. they will send
>>>>>>>>>> emails explaining there is an error in the .write operation and they
>>>>>>>>>> are debugging the writing to disk, focusing on that piece of code :)
>>>>>>>>>> unrelated, but another frequent cause for confusion is cascading
>>>>>>>>>> errors. like the FetchFailedException. they will be debugging the
>>>>>>>>>> reducer task not realizing the error happened before that, and the
>>>>>>>>>> FetchFailedException is not the root cause.
>>>>>>>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <>
>>>>>>>>>> wrote:
>>>>>>>>>>> Similar to the thread yesterday about improving integration, I'm
>>>>>>>>>>> sending another email on what I've learned recently from Spark
>>>>>>>>>>> users. I recently talked to some educators that have been teaching
>>>>>>>>>>> Spark in their (top-tier) university classes. They are some of the
>>>>>>>>>>> most important users for adoption because of the multiplicative
>>>>>>>>>>> effect they have on the future generation.
>>>>>>>>>>> To my surprise the single biggest ask they want is to enable eager
>>>>>>>>>>> execution mode on all operations for teaching and debuggability:
>>>>>>>>>>> (1) Most of the students are relatively new to programming, and
>>>>>>>>>>> they need multiple iterations to even get the most basic operation
>>>>>>>>>>> right. In these cases, in order to trigger an error, they would
>>>>>>>>>>> need to explicitly add actions, which is non-intuitive.
>>>>>>>>>>> (2) If they don't add explicit actions to every operation and
>>>>>>>>>>> there is a mistake, the error pops up somewhere later where an
>>>>>>>>>>> action is triggered. This is in a different position from the code
>>>>>>>>>>> that causes the problem, and difficult for students to correlate
>>>>>>>>>>> the two.
>>>>>>>>>>> I suspect in the real world a lot of Spark users also struggle in
>>>>>>>>>>> similar ways as these students. While eager execution is really not
>>>>>>>>>>> practical in big data, in learning environments or in development
>>>>>>>>>>> against small, sampled datasets it can be pretty helpful.
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix