spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: eager execution and debuggability
Date Tue, 08 May 2018 23:08:32 GMT
I've opened SPARK-24215 to track this.

On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <rxin@databricks.com> wrote:

> Yup. Sounds great. This is something simple Spark can do and provide huge
> value to the end users.
>
>
> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rblue@netflix.com> wrote:
>
>> Would be great if it is something more turn-key.
>>
>> We can easily add the __repr__ and _repr_html_ methods and behavior to
>> PySpark classes. We could also add a configuration property to determine
>> whether the dataset evaluation is eager or not. That would make it turn-key
>> for anyone running PySpark in Jupyter.
>>
>> For JVM languages, we could also add a dependency on jvm-repr and do the
>> same thing.
>>
>> rb
>>
>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> s/underestimated/overestimated/
>>>
>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <rxin@databricks.com> wrote:
>>>
>>>> Marco,
>>>>
>>>> There is understanding how Spark works, and there is finding bugs early
>>>> in their own program. One can perfectly understand how Spark works and
>>>> still find it valuable to get feedback asap, and that's why we built eager
>>>> analysis in the first place.
>>>>
>>>> Also I'm afraid you've significantly underestimated the level of
>>>> technical sophistication of users. In many cases they struggle to get
>>>> anything to work, and performance optimization of their programs is
>>>> secondary to getting things working. As John Ousterhout says, "the greatest
>>>> performance improvement of all is when a system goes from not-working to
>>>> working".
>>>>
>>>> I really like Ryan's approach. Would be great if it is something more
>>>> turn-key.
>>>>
>>>>
>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaido91@gmail.com>
>>>> wrote:
>>>>
>>>>> I am not sure how this is useful. For students, it is important to
>>>>> understand how Spark works. This can be critical in many decisions they
>>>>> have to make (whether and what to cache, for instance) in order to have
>>>>> a performant Spark application. Creating an eager execution mode can
>>>>> probably help them get something running more easily, but it also lets
>>>>> them use Spark knowing less about how it works, so they are likely to
>>>>> write worse applications and to have more trouble debugging any problem
>>>>> that may occur later (in production), which affects their experience
>>>>> with the tool.
>>>>>
>>>>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>>>>> execution, which helps in the debugging phase. So they can achieve the
>>>>> same result without much effort, but with a big difference: they are
>>>>> aware of what is really happening, which may help them later.
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rblue@netflix.com.invalid>:
>>>>>
>>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>>> sessions. For anyone interested, this mode of interaction is really
>>>>>> easy to add in Jupyter and PySpark. You would just define a different
>>>>>> _repr_html_ or __repr__ method for Dataset that runs a take(10) or
>>>>>> take(100) and formats the result.
>>>>>>
>>>>>> That way, the output of a cell or console execution always causes the
>>>>>> dataframe to run and get displayed for that immediate feedback. But,
>>>>>> there is no change to Spark’s behavior because the action is run by
>>>>>> the REPL, and only when a dataframe is a result of an execution in
>>>>>> order to display it. Intermediate results wouldn’t be run, but that
>>>>>> gives users a way to avoid too many executions and would still support
>>>>>> method chaining in the dataframe API (which would be horrible with an
>>>>>> aggressive execution model).
>>>>>>
>>>>>> There are ways to do this in JVM languages as well if you are using a
>>>>>> Scala or Java interpreter (see jvm-repr
>>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we do
>>>>>> in our Spark-based SQL interpreter to display results.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <koert@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> yeah we run into this all the time with new hires. they will send
>>>>>>> emails explaining there is an error in the .write operation and they
>>>>>>> are debugging the writing to disk, focusing on that piece of code :)
>>>>>>>
>>>>>>> unrelated, but another frequent cause for confusion is cascading
>>>>>>> errors. like the FetchFailedException. they will be debugging the
>>>>>>> reducer task not realizing the error happened before that, and the
>>>>>>> FetchFailedException is not the root cause.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Similar to the thread yesterday about improving ML/DL integration,
>>>>>>>> I'm sending another email on what I've learned recently from Spark
>>>>>>>> users. I recently talked to some educators that have been teaching
>>>>>>>> Spark in their (top-tier) university classes. They are some of the
>>>>>>>> most important users for adoption because of the multiplicative
>>>>>>>> effect they have on the future generation.
>>>>>>>>
>>>>>>>> To my surprise the single biggest ask they want is to enable eager
>>>>>>>> execution mode on all operations for teaching and debuggability:
>>>>>>>>
>>>>>>>> (1) Most of the students are relatively new to programming, and
>>>>>>>> they need multiple iterations to even get the most basic operation
>>>>>>>> right. In these cases, in order to trigger an error, they would need
>>>>>>>> to explicitly add actions, which is non-intuitive.
>>>>>>>>
>>>>>>>> (2) If they don't add explicit actions to every operation and there
>>>>>>>> is a mistake, the error pops up somewhere later where an action is
>>>>>>>> triggered. This is in a different position from the code that causes
>>>>>>>> the problem, and it is difficult for students to correlate the two.
>>>>>>>>
>>>>>>>> I suspect in the real world a lot of Spark users also struggle in
>>>>>>>> similar ways as these students. While eager execution is really not
>>>>>>>> practical in big data, in learning environments or in development
>>>>>>>> against small, sampled datasets it can be pretty helpful.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


-- 
Ryan Blue
Software Engineer
Netflix
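
For readers who want to try the notebook-side approach discussed in this thread, here is a minimal sketch in Python. It is not the code used at Netflix and not what SPARK-24215 implements. The helper names (eager_repr, eager_repr_html, install) and the FakeFrame class are hypothetical, made up for illustration. The sketch only assumes the object exposes a `columns` list and a `take(n)` method that returns row tuples, which is the shape the thread describes for a dataframe.

```python
import html


def eager_repr(df, n=10):
    """Plain-text repr: runs take(n) and formats the rows as tab-separated text."""
    rows = df.take(n)  # the action: this is the point where the plan executes
    header = "\t".join(df.columns)
    body = "\n".join("\t".join(str(v) for v in row) for row in rows)
    return header + "\n" + body if body else header


def eager_repr_html(df, n=10):
    """HTML repr picked up by Jupyter: runs take(n) and builds a simple table."""
    rows = df.take(n)
    head = "".join(f"<th>{html.escape(c)}</th>" for c in df.columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def install(cls, n=10):
    """Attach both repr hooks to a class (hypothetical helper)."""
    cls.__repr__ = lambda self: eager_repr(self, n)
    cls._repr_html_ = lambda self: eager_repr_html(self, n)


class FakeFrame:
    """Hypothetical stand-in for a dataframe so the sketch runs without Spark."""

    def __init__(self, columns, rows):
        self.columns = columns
        self._rows = rows
        self.actions = 0  # counts how often an "action" was triggered

    def take(self, n):
        self.actions += 1
        return self._rows[:n]


install(FakeFrame, n=2)
df = FakeFrame(["id", "name"], [(1, "a"), (2, "b"), (3, "c")])
print(repr(df))
```

Nothing runs until the REPL or notebook asks for a representation of the final result, so method chaining stays lazy. Calling install on PySpark's DataFrame class should work the same way in principle, but that is an assumption and has not been tested here.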
