spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: eager execution and debuggability
Date Tue, 08 May 2018 22:58:32 GMT
Yup. Sounds great. This is something simple Spark can do and provide huge
value to the end users.


On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rblue@netflix.com> wrote:

> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior to
> PySpark classes. We could also add a configuration property to determine
> whether the dataset evaluation is eager or not. That would make it turn-key
> for anyone running PySpark in Jupyter.
>
> For JVM languages, we could also add a dependency on jvm-repr and do the
> same thing.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>> s/underestimated/overestimated/
>>
>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Marco,
>>>
>>> There is understanding how Spark works, and there is finding bugs early
>>> in their own program. One can perfectly understand how Spark works and
>>> still find it valuable to get feedback asap, and that's why we built eager
>>> analysis in the first place.
>>>
>>> Also I'm afraid you've significantly underestimated the level of
>>> technical sophistication of users. In many cases they struggle to get
>>> anything to work, and performance optimization of their programs is
>>> secondary to getting things working. As John Ousterhout says, "the greatest
>>> performance improvement of all is when a system goes from not-working to
>>> working".
>>>
>>> I really like Ryan's approach. Would be great if it is something more
>>> turn-key.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaido91@gmail.com>
>>> wrote:
>>>
>>>> I am not sure how this is useful. For students, it is important to
>>>> understand how Spark works. This can be critical in many decision they have
>>>> to take (whether and what to cache for instance) in order to have
>>>> performant Spark application. Creating a eager execution probably can help
>>>> them having something running more easily, but let them also using Spark
>>>> knowing less about how it works, thus they are likely to write worse
>>>> application and to have more problems in debugging any kind of problem
>>>> which may later (in production) occur (therefore affecting their experience
>>>> with the tool).
>>>>
>>>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>>>> execution, helping in the debugging phase. So they can achieve without a
>>>> big effort the same result, but with a big difference: they are aware of
>>>> what is really happening, which may help them later.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rblue@netflix.com.invalid>:
>>>>
>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>> sessions. For anyone interested, this mode of interaction is really easy
to
>>>>> add in Jupyter and PySpark. You would just define a different
>>>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>>>> take(100) and formats the result.
>>>>>
>>>>> That way, the output of a cell or console execution always causes the
>>>>> dataframe to run and get displayed for that immediate feedback. But,
there
>>>>> is no change to Spark’s behavior because the action is run by the REPL,
and
>>>>> only when a dataframe is a result of an execution in order to display
it.
>>>>> Intermediate results wouldn’t be run, but that gives users a way to
avoid
>>>>> too many executions and would still support method chaining in the
>>>>> dataframe API (which would be horrible with an aggressive execution model).
>>>>>
>>>>> There are ways to do this in JVM languages as well if you are using a
>>>>> Scala or Java interpreter (see jvm-repr
>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we
do
>>>>> in our Spark-based SQL interpreter to display results.
>>>>>
>>>>> rb
>>>>> ​
>>>>>
>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <koert@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> yeah we run into this all the time with new hires. they will send
>>>>>> emails explaining there is an error in the .write operation and they
are
>>>>>> debugging the writing to disk, focusing on that piece of code :)
>>>>>>
>>>>>> unrelated, but another frequent cause for confusion is cascading
>>>>>> errors. like the FetchFailedException. they will be debugging the
reducer
>>>>>> task not realizing the error happened before that, and the
>>>>>> FetchFailedException is not the root cause.
>>>>>>
>>>>>>
>>>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <rxin@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Similar to the thread yesterday about improving ML/DL integration,
>>>>>>> I'm sending another email on what I've learned recently from
Spark users. I
>>>>>>> recently talked to some educators that have been teaching Spark
in their
>>>>>>> (top-tier) university classes. They are some of the most important
users
>>>>>>> for adoption because of the multiplicative effect they have on
the future
>>>>>>> generation.
>>>>>>>
>>>>>>> To my surprise the single biggest ask they want is to enable
eager
>>>>>>> execution mode on all operations for teaching and debuggability:
>>>>>>>
>>>>>>> (1) Most of the students are relatively new to programming, and
they
>>>>>>> need multiple iterations to even get the most basic operation
right. In
>>>>>>> these cases, in order to trigger an error, they would need to
explicitly
>>>>>>> add actions, which is non-intuitive.
>>>>>>>
>>>>>>> (2) If they don't add explicit actions to every operation and
there
>>>>>>> is a mistake, the error pops up somewhere later where an action
is
>>>>>>> triggered. This is in a different position from the code that
causes the
>>>>>>> problem, and difficult for students to correlate the two.
>>>>>>>
>>>>>>> I suspect in the real world a lot of Spark users also struggle
in
>>>>>>> similar ways as these students. While eager execution is really
not
>>>>>>> practical in big data, in learning environments or in development
against
>>>>>>> small, sampled datasets it can be pretty helpful.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Mime
View raw message