spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Will .count() always trigger an evaluation of each row?
Date Sat, 18 Feb 2017 10:16:23 GMT
I think the right answer is "don't do that," but if you really had to, you
could trigger a Dataset operation that does nothing per partition. I
presume that would be more reliable, because in practice the whole
partition has to be computed to make it available. Or, go so far as to
loop over every element.

On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <nicholas.chammas@gmail.com>
wrote:

> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows — exposing any
> problems, e.g. due to bad data — and to load data into cache to speed up
> subsequent operations.
>
> But as the optimizer gets smarter, I’m guessing it will eventually learn
> that it doesn’t have to do all that work to give the correct count. (This
> blog post
> <https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
> suggests that something like this is already happening.) This will change
> Spark’s practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
>
