spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "assaf.mendelson" <>
Subject RE: Will .count() always trigger an evaluation of each row?
Date Sun, 19 Feb 2017 09:13:05 GMT
Actually, when I did a simple test on parquet (“somefile”).cache().count())
the UI showed me that the entire file is cached. Is this just a fluke?

In any case I believe the question is still valid, how to make sure a dataframe is cached.
Consider for example a case where we read from a remote host (which is costly) and we want
to make sure the original read is done at a specific time (when the network is less crowded).
I for one used .count() till now but if this is not guaranteed to cache, then how would I
do that? Of course I could always save the dataframe to disk but that would cost a lot more
in performance than I would like…

As for doing a map partitions for the dataset, wouldn’t that cause the row to be converted
to the case class for each row? That could also be heavy.
Maybe cache should have a lazy parameter which would be false by default but we could call
.cache(true) to make it materialize (similar to what we have with checkpoint).

From: Matei Zaharia [via Apache Spark Developers List] []
Sent: Sunday, February 19, 2017 9:30 AM
To: Mendelson, Assaf
Subject: Re: Will .count() always trigger an evaluation of each row?

Count is different on DataFrames and Datasets from RDDs. On RDDs, it always evaluates everything,
but on DataFrame/Dataset, it turns into the equivalent of "select count(*) from ..." in SQL,
which can be done without scanning the data for some data formats (e.g. Parquet). On the other
hand though, caching a DataFrame / Dataset does require everything to be cached.


On Feb 18, 2017, at 2:16 AM, Sean Owen <[hidden email]</user/SendEmail.jtp?type=node&node=21024&i=0>>

I think the right answer is "don't do that" but if you really had to you could trigger a Dataset
operation that does nothing per partition. I presume that would be more reliable because the
whole partition has to be computed to make it available in practice. Or, go so far as to loop
over every element.

On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <[hidden email]</user/SendEmail.jtp?type=node&node=21024&i=1>>

Especially during development, people often use .count() or .persist().count() to force evaluation
of all rows — exposing any problems, e.g. due to bad data — and to load data into cache
to speed up subsequent operations.

But as the optimizer gets smarter, I’m guessing it will eventually learn that it doesn’t
have to do all that work to give the correct count. (This blog post<>
suggests that something like this is already happening.) This will change Spark’s practical
behavior while technically preserving semantics.

What will people need to do then to force evaluation or caching?


If you reply to this email, your message will be added to the discussion below:
To start a new topic under Apache Spark Developers List, email<>
To unsubscribe from Apache Spark Developers List, click here<>.

View this message in context:
Sent from the Apache Spark Developers List mailing list archive at
View raw message