spark-user mailing list archives

From Michael Armbrust
Subject Re: SchemaRDD compute function
Date Wed, 26 Nov 2014 17:37:28 GMT
Exactly how the query is executed actually depends on a couple of factors
as we do a bunch of optimizations based on the top physical operator and
the final RDD operation that is performed.  In general the compute function
is only used when you are doing SQL followed by other RDD operations (map,
flatMap, etc).  When you call collect we usually call collect directly on
the underlying physical RDD (which is not exposed to users since it plays
tricks like object reuse under the covers).  However, if your query has a
LIMIT then we perform a take, and if it has both an ORDER BY and a LIMIT then
we perform a takeOrdered, etc.
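
To make the dispatch described above concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's implementation: the types Plan, Scan, Sort and Limit and the object ToyExecutor are hypothetical names invented for illustration, and rows are simplified to Int. It only models how collect() might pick a strategy from the top operators of a plan; it does not model the compute() path used when SQL is followed by other RDD operations.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's classes.
sealed trait Plan
case class Scan(rows: Seq[Int]) extends Plan           // source of rows
case class Sort(child: Plan) extends Plan              // ORDER BY (ascending)
case class Limit(n: Int, child: Plan) extends Plan     // LIMIT n

object ToyExecutor {
  // Naive evaluation of a plan: materialize every row of every operator.
  def rows(plan: Plan): Seq[Int] = plan match {
    case Scan(rs)        => rs
    case Sort(child)     => rows(child).sorted
    case Limit(n, child) => rows(child).take(n)
  }

  // Pick a strategy for collect() by inspecting the top operator(s).
  // Returns the name of the chosen strategy together with the result.
  def collect(plan: Plan): (String, Seq[Int]) = plan match {
    // ORDER BY + LIMIT: stands in for takeOrdered. A real takeOrdered
    // would avoid the full sort that this toy version performs.
    case Limit(n, Sort(child)) => ("takeOrdered", rows(child).sorted.take(n))
    // LIMIT only: stands in for take.
    case Limit(n, child)       => ("take", rows(child).take(n))
    // Anything else: a plain collect of all rows.
    case other                 => ("collect", rows(other))
  }
}
```

In this sketch, ToyExecutor.collect(Limit(2, Sort(Scan(Seq(5, 3, 9, 1))))) selects the "takeOrdered" branch, Limit(2, Scan(...)) selects "take", and a bare Scan(...) falls through to "collect".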

On Wed, Nov 26, 2014 at 5:05 AM, Jörg Schad wrote:

> Hi,
> I have a short question regarding the compute() of a SchemaRDD.
> For SchemaRDD the actual queryExecution seems to be triggered via
> collect(), while compute() only triggers the compute() of the parent and
> copies the data (please correct me if I am wrong!).
> Is this compute() triggered at all when I do something like:
> val schemaRDD2 = schemaRDD.where(...)
> schemaRDD2.collect()
> And if not, when is the compute function triggered, and what is the intent
> behind it?
> Sorry if this is a trivial question, I am just getting started with Spark
> (SQL)....
> Thanks,
> Joerg
