spark-user mailing list archives

From kiran lonikar <loni...@gmail.com>
Subject Re: columnar structure of RDDs from Parquet or ORC files
Date Mon, 08 Jun 2015 16:09:04 GMT
Hi Cheng, Ayan,

Thanks for the answers. I like the rule of thumb. I skimmed through the
DataFrame, SQLContext and sql.execution.basicOperators.scala code, and it
is apparent that these functions are lazily evaluated. The SQLContext.load
functions are similar to the SparkContext.textFile family of functions,
which simply create an RDD; the data is actually loaded only when an action
is invoked.
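
A minimal sketch of this laziness, assuming the Spark 1.4-style API
(sqlContext.read and DataFrame.write) and a local master. The object name,
temporary paths and column names are hypothetical and exist only for
illustration:

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; local master chosen only for illustration.
object LazyLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-load-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Write two tiny Parquet datasets so the sketch needs no external files.
    // (Writing is itself an action and runs immediately.)
    val dir = Files.createTempDirectory("lazy-load-sketch")
    val pathA = dir.resolve("a.parquet").toString
    val pathB = dir.resolve("b.parquet").toString
    sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("id", "tag").write.parquet(pathA)
    sc.parallelize(Seq((3, "z"))).toDF("id", "tag").write.parquet(pathB)

    // Each call returns a DataFrame, so no row data is scanned yet
    // (Spark may still read file metadata here to work out the schema).
    val a = sqlContext.read.parquet(pathA)
    val b = sqlContext.read.parquet(pathB)
    val combined = a.unionAll(b) // also returns a DataFrame: still lazy

    // count() returns a Long rather than a DataFrame, so it is an action
    // and triggers the actual scan of both datasets.
    val rows = combined.count()
    println(s"rows = $rows") // 2 + 1 = 3 rows

    sc.stop()
  }
}
```

Since unionAll only extends the query plan, issuing the loads and the
union from separate threads should not by itself speed anything up; the
work happens when the action runs.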

I would like to modify the "formula" for DF. I think it should be *DF = RDD
+ Schema + additional methods (like loading/saving from/to JDBC) + columnar
layout.* DF is not a simple wrapper around RDD.

As another interest, I wanted to check whether some of the DF execution
functions can be executed on GPUs. For that to happen, the columnar layout
is important. This is where DF scores over ordinary RDDs.

It seems the batch size defined by
spark.sql.inMemoryColumnarStorage.batchSize defaults to 10000 rows. I
wonder if it can be set to 100K or 1 million so that computations
involving primitive data types can be carried out efficiently on a GPU (if
present). Larger batches mean data can be moved to/from GPU RAM in fewer
transfers. Also, at least initially, the other parameter
spark.sql.inMemoryColumnarStorage.compressed will have to be set to false,
since decompressing on a GPU is not so straightforward (there are issues
of how much data each GPU thread should handle, and of uncoalesced memory
access).
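
A sketch of how the two settings could be changed, assuming a Spark
1.x-style SQLContext and a local master. The object name, column names and
the 100000 value are illustrative only, and the settings are applied before
cache() on the assumption that they are read when the cached relation is
built:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; local master chosen only for illustration.
object ColumnarCacheConfigSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("columnar-cache-config-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Larger batches for the in-memory columnar cache (default is 10000 rows).
    // For an 8-byte primitive column, 100000 rows is about 800,000 bytes of
    // raw values per column batch, versus about 80,000 bytes at the default
    // (ignoring any per-batch overhead).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "100000")
    // Keep cached columns uncompressed so they could be handed over as-is.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")

    // Illustrative data: one Long column and one Double column.
    val df = sc.parallelize(1L to 1000000L)
      .map(i => (i, i * 0.5))
      .toDF("id", "value")

    df.cache() // lazy: only marks the DataFrame for columnar caching
    df.count() // action: materializes the cached column batches

    sc.stop()
  }
}
```

These settings only affect the in-memory columnar cache, so they matter for
DataFrames that are cached, not for data read directly from Parquet or ORC.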

-Kiran


On Mon, Jun 8, 2015 at 8:25 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

> You may refer to the DataFrame Scaladoc:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
>
> Methods listed in "Language Integrated Queries" and "RDD Options" can be
> viewed as "transformations", and those listed in "Actions" are, of course,
> actions.  As for SQLContext.load, it's listed in the "Generic Data Sources"
> section
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>
> I think a simple rule can be: if a DataFrame method or a SQLContext method
> returns a DataFrame or an RDD, then it is lazily evaluated, since DataFrame
> and RDD are both lazily evaluated.
>
> Cheng
>
>
> On 6/8/15 8:11 PM, kiran lonikar wrote:
>
> Thanks. Can you point me to a place in the SQL programming guide or the
> DataFrame Scaladoc where transformations and actions are grouped, as they
> are for RDDs?
>
> Also if you can tell me if sqlContext.load and unionAll are
> transformations or actions...
>
> I answered a question on the forum assuming unionAll is a blocking call
> and said execution of multiple load and df.unionAll in different threads
> would benefit performance :)
>
> Kiran
> On 08-Jun-2015 4:37 pm, "Cheng Lian" <lian.cs.zju@gmail.com> wrote:
>
>>  For DataFrame, there are also transformations and actions. And
>> transformations are also lazily evaluated. However, DataFrame
>> transformations like filter(), select(), agg() return a DataFrame rather
>> than an RDD. Other methods like show() and collect() are actions.
>>
>> Cheng
>>
>> On 6/8/15 1:33 PM, kiran lonikar wrote:
>>
>> Thanks for replying twice :) I think I sent this question by email and
>> somehow thought I did not send it, hence created the other one on the web
>> interface. Let's retain this thread since you have provided more details
>> here.
>>
>> Great, it confirms my intuition about DataFrame. It's similar to Shark's
>> columnar layout, with the addition of compression. There it used Java NIO's
>> ByteBuffer to hold the actual data. I will go through the code you pointed to.
>>
>> I have another question about DataFrame: the RDD operations are divided
>> into two groups: *transformations*, which are lazily evaluated and return
>> a new RDD, and *actions*, which evaluate the lineage defined by the
>> transformations and return results. What about DataFrame operations like
>> join, groupBy, agg, unionAll, etc., which are all transformations in RDD?
>> Are they lazily evaluated or immediately executed?
>>
>>
>>
>>
>
