spark-user mailing list archives

From Maciej Bryński <mac...@brynski.pl>
Subject Re: Difference between Data set and Data Frame in Spark 2
Date Thu, 01 Sep 2016 18:11:01 GMT
I think there could be a performance reason:
RDDs can be faster than Datasets.

For example, check the query plan for this code:
spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()

The plan contains two serialize / deserialize pairs.

Then compare it with the RDD equivalent:
sc.parallelize(1 to 100).map(_ * 2).filter(_ < 100).map(_ * 2).collect()
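
A minimal sketch of how the two pipelines could be inspected side by side, assuming a local SparkSession. The object and method names (PlanComparison, datasetResult, rddResult) are hypothetical; only the two pipelines come from the lines above. Note that spark.range(100) produces 0 to 99 while 1 to 100 produces 1 to 100, so the collected results differ slightly; the point here is the plans, not the values.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the two pipelines are taken from the message.
object PlanComparison {

  // Dataset version: print the physical plan, then collect the result.
  def datasetResult(spark: SparkSession): Array[Long] = {
    import spark.implicits._ // supplies the Encoder[Long] needed by map
    val ds = spark.range(100).map(_ * 2).filter(_ < 100).map(_ * 2)
    ds.explain() // prints the physical plan to stdout
    ds.collect()
  }

  // RDD version: print the lineage, then collect the result.
  def rddResult(spark: SparkSession): Array[Int] = {
    val rdd = spark.sparkContext
      .parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ < 100)
      .map(_ * 2)
    println(rdd.toDebugString) // RDD lineage; there is no query plan for RDDs
    rdd.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode is enough for looking at plans.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-comparison")
      .getOrCreate()
    try {
      datasetResult(spark)
      rddResult(spark)
    } finally {
      spark.stop()
    }
  }
}
```

In the Dataset plan, look for operators such as DeserializeToObject and SerializeFromObject around each typed map; the exact plan depends on the Spark version. The RDD lineage has no such steps because the lambdas run directly on JVM objects.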

Regards,
M


2016-09-01 18:15 GMT+02:00 Sean Owen <sowen@cloudera.com>:

> On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh
> <mich.talebzadeh@gmail.com> wrote:
> > Data Frame is built on top of RDD to create the tabular format that we all
> > love, to make the original build easily usable (say SQL-like queries, column
> > headings etc). The drawback is that it restricts what you can do with a
> > Data Frame (now that you have done RDD.toDF).
>
> DataFrame is a Dataset[Row], literally, rather than based on an RDD.
>
> > DataSet is the new RDD with improvements on RDD. As I understand from
> > Sean's explanation they add some optimisation on top of the common RDD.
>
> At the moment I don't think there's any particular reason to use RDDs
> except to interoperate with code that uses RDDs -- which is entirely
> valid. I believe new code would generally touch only Dataset and
> DataFrame otherwise. So I don't think there are really 3 elemental
> concepts in play as of Spark 2.x.


-- 
Maciek Bryński
