spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <ko...@tresata.com>
Subject Re: Difference between dataset and dataframe
Date Tue, 19 Feb 2019 04:47:25 GMT
in the api DataFrame is just Dataset[Row]. so this makes you think Dataset
is the generic api. interestingly enough under the hood everything is
really Dataset[Row], so DataFrame is really the "native" language for spark
sql, not Dataset.

i find DataFrame to be significantly more performant. in general if you use
Dataset you miss out on some optimizations. also Encoders are not very
pleasant to work with.

On Mon, Feb 18, 2019 at 9:09 PM Akhilanand <akhilanand.bv@gmail.com> wrote:

>
> Hello,
>
> I have been recently exploring about dataset and dataframes. I would
> really appreciate if someone could answer these questions:
>
> 1) Is there any difference in terms performance when we use datasets over
> dataframes? Is it significant to choose 1 over other. I do realise there
> would be some overhead due case classes but how significant is that? Are
> there any other implications.
>
> 2) Is the Tungsten code generation done only for datasets or is there any
> internal process to generate bytecode for dataframes as well? Since its
> related to jvm , I think its just for datasets but I couldn’t find anything
> that tells it specifically. If its just for datasets , does that mean we
> miss out on the project tungsten optimisation for dataframes?
>
>
>
> Regards,
> Akhilanand BV
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Mime
View raw message