spark-user mailing list archives

From dueckm <>
Subject Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?
Date Wed, 03 Aug 2016 06:06:18 GMT

First of all, excuse me for sending this post more than once. I am new
to this mailing list and did not complete my subscription, so I suspect my
previous postings will not be accepted.

I built a prototype that uses join and groupBy operations via the Spark RDD
API. Recently I migrated it to the Dataset API. Now it runs much slower
than the original RDD implementation. Did I do something wrong here?
Or is this the price I have to pay for the more convenient API?
Is there a known solution to deal with this effect (e.g. configuration via
"spark.sql.shuffle.partitions", but how could I determine the correct
value)?
In my prototype I use Java Beans with a lot of attributes. Does this slow
down Spark operations with Datasets?

Here is a simple example that shows the difference:
- I build 2 RDDs, then join and group them. Afterwards I count and display
the joined RDDs (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
- When I do the same with Datasets, it takes approximately 40 times
as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
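
For readers without the attachment, the two variants might look roughly
like the sketch below. It is not the original test class: the beans
Person and Order, their fields, and the class name JoinGroupBySketch are
hypothetical stand-ins (the real beans have far more attributes), and only
the two method names are taken from the description above. The value set
for "spark.sql.shuffle.partitions" is an arbitrary illustration, not a
recommendation; the default for that setting is 200.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class JoinGroupBySketch {

    /** Hypothetical stand-in for a wide bean; the real beans have many more attributes. */
    public static class Person implements Serializable {
        private int id;
        private String name;
        public Person() { }
        public Person(int id, String name) { this.id = id; this.name = name; }
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    /** Hypothetical second bean, joined to Person via personId. */
    public static class Order implements Serializable {
        private int personId;
        private double amount;
        public Order() { }
        public Order(int personId, double amount) { this.personId = personId; this.amount = amount; }
        public int getPersonId() { return personId; }
        public void setPersonId(int personId) { this.personId = personId; }
        public double getAmount() { return amount; }
        public void setAmount(double amount) { this.amount = amount; }
    }

    /** RDD variant: key both sides by id, join, group by key, count the groups. */
    static long joinAndGroupViaRDD(JavaSparkContext jsc, List<Person> persons, List<Order> orders) {
        JavaPairRDD<Integer, Person> byId =
            jsc.parallelize(persons).mapToPair(p -> new Tuple2<>(p.getId(), p));
        JavaPairRDD<Integer, Order> byPersonId =
            jsc.parallelize(orders).mapToPair(o -> new Tuple2<>(o.getPersonId(), o));
        return byId.join(byPersonId).groupByKey().count();
    }

    /** Dataset variant: bean encoders, column-based join, groupBy on the join key. */
    static long joinAndGroupViaDatasets(SparkSession spark, List<Person> persons, List<Order> orders) {
        Dataset<Person> p = spark.createDataset(persons, Encoders.bean(Person.class));
        Dataset<Order> o = spark.createDataset(orders, Encoders.bean(Order.class));
        Dataset<Row> joined = p.join(o, p.col("id").equalTo(o.col("personId")));
        return joined.groupBy("id").count().count();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("JoinGroupBySketch")
            // Arbitrary illustrative value, not a recommendation.
            .config("spark.sql.shuffle.partitions", "8")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        List<Person> persons = Arrays.asList(new Person(1, "a"), new Person(2, "b"), new Person(3, "c"));
        List<Order> orders = Arrays.asList(new Order(1, 10.0), new Order(1, 5.0), new Order(2, 7.5));

        // Time each variant on the same tiny input.
        long t0 = System.nanoTime();
        long rddGroups = joinAndGroupViaRDD(jsc, persons, orders);
        long t1 = System.nanoTime();
        long dsGroups = joinAndGroupViaDatasets(spark, persons, orders);
        long t2 = System.nanoTime();

        System.out.println("RDD groups: " + rddGroups + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Dataset groups: " + dsGroups + " in " + (t2 - t1) / 1_000_000 + " ms");
        spark.stop();
    }
}
```

The sketch assumes Spark 2.x (spark-sql and its dependencies) on the
classpath and runs in local mode. Both methods return the number of
distinct join keys, so the two results can be compared directly before
looking at the timings.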

Thank you very much for your help.

PS: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
RDD implementation, jobs 2/3 to Dataset):

