spark-user mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: [Spark R]: dapply only works for very small datasets
Date Mon, 27 Nov 2017 19:20:26 GMT
How many executors and/or partitions are you working with?

I'm afraid most of the problem is the serialization/deserialization overhead between the
JVM and R...
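
As a rough illustration of the partitioning point, a minimal SparkR sketch of raising the
partition count before dapply (assuming a SparkDataFrame named `df` and the same schema as
in the quoted message below; the figure of 200 partitions is an arbitrary placeholder, not
a recommendation):

    # Each partition is shipped to a separate R worker process, so the partition
    # count bounds the parallelism of dapply and affects the JVM <-> R transfer cost.
    df <- repartition(df, numPartitions = 200)
    df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)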

________________________________
From: Kunft, Andreas <andreas.kunft@tu-berlin.de>
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user-defined functions with R using the airline arrival performance
dataset.

While the examples from the documentation for the `<-` (column assignment) operator work
perfectly fine on ~9 GB of data, the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
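
For context, the documentation example this appears to be modeled on looks roughly like the
following (using the built-in `faithful` dataset, which is where the `waiting` column and
the output schema come from):

    df <- createDataFrame(faithful)
    # The output schema must describe every column returned by the function,
    # i.e. the original columns plus the new one appended by cbind().
    schema <- structType(structField("eruptions", "double"),
                         structField("waiting", "double"),
                         structField("waiting_secs", "double"))
    df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
    head(collect(df1))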

I checked Stack Overflow and even asked the question there as well, but so far the only
answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0, read the data from HDFS 2.7.1, and played with several degrees of
parallelism (DOPs).
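
For concreteness, a sketch of one way the parallelism could be varied when starting the
SparkR session; the master, config values, and HDFS path here are placeholders, not the
settings actually used:

    library(SparkR)
    # Start the session with explicit parallelism-related settings (values are assumptions).
    sparkR.session(
      master = "yarn",
      sparkConfig = list(
        spark.executor.instances = "8",         # number of executors (placeholder)
        spark.sql.shuffle.partitions = "200"    # partitions after shuffles (placeholder)
      )
    )
    # Hypothetical path; the airline data is assumed to be CSV on HDFS.
    df <- read.df("hdfs:///data/airline", source = "csv", header = "true", inferSchema = "true")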

Best
Andreas

