spark-user mailing list archives

From "Kunft, Andreas" <>
Subject Re: [Spark R]: dapply only works for very small datasets
Date Tue, 28 Nov 2017 11:11:16 GMT
Thanks for the fast reply.

I tried it locally with 1 - 8 slots on an 8-core machine with 25 GB of memory, as well as on 4 nodes
with the same specifications.

When I shrink the data to around 100 MB, it runs in about 1 hour with 1 core and about 6 minutes with 8 cores.

I'm aware that the SerDe takes time, but considering these numbers, it seems there must be something else off.

From: Felix Cheung <>
Sent: Monday, November 27, 2017 20:20
To: Kunft, Andreas;
Subject: Re: [Spark R]: dapply only works for very small datasets

What's the number of executors and/or the number of partitions you are working with?

I'm afraid most of the problem is the serialization/deserialization overhead between
the JVM and R...
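Since `dapply` ships each partition through that JVM-to-R SerDe boundary, the partition count directly controls how much of that work can run in parallel. A minimal sketch of repartitioning before `dapply` (the path, column name, and partition count are hypothetical, not from the thread):

```r
library(SparkR)
sparkR.session(master = "local[8]")  # assumption: 8 local slots, as in the thread

# Hypothetical input; any SparkDataFrame with a numeric 'waiting' column works.
df <- read.df("hdfs:///data/airline", source = "parquet")

# Each partition is serialized to its own R worker process, so with too few
# partitions most cores sit idle while one R process churns through the data.
df <- repartition(df, 64L)

schema <- structType(structField("waiting_secs", "double"))
result <- dapply(df, function(x) { data.frame(x$waiting * 60) }, schema)
```

More partitions spread the SerDe cost across cores, but they do not remove it: every row still crosses the JVM/R boundary twice.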

From: Kunft, Andreas <>
Sent: Monday, November 27, 2017 10:27:33 AM
Subject: [Spark R]: dapply only works for very small datasets


I tried to execute some user-defined functions in R on the airline arrival performance dataset.

While the examples from the documentation using the `<-` column-assignment operator work perfectly fine
on ~9 GB of data, the `dapply` operator fails to finish even after ~4 hours.

I'm using a function similar to the one from the documentation:

df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stack Overflow and even asked the question there as well, but so far the only answer
I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0, read the data from HDFS 2.7.1, and experimented with several degrees of parallelism.
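For a transformation this simple, the "avoid dapply" advice can be followed literally: SparkR column arithmetic stays entirely in the JVM and skips the R workers. A sketch of the same computation without `dapply` (the column name `waiting` follows the example above; the data here is hypothetical):

```r
library(SparkR)
sparkR.session()

# Hypothetical stand-in for the real dataset.
df <- createDataFrame(data.frame(waiting = c(10, 20, 30)))

# Column expressions are translated to JVM operations -- no R worker
# processes and no JVM<->R serialization round trip.
df$waiting_secs <- df$waiting * 60
head(df)
```

`dapply` is only worth its SerDe cost when the per-partition logic genuinely needs arbitrary R code (e.g. an R-only modeling package).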

