spark-user mailing list archives

From Davies Liu <>
Subject Re: large scheduler delay in pyspark
Date Tue, 04 Aug 2015 07:13:59 GMT
On Mon, Aug 3, 2015 at 9:00 AM, gen tang <> wrote:
> Hi,
> Recently, I ran into a problem with scheduler delay in pyspark. I have
> worked on it for several days without success, so I am asking here for
> help.
> I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to
> merge values by "adding" the two lists.
> if I do reduceByKey as follows:
>    rdd.reduceByKey(lambda a, b: a+b)
> It works fine, scheduler delay is less than 10s. However if I do
> reduceByKey:
>    def f(a, b):
>        for i in b:
>             if i not in a:
>                a.append(i)
>        return a
>   rdd.reduceByKey(f)

Is it possible that you have a large object that is also named `i`, `a`, or `b`?

Btw, the second one could be slower than the first, because it looks up
an object in a list, which is O(N) per lookup, and is especially slow when
the objects are large (dicts).
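A minimal sketch of that cost difference (plain Python, no Spark needed): deduplicating with list membership rescans the whole list for every element, while hashing each dict into an immutable key makes the membership check an O(1) set lookup. The `merge_lists_fast` helper and the frozenset-of-items encoding are illustrative assumptions, not from this thread, and assume the dict values are themselves hashable (e.g. strings or numbers).

```python
def merge_lists_slow(a, b):
    # The approach from the question: `i not in a` scans the whole
    # list each time, so merging is O(len(a) * len(b)) overall.
    for i in b:
        if i not in a:
            a.append(i)
    return a

def merge_lists_fast(a, b):
    # Hypothetical alternative: encode each dict as a hashable
    # frozenset of its items, so membership checks become O(1)
    # set lookups instead of O(N) list scans.
    seen = {frozenset(d.items()) for d in a}
    for d in b:
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            a.append(d)
    return a

a = [{"id": 1}, {"id": 2}]
b = [{"id": 2}, {"id": 3}]
merged = merge_lists_fast(a, b)
# merged == [{"id": 1}, {"id": 2}, {"id": 3}]
```

Either function could then be passed to `rdd.reduceByKey(...)`; the fast variant trades extra hashing work per element for avoiding the repeated list scans.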

> It will cause very large scheduler delay, about 15-20 mins. (The data I
> deal with is about 300 MB, and I use 5 machines with 32GB memory.)

If you see scheduler delay, it means there may be a large broadcast involved.

> I know the second code is not the same as the first. In fact, my purpose
> is to implement the second, but it does not work, so I tried the first one.
> I don't know whether this is related to the data (with long strings) or to
> Spark on Yarn, but the first code works fine on the same data.
> Is there any way to find the logs from when Spark stalls with scheduler
> delay, please? Or any ideas about this problem?
> Thanks a lot in advance for your help.
> Cheers
> Gen
