spark-user mailing list archives

From gen tang <gen.tan...@gmail.com>
Subject Re: large scheduler delay in pyspark
Date Wed, 05 Aug 2015 13:54:34 GMT
Hi,
Thanks a lot for your reply.


It seems the problem was caused by the slowness of the second piece of code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])).
The program now runs normally.
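
A minimal sketch of one way to avoid the O(N) list lookup when merging lists of dicts, assuming every dict is flat and its values are hashable (for example strings). The names dict_key and merge_unique are hypothetical and do not come from this thread. Unlike the original f, this version returns a new list instead of mutating a, and it also drops duplicates that already exist inside a.

```python
def dict_key(d):
    # Hashable stand-in for a dict. Assumption: all values in d are hashable.
    return frozenset(d.items())


def merge_unique(a, b):
    # Merge two lists of dicts, keeping the first occurrence of each dict.
    seen = set()
    merged = []
    for d in a + b:
        k = dict_key(d)
        if k not in seen:  # average O(1) set lookup instead of an O(N) list scan
            seen.add(k)
            merged.append(d)
    return merged


# Possible use with the RDD described in this thread:
#   rdd.reduceByKey(merge_unique)
```

Because the set lookup is constant time on average, each merge is roughly linear in the combined list length rather than quadratic.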

By the way, I found that while the computation is running, the UI reports
the time as scheduler delay, although it is not really scheduler delay. When
the computation finishes, the UI shows the correct scheduler delay time.

Cheers
Gen


On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu <davies@databricks.com> wrote:

> On Mon, Aug 3, 2015 at 9:00 AM, gen tang <gen.tang86@gmail.com> wrote:
> > Hi,
> >
> > Recently, I ran into a problem with scheduler delay in pyspark. I worked
> > on it for several days without success, so I am coming here to ask for
> > help.
> >
> > I have a key-value pair RDD of the form rdd[(key, list[dict])], and I
> > tried to merge the values by "adding" two lists.
> >
> > If I do reduceByKey as follows:
> >    rdd.reduceByKey(lambda a, b: a+b)
> > it works fine, and the scheduler delay is less than 10s. However, if I do
> > reduceByKey as follows:
> >    def f(a, b):
> >        for i in b:
> >            if i not in a:
> >                a.append(i)
> >        return a
> >    rdd.reduceByKey(f)
>
> Is it possible that you have a large object that is also named `i`, `a`,
> or `b`?
>
> Btw, the second one could be slower than the first one, because it looks
> up an object in a list, which is O(N). This is especially costly when the
> objects are large (dicts).
>
> > it causes a very large scheduler delay, about 15-20 minutes. (The data I
> > deal with is about 300 MB, and I use 5 machines with 32GB memory.)
>
> If you see scheduler delay, it means there may be a large broadcast
> involved.
>
> > I know the second code is not the same as the first. In fact, my goal is
> > to implement the second, but it does not work, so I tried the first one.
> > I don't know whether this is related to the data (with long strings) or
> > to Spark on Yarn. But the first code works fine on the same data.
> >
> > Is there any way to find the logs for when Spark stalls in scheduler
> > delay, please? Or any ideas about this problem?
> >
> > Thanks a lot in advance for your help.
> >
> > Cheers
> > Gen
> >
> >
>
