spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 03:04:45 GMT
I've tried that too - it doesn't work. It does a repartition, but not right
after the broadcast join - Spark does a lot more processing first and only
repartitions right before my next sort-merge join (stage 12 I described
above).
As the heavy processing happens before the sort-merge join, it still doesn't help.
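For reference, the parallelism arithmetic behind the goal above (at least 100 tasks, ~5 records per task) can be sketched in plain Python. This is a hypothetical illustration, not Spark code: the record count of 500 is an assumption inferred from "5 records per task" at 100 tasks, and the 5-minute figure is the worst-case per-record time quoted in this thread.

```python
# Round-robin split of records into partitions, mimicking how
# DataFrame.repartition(n) balances rows for task-sizing purposes.
def partition_counts(num_records, num_partitions):
    base, extra = divmod(num_records, num_partitions)
    # The first `extra` partitions get one extra record each.
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]

minutes_per_record = 5   # worst-case processing time per record (from the thread)
records = 500            # assumed row count: ~5 per task at 100 tasks

sizes = partition_counts(records, 100)
# Largest task and its worst-case runtime in minutes.
print(max(sizes), max(sizes) * minutes_per_record)  # → 5 25
```

With only 2 tasks, the same data would mean ~250 records per task, which is why increasing the partition count before the heavy stage matters so much here.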

On Sun, Jul 1, 2018, 08:30 yujhe.li <liyujhe@gmail.com> wrote:

> Abdeali Kothari wrote
> > My entire CSV is less than 20KB.
> > Somewhere in between, I do a broadcast join with 3500 records in
> > another file.
> > After the broadcast join I have a lot of processing to do. Overall, the
> > time to process a single record goes up to 5 min on 1 executor.
> >
> > I'm trying to increase the number of partitions my data is in so that
> > I have at most 1 record per executor (currently it creates 2 tasks, and
> > hence 2 executors... I want it to split into at least 100 tasks at a
> > time so I get 5 records per task => ~20 min per task)
>
> Maybe you can try repartition(100) after the broadcast join; the task
> count should then change to 100 for your later transformations.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
