spark-user mailing list archives

From "yujhe.li" <liyu...@gmail.com>
Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 03:00:35 GMT
Abdeali Kothari wrote
> My entire CSV is less than 20KB.
> But somewhere in between, I do a broadcast join with 3500 records in
> another file.
> After the broadcast join I have a lot of processing to do. Overall, the
> time to process a single record goes up to 5 minutes on 1 executor.
> 
> I'm trying to increase the number of partitions my data is in so that I
> have at most 1 record per executor (currently it creates 2 tasks, and hence
> 2 executors... I want it to split into at least 100 tasks at a time so I
> get 5 records per task => ~20 min per task)

Maybe you can try repartition(100) after the broadcast join; the number of
tasks should then change to 100 for your later transformations.
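For illustration, a rough sketch of that flow in the Scala DataFrame API (the
DataFrame names, file paths, and the join column "id" below are made up, not
taken from your job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()

// Hypothetical inputs: the small CSV you process and the 3500-record file.
val bigDf  = spark.read.option("header", "true").csv("/path/to/main.csv")
val lookup = spark.read.option("header", "true").csv("/path/to/lookup.csv")

// A broadcast join does not shuffle the big side, so the result keeps its
// original (small) number of partitions.
val joined = bigDf.join(broadcast(lookup), Seq("id"))

// Forcing a shuffle into 100 partitions makes the expensive downstream
// processing run as 100 tasks instead of 2.
val repartitioned = joined.repartition(100)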




