spark-user mailing list archives

Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 03:00:35 GMT
Abdeali Kothari wrote
> My entire CSV is less than 20KB.
> But somewhere in between, I do a broadcast join with 3500 records in
> another file.
> After the broadcast join I have a lot of processing to do. Overall, the
> time to process a single record goes up to 5mins on 1 executor.
> I'm trying to increase the number of partitions my data is in so that I
> have at most 1 record per executor (currently it sets 2 tasks, and hence 2
> executors... I want it to split into at least 100 tasks at a time so I
> get 5 records per task => ~20min per task)

Maybe you can try calling repartition(100) after the broadcast join. The
number of tasks should then change to 100 for your later transformations.
