spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Czech <alexander.cz...@googlemail.com.INVALID>
Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 10:20:03 GMT
You could try to force a repartion right at that point by producing a
cached version of the DF with .cache() if memory allows it.

On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari <abdealikothari@gmail.com>
wrote:

> I've tried that too - it doesn't work. It does a repetition, but not right
> after the broadcast join - it does a lot more processing and does the
> repetition right before I do my next sortmerge join (stage 12 I described
> above)
> As the heavy processing is before the sort merge join, it still doesn't
> help
>
> On Sun, Jul 1, 2018, 08:30 yujhe.li <liyujhe@gmail.com> wrote:
>
>> Abdeali Kothari wrote
>> > My entire CSV is less than 20KB.
>> > By somewhere in between, I do a broadcast join with 3500 records in
>> > another
>> > file.
>> > After the broadcast join I have a lot of processing to do. Overall, the
>> > time to process a single record goes up-to 5mins on 1 executor
>> >
>> > I'm trying to increase the partitions that my data is in so that I have
>> at
>> > maximum 1 record per executor (currently it sets 2 tasks, and hence 2
>> > executors... I want it to split it into at least 100 tasks at a time so
>> I
>> > get 5 records per task => ~20min per task)
>>
>> Maybe you can try repartition(100) after broadcast join, the task number
>> should change to 100 for your later transformation.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>

Mime
View raw message