spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 11:09:12 GMT
I prefer not to do a .cache() due to memory limits, but I did try a
persist() with DISK_ONLY.

I did the repartition(), followed by a .count(), followed by a persist()
with DISK_ONLY.
That didn't change the number of tasks either.



On Sun, Jul 1, 2018, 15:50 Alexander Czech <alexander.czech@googlemail.com>
wrote:

> You could try to force a repartition right at that point by producing a
> cached version of the DF with .cache(), if memory allows it.
>
> On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari <abdealikothari@gmail.com>
> wrote:
>
>> I've tried that too - it doesn't work. It does a repartition, but not
>> right after the broadcast join - it does a lot more processing and does
>> the repartition right before my next sort-merge join (stage 12 I
>> described above).
>> As the heavy processing happens before the sort-merge join, it still
>> doesn't help.
>>
>> On Sun, Jul 1, 2018, 08:30 yujhe.li <liyujhe@gmail.com> wrote:
>>
>>> Abdeali Kothari wrote
>>> > My entire CSV is less than 20KB.
>>> > Somewhere in between, I do a broadcast join with 3500 records in
>>> > another file.
>>> > After the broadcast join I have a lot of processing to do. Overall,
>>> > the time to process a single record goes up to 5 mins on 1 executor.
>>> >
>>> > I'm trying to increase the number of partitions my data is in so that
>>> > I have at most 1 record per executor (currently it creates 2 tasks,
>>> > and hence 2 executors... I want it to split into at least 100 tasks
>>> > at a time so I get 5 records per task => ~20 min per task)
>>>
>>> Maybe you can try repartition(100) after the broadcast join; the number
>>> of tasks should change to 100 for your later transformations.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>
