spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: repartition vs partitionby
Date Sat, 17 Oct 2015 20:25:33 GMT
If the dataset allows it, you can try writing a custom partitioner to help Spark distribute
the data more uniformly.
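A minimal sketch of such a partitioner, assuming string keys and that a few known "hot" keys cause the skew (the key names, the bucket count, and the `HOT_KEY_INDEX` table are illustrative, not from the thread). PySpark's `partitionBy(numPartitions, partitionFunc)` accepts any callable mapping a key to an int; it must be deterministic so equal keys always land in the same partition (on a real cluster, PySpark also requires `PYTHONHASHSEED` to be set consistently across executors if you rely on `hash`).

```python
NUM_PARTITIONS = 200

# Hypothetical hot keys, e.g. discovered beforehand via rdd.countByKey().
HOT_KEY_INDEX = {"hot_key_a": 0, "hot_key_b": 1}

def skew_aware_partitioner(key):
    """Give each hot key its own dedicated partition; hash everything else."""
    if key in HOT_KEY_INDEX:
        return HOT_KEY_INDEX[key]
    # Remaining keys share the other partitions. Python's % is non-negative
    # for a positive modulus, so the result is a valid partition index.
    n_reserved = len(HOT_KEY_INDEX)
    return n_reserved + hash(key) % (NUM_PARTITIONS - n_reserved)

# Usage inside a Spark job (sketch):
#   balanced = pair_rdd.partitionBy(NUM_PARTITIONS, skew_aware_partitioner)
```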

Sent from my iPhone

On 17 Oct 2015, at 16:14, shahid ashraf <shahid@trialx.com> wrote:

Yes, I know about that; it's for reducing the number of partitions. The point here is that
the data is skewed toward a few partitions.
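When the skew comes from a few hot keys during an aggregation, a common workaround is key salting: add a random suffix so a hot key is spread over several reducers, aggregate, then strip the suffix and aggregate again. The sketch below simulates the two-stage `reduceByKey` with plain dicts so it runs standalone; the bucket count and key names are illustrative.

```python
import random

SALT_BUCKETS = 8  # hypothetical fan-out per key

def salt(key):
    """Prefix the key with a random bucket id, e.g. '3#hot'."""
    return f"{random.randrange(SALT_BUCKETS)}#{key}"

def unsalt(salted_key):
    """Strip the bucket prefix, recovering the original key."""
    return salted_key.split("#", 1)[1]

# In Spark this would be (sketch):
#   partial = rdd.map(lambda kv: (salt(kv[0]), kv[1])).reduceByKey(add)
#   result  = partial.map(lambda kv: (unsalt(kv[0]), kv[1])).reduceByKey(add)

# Standalone simulation of the two aggregation stages:
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 3

stage1 = {}  # salted keys -> partial sums (skew spread over buckets)
for k, v in ((salt(k), v) for k, v in pairs):
    stage1[k] = stage1.get(k, 0) + v

final = {}  # original keys -> totals
for k, v in stage1.items():
    key = unsalt(k)
    final[key] = final.get(key, 0) + v
```

Salting only helps aggregations that are associative (like sums or counts); it does not help a plain `partitionBy` where each key must stay whole.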


On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <raghavendra.pandey@gmail.com> wrote:
You can use the coalesce function if you want to reduce the number of partitions; it minimizes
the data shuffle.
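The difference is that `coalesce(n)` builds each new partition from whole existing partitions (no per-record shuffle), while `repartition(n)` redistributes every record by hash. A pure-Python sketch of the idea, with partitions modeled as lists of lists (this is an illustration, not Spark's implementation, which also groups partitions by locality rather than round-robin):

```python
def coalesce(partitions, n):
    """Merge whole input partitions into n output partitions (no shuffle)."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # each old partition stays intact
    return merged

def repartition(partitions, n):
    """Redistribute every record individually (a full shuffle)."""
    shuffled = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            shuffled[hash(record) % n].append(record)
    return shuffled

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))  # → [[1, 2, 4, 5], [3, 6]]
```

Note that because coalesce only unions existing partitions, it cannot fix skew: a single oversized partition stays oversized (or gets bigger).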

-Raghav

On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri <shahidashraff@icloud.com> wrote:
Hi folks,

I need to repartition a large dataset (around 300 GB), as I see some portions have much more
data than others (data skew).

I have pair RDDs: [({},{}),({},{}),({},{})]

What is the best way to solve this problem?
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org





--
with Regards
Shahid Ashraf