spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Krishna Pisupat <>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 20:35:08 GMT
You need to implement Partitioner class

And let the RDD know to use this custom partitioner by invoking
Rdd.partitionBy method. However, this is applicable to only key value type

, mapSideCombine: Boolean = false):

Return a copy of the RDD partitioned using the specified partitioner. If
mapSideCombine is true, Spark will group values of the same key together on
the map side before the repartitioning, to only send each key over the
network once. If a large number of duplicated keys are expected, and the
size of the keys are large, mapSideCombine should be set to true.

On Wed, Sep 11, 2013 at 9:03 AM, Xiang Huo <> wrote:

> Hi Krishna,
> Thank for your reply, I think It is helpful. I am reading the source code
> of Partitioner.scala. But I am still not clear about how to use partition
> and partitioner on a RDD. Is it  possible that I can find some simple
> example about how to call them?
> Thanks.
> Xiang
> 2013/9/11 Krishna Pisupat <>
>> BTW, when I said about using one partition per filter, I meant to
>> override the default Partititioner with a custom one so that the elements
>> are partitioned that all elements that satisfy a filter are made part of
>> the same partition. The filter logic has to go into the custom partitioner.
>> On Wed, Sep 11, 2013 at 1:15 AM, Krishna Pisupat <
>>> wrote:
>>> I think there is no direct way. Did you look at using partitions to
>>> achieve it? All the elements that satisfies a filter would belong to a
>>> partition. Look at PartitionPruningRDD.
>>> May be it could help you achieve what you are trying to do.
>>> On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <>wrote:
>>>> Hi,
>>>> I am try to get some sub dataset from one large dataset. I know one
>>>> method is that i can run val small = big.filter(...) and then save this RDD
>>>> as textFile for n times, where n is the number of sub dataset I want. But
>>>> wonder this there any way that I can traverse one time for the large
>>>> dataset? Because in my case the large dataset is more than several TB and
>>>> each record in it can only be classified in one sub dataset.
>>>> Any help is appreciated.
>>>> Thanks
>>>> Xiang
>>>> --
>>>> Xiang Huo
>>>> Department of Computer Science
>>>> University of Illinois at Chicago(UIC)
>>>> Chicago, Illinois
>>>> US
>>>> Email:
>>>>            or
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
> Email:
>            or

View raw message