spark-user mailing list archives

From Krishna Pisupat <>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 09:03:00 GMT
BTW, when I mentioned using one partition per filter, I meant overriding the
default Partitioner with a custom one, so that all elements satisfying a
given filter end up in the same partition. The filter logic has to go into
the custom partitioner.
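A minimal sketch of what such a custom partitioner might look like. The name FilterPartitioner and the predicate list are illustrative, not from the thread; the idea is simply that getPartition routes each key to the index of the first filter it satisfies, with a catch-all partition for records matching none:

```scala
import org.apache.spark.Partitioner

// Illustrative sketch: `predicates` is an ordered list of filter functions;
// a key matching predicates(i) is routed to partition i. Keys matching no
// predicate fall into an extra catch-all partition at the end.
class FilterPartitioner[T](predicates: Seq[T => Boolean]) extends Partitioner {
  override def numPartitions: Int = predicates.length + 1 // +1 for catch-all

  override def getPartition(key: Any): Int = {
    val idx = predicates.indexWhere(p => p(key.asInstanceOf[T]))
    if (idx >= 0) idx else predicates.length
  }
}
```

Since partitionBy works on key-value RDDs, you would typically key each record by itself (or by whatever field the filters inspect) before applying it, e.g. big.map(x => (x, x)).partitionBy(new FilterPartitioner(filters)). Note this assumes each record matches at most one filter, which holds in the original poster's case.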

On Wed, Sep 11, 2013 at 1:15 AM, Krishna Pisupat wrote:

> I think there is no direct way. Did you look at using partitions to
> achieve this? All the elements that satisfy a filter would belong to one
> partition. Look at PartitionPruningRDD.
> Maybe it could help you achieve what you are trying to do.
> On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <> wrote:
>> Hi,
>> I am trying to extract several sub-datasets from one large dataset. I
>> know one approach is to run val small = big.filter(...) and then save the
>> resulting RDD as a text file, repeated n times, where n is the number of
>> sub-datasets I want. But I wonder whether there is any way to traverse the
>> large dataset only once, because in my case it is more than several TB,
>> and each record belongs to exactly one sub-dataset.
>> Any help is appreciated.
>> Thanks
>> Xiang
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago(UIC)
>> Chicago, Illinois
>> US
>> Email:
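Putting the two suggestions together, the single-pass approach might be sketched as follows: key each record by its class in one pass, co-locate each class in its own partition, then use PartitionPruningRDD to expose one partition as its own RDD without rescanning the rest. The classifier function and the three-way split below are hypothetical stand-ins for the poster's real filters:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.PartitionPruningRDD

object SplitOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "split-demo")

    // Hypothetical classifier: each record belongs to exactly one of 3 classes.
    val classify: Int => Int = x => math.abs(x) % 3

    val big = sc.parallelize(1 to 100)

    // One pass over the data: key by class, then co-locate each class in its
    // own partition (HashPartitioner sends key i to partition i here, since
    // the keys are already 0..2).
    val partitioned = big.map(x => (classify(x), x))
                         .partitionBy(new HashPartitioner(3))

    // Extract one sub-dataset without touching the other partitions.
    val class0 = PartitionPruningRDD.create(partitioned, i => i == 0)
    println(class0.count()) // 33 records: multiples of 3 in 1..100

    sc.stop()
  }
}
```

The key point is that partitionBy does the single expensive shuffle once; each PartitionPruningRDD afterwards only reads its own partition, so saving n sub-datasets no longer costs n full scans of a multi-TB input.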
