spark-user mailing list archives

From Krishna Pisupat <>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 08:15:19 GMT
I don't think there is a direct way. Have you looked at using partitions to
achieve it? If you partition the data so that all the elements that satisfy a
given filter belong to the same partition, each partition becomes one of your
sub-datasets. Take a look at PartitionPruningRDD; it may help you achieve what
you are trying to do.
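
Here is a rough sketch of the idea, not a tested recipe. It assumes the
org.apache.spark package layout and the SparkConf-based setup of later Spark
releases. ClassPartitioner, classify, the sample data, and the output paths are
names I made up for illustration; your real classifier goes in classify. Note
that partitionBy shuffles the data, so the input is read only once, but the
shuffle itself has a cost.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.PartitionPruningRDD

// Hypothetical partitioner: one partition per sub-dataset.
// The key is assumed to be the sub-dataset index in [0, numClasses).
class ClassPartitioner(val numClasses: Int) extends Partitioner {
  override def numPartitions: Int = numClasses
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

object SplitByPartition {
  // Hypothetical classifier: maps each record to exactly one sub-dataset.
  def classify(record: String, n: Int): Int = record.length % n

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val n = 3

    // Stand-in for the multi-TB dataset.
    val big = sc.parallelize(Seq("a", "bb", "ccc", "dddd", "eeeee"))

    // Key each record by its class, then place each class in its own partition.
    val byClass = big
      .map(r => (classify(r, n), r))
      .partitionBy(new ClassPartitioner(n))

    // Each pruned RDD only touches the single partition it keeps.
    for (i <- 0 until n) {
      val small = PartitionPruningRDD
        .create(byClass, (idx: Int) => idx == i)
        .map(_._2)
      small.saveAsTextFile(s"out/class-$i") // hypothetical output path
    }

    sc.stop()
  }
}
```

Each of the n saves then reads only its own partition of the shuffled data
instead of filtering the whole input again.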

On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <> wrote:

> Hi,
> I am trying to get several sub-datasets from one large dataset. I know one
> method: I can run val small = big.filter(...) and then save that RDD as a
> text file, n times, where n is the number of sub-datasets I want. But I
> wonder whether there is any way to traverse the large dataset only once?
> In my case the large dataset is more than several TB, and each record in it
> can be classified into only one sub-dataset.
> Any help is appreciated.
> Thanks
> Xiang
> --
> Xiang Huo
> Department of Computer Science
> University of Illinois at Chicago(UIC)
> Chicago, Illinois
> US
