spark-user mailing list archives

From Krishna Pisupat <krishna.pisu...@gmail.com>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 09:03:00 GMT
BTW, when I said to use one partition per filter, I meant overriding the
default Partitioner with a custom one, so that all elements that satisfy a
given filter end up in the same partition. The filter logic has to go into
the custom partitioner.
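
Something like this, roughly (an untested sketch, not code from a real job:
the predicates, the HDFS paths, and the SparkContext `sc` are placeholders,
and I'm assuming the records are plain Strings):

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)
    import org.apache.spark.rdd.RDD

    // Placeholder predicates -- one per sub-dataset you want to carve out.
    val filters: Seq[String => Boolean] = Seq(
      _.startsWith("a"),
      _.startsWith("b")
    )

    // Partition i holds exactly the records that satisfy filter i;
    // records that match no filter fall into an extra last partition.
    class FilterPartitioner(filters: Seq[String => Boolean]) extends Partitioner {
      def numPartitions: Int = filters.size + 1
      def getPartition(key: Any): Int = {
        val i = filters.indexWhere(f => f(key.asInstanceOf[String]))
        if (i >= 0) i else filters.size
      }
    }

    // partitionBy works on key/value RDDs, so key each record by itself,
    // shuffle once with the custom partitioner, then drop the dummy value.
    val big: RDD[String] = sc.textFile("hdfs:///path/to/big")   // placeholder path
    val split = big
      .map(record => (record, 1))
      .partitionBy(new FilterPartitioner(filters))
      .keys

The point is that the full scan and the classification happen only once, during
the partitionBy shuffle.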


On Wed, Sep 11, 2013 at 1:15 AM, Krishna Pisupat
<krishna.pisupat@gmail.com> wrote:

> I think there is no direct way. Did you look at using partitions to
> achieve it? All the elements that satisfy a filter would belong to one
> partition. Look at PartitionPruningRDD:
> http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/rdd/PartitionPruningRDD.html.
> Maybe it could help you achieve what you are trying to do.
>
>
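
For reference, this is roughly how PartitionPruningRDD would sit on top of the
FilterPartitioner sketch above (again just a sketch: PartitionPruningRDD is a
developer-level API, and the output path is a placeholder):

    import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

    // Sub-dataset i is simply partition i of the repartitioned RDD;
    // PartitionPruningRDD gives an RDD over just that partition without
    // scheduling tasks for any of the others.
    def subDataset(split: RDD[String], i: Int): RDD[String] =
      PartitionPruningRDD.create(split, partitionIndex => partitionIndex == i)

    // e.g. pull out only the records that satisfied the first filter
    subDataset(split, 0).saveAsTextFile("hdfs:///path/to/subset-0")   // placeholder path
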
> On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <huoxiang5659@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to get several sub-datasets from one large dataset. I know one
>> method: I can run val small = big.filter(...) and then save the resulting RDD
>> as a text file, repeated n times, where n is the number of sub-datasets I
>> want. But I wonder whether there is any way to traverse the large dataset
>> only once, because in my case the large dataset is more than several TB and
>> each record in it can only be classified into one sub-dataset.
>>
>> Any help is appreciated.
>>
>> Thanks
>>
>> Xiang
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago(UIC)
>> Chicago, Illinois
>> US
>> Email: huoxiang5659@gmail.com
>>            or xhuo4@uic.edu
>>
>
>
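
On Xiang's original concern about scanning the multi-TB dataset n times: once the
data is partitioned as in the sketch above, a single save produces one output
file per filter, so the dataset is read and classified exactly once (paths are
placeholders again):

    // One pass: saveAsTextFile writes one part-NNNNN file per partition,
    // i.e. one output file per filter.
    split.saveAsTextFile("hdfs:///path/to/split-output")

    // versus the original approach, which re-scans the whole dataset n times:
    // filters.zipWithIndex.foreach { case (f, i) =>
    //   big.filter(f).saveAsTextFile(s"hdfs:///path/to/subset-$i")
    // }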
