spark-user mailing list archives

From Xiang Huo <huoxiang5...@gmail.com>
Subject Re: How to split one big RDD into several small ones
Date Wed, 11 Sep 2013 16:03:49 GMT
Hi Krishna,

Thanks for your reply; I think it is helpful. I am reading the source code
of Partitioner.scala, but I am still not clear about how to use partitions
and a partitioner on an RDD. Is there a simple example somewhere that
shows how to call them?

Thanks.

Xiang


2013/9/11 Krishna Pisupat <krishna.pisupat@gmail.com>

> BTW, when I suggested using one partition per filter, I meant overriding
> the default Partitioner with a custom one, so that all elements that
> satisfy a given filter end up in the same partition. The filter logic
> has to go into the custom partitioner.
>
>
> On Wed, Sep 11, 2013 at 1:15 AM, Krishna Pisupat <
> krishna.pisupat@gmail.com> wrote:
>
>> I think there is no direct way. Did you look at using partitions to
>> achieve it? All the elements that satisfy a filter would belong to one
>> partition. Look at PartitionPruningRDD:
>> http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/rdd/PartitionPruningRDD.html
>> Maybe it could help you achieve what you are trying to do.
>>
>>
>> On Tue, Sep 10, 2013 at 6:24 PM, Xiang Huo <huoxiang5659@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to extract several sub-datasets from one large dataset. I know
>>> one method is to run val small = big.filter(...) and then save that RDD
>>> as a text file, repeating this n times, where n is the number of
>>> sub-datasets I want. But I wonder whether there is any way to do this while
>>> traversing the large dataset only once, because in my case the large
>>> dataset is more than several TB and each record in it can be classified
>>> into only one sub-dataset.
>>>
>>> Any help is appreciated.
>>>
>>> Thanks
>>>
>>> Xiang
>>> --
>>> Xiang Huo
>>> Department of Computer Science
>>> University of Illinois at Chicago(UIC)
>>> Chicago, Illinois
>>> US
>>> Email: huoxiang5659@gmail.com
>>>            or xhuo4@uic.edu
>>>
>>
>>
>


-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu
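
The custom-partitioner approach described in the quoted replies above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the FilterPartitioner class, the first-letter classification rule, the sample records, and the output path are all hypothetical, and the imports assume the org.apache.spark package layout (older releases used different package names, and SparkConf is not available in the earliest ones). Note that partitionBy shuffles the data, so on a multi-TB dataset this trades n filter passes for one pass plus a shuffle.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older releases
import org.apache.spark.rdd.PartitionPruningRDD

// Hypothetical partitioner: the "filter logic" decides which partition
// (that is, which sub-dataset) each record belongs to.
class FilterPartitioner extends Partitioner {
  override def numPartitions: Int = 3

  override def getPartition(key: Any): Int = key match {
    case s: String if s.startsWith("a") => 0 // sub-dataset 0
    case s: String if s.startsWith("b") => 1 // sub-dataset 1
    case _                              => 2 // everything else
  }
}

object SplitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-example").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Stand-in for the large dataset.
    val big = sc.parallelize(Seq("apple", "banana", "avocado", "cherry", "blueberry"))

    // partitionBy is only defined on key-value RDDs, so key each record by itself,
    // repartition with the custom partitioner, then drop the dummy value.
    val partitioned = big.map(r => (r, ())).partitionBy(new FilterPartitioner).keys

    // Show which partition each record landed in.
    partitioned
      .mapPartitionsWithIndex((idx, it) => it.map(r => (idx, r)))
      .collect()
      .foreach(println)

    // Keep only partition 0 without scanning the other partitions.
    val onlyA = PartitionPruningRDD.create(partitioned, idx => idx == 0)
    println(onlyA.collect().mkString(", "))

    // saveAsTextFile writes one part file per partition, so each sub-dataset
    // ends up in its own file. The path is a placeholder and must not exist yet.
    partitioned.saveAsTextFile("/tmp/split-output")

    sc.stop()
  }
}
```

With this layout each output part file corresponds to one sub-dataset, and PartitionPruningRDD can be used afterwards to work on a single sub-dataset in memory.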
