spark-user mailing list archives

From Imran Rashid <iras...@cloudera.com>
Subject Re: Process time series RDD after sortByKey
Date Thu, 12 Mar 2015 00:35:17 GMT
This is a very interesting use case.  First of all, it's worth pointing out
that if you really need to process the data sequentially, you are
fundamentally limiting the parallelism you can get.  E.g., if you need to
process the entire data set sequentially, you can't get any parallelism at
all.  If you can process each hour separately, but need to process the data
within an hour sequentially, then the maximum parallelism you can get for
one day is 24.

But let's say you're OK with that.  Zhan Zhang's solution is good if you
just want to process the entire dataset sequentially.  But what if you
wanted to process each hour separately, so that you can at least create 24
tasks that run in parallel for one day?  I think you would need to create
your own subclass of RDD, similar in spirit to what CoalescedRDD does.
Your RDD would have 24 partitions, and each partition would depend on some
set of partitions in its parent (your sorted RDD with 1000 partitions).  I
don't think you could use CoalescedRDD directly, because you want more
control over the way the partitions get grouped together.
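To make that concrete, here is a rough sketch of what such a subclass might
look like.  This is not tested production code: `HourlyGroupedRDD` and the
`groups` array (mapping each output partition to the parent partition ids it
should read) are hypothetical names, and you would have to compute the
grouping yourself from your range partitioner's key boundaries.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One output partition per group (e.g. per hour).
private case class GroupPartition(index: Int, parentIds: Seq[Int]) extends Partition

// Hypothetical sketch, in the spirit of CoalescedRDD: each of the 24
// partitions narrowly depends on a chosen set of parent partitions.
class HourlyGroupedRDD[T: ClassTag](
    @transient parent: RDD[T],
    groups: Array[Seq[Int]])            // groups(i) = parent partition ids for group i
  extends RDD[T](parent.context, Nil) {

  override def getPartitions: Array[Partition] =
    groups.zipWithIndex.map { case (ids, i) => GroupPartition(i, ids) }

  // Narrow dependency: each group reads only its own parent partitions,
  // so no shuffle is introduced on top of the sort.
  override def getDependencies: Seq[Dependency[_]] = Seq(
    new NarrowDependency(parent) {
      override def getParents(partitionId: Int): Seq[Int] = groups(partitionId)
    })

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val parentRdd = dependencies.head.rdd.asInstanceOf[RDD[T]]
    // Iterate the parent partitions of this group in order, sequentially.
    split.asInstanceOf[GroupPartition].parentIds.iterator.flatMap { pid =>
      parentRdd.iterator(parentRdd.partitions(pid), context)
    }
  }
}
```

Because the dependency is narrow, each of the 24 tasks streams through its
parent partitions one at a time, so you keep within-hour ordering without
ever materializing 20GB in one place.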

This answer is very similar to my answer to your other question about
controlling partitions; hope it helps! :)


On Mon, Mar 9, 2015 at 5:41 PM, Shuai Zheng <szheng.code@gmail.com> wrote:

> Hi All,
>
>
>
> I am processing some time series data. One day might have around 500GB,
> so each hour is around 20GB of data.
>
>
>
> I need to sort the data before I start processing. Assume I can sort it
> successfully
>
>
>
> *dayRDD.sortByKey*
>
>
>
> but after that, I might have thousands of partitions (needed to make the
> sort succeed), maybe 1000 partitions. Then I try to process the data by
> hour (it doesn't need to be exactly one hour, just some similar time
> frame). And I can't just repartition to 24, because then one partition
> might be too big to fit into memory (if it is 20GB). So is there any way
> for me to process the underlying partitions in a certain order? Basically
> I want to call mapPartitionsWithIndex with a range of indexes.
>
>
>
> Any way to do it? Hope I described my issue clearly. :)
>
>
>
> Regards,
>
>
>
> Shuai
>
>
>
>
>
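For reference, the "range of indexes" idea from the quoted question can be
approximated without a custom RDD, at the cost of still scheduling a (cheap,
empty) task for every partition.  This is only a sketch; `partitionRange` is
a hypothetical helper, and the `start`/`end` bounds are something you would
derive from your range partitioner's key boundaries.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: keep only the partitions whose index falls in [start, end);
// every other partition emits nothing. Spark still launches one task per
// partition, but the out-of-range tasks are nearly free.
def partitionRange[T: ClassTag](sorted: RDD[T], start: Int, end: Int): RDD[T] =
  sorted.mapPartitionsWithIndex(
    (i, it) => if (i >= start && i < end) it else Iterator.empty,
    preservesPartitioning = true)
```

Depending on your Spark version, `SparkContext.runJob` also accepts an
explicit list of partition ids, which lets you run a function over just the
partitions for one hour without touching the rest at all.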
