spark-user mailing list archives

From Puneet Kapoor
Subject Save RDD with partition information
Date Tue, 13 Jan 2015 21:30:26 GMT

I have a use case in which an hourly Spark job creates hourly RDDs,
which are partitioned by key.

At the end of the day I need to access all of these RDDs and combine the
Key/Value pairs over the day.

If there is a key K1 in RDD0 (1st hour of the day), RDD1 ... RDD23 (last hour
of the day), we need to combine all the values of K1 using some logic.

What I want to do is avoid the shuffle at the end of the day, since the
data is huge (hundreds of GB).

1.) Is there a way that I can persist the hourly RDDs with their partition
information, so that the partition information is restored when the RDDs are
read back?
2.) Can I ensure that the partitioning is the same across different hours?
For example, if K1 goes to container_X, it should go to the same container in
the next hour, and so on.
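
To make question 2 concrete, here is a minimal sketch in plain Python (not the
Spark API) of the behaviour being asked for: if every hour uses the same
deterministic hash and the same number of partitions, a given key always lands
in the same partition index, so the daily combine only has to merge partition i
of one hour with partition i of the others. The names NUM_PARTITIONS,
partition_for, partition_hour and combine_day are hypothetical, the partition
count of 4 is an assumption, and sum stands in for "some logic".

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value; it must stay fixed across all hours


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Return a deterministic partition index for a string key.

    zlib.crc32 is stable across runs and processes, unlike Python's
    built-in hash() for strings, which is randomized per process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_hour(records, num_partitions=NUM_PARTITIONS):
    """Split one hour's (key, value) pairs into per-partition dicts."""
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_for(key, num_partitions)][key].append(value)
    return parts


def combine_day(hourly_parts, combine=sum):
    """Merge partition i of every hour only with partition i of the others."""
    num_partitions = len(hourly_parts[0])
    result = {}
    for i in range(num_partitions):
        merged = defaultdict(list)
        for hour in hourly_parts:
            for key, values in hour[i].items():
                merged[key].extend(values)
        for key, values in merged.items():
            result[key] = combine(values)  # placeholder for "some logic"
    return result


# Two hypothetical hours of data sharing the key K1.
hour0 = partition_hour([("K1", 1), ("K2", 10)])
hour1 = partition_hour([("K1", 2), ("K3", 5)])
daily = combine_day([hour0, hour1])
```

Whether a persisted RDD keeps this layout when it is read back (question 1) is
a separate matter that the sketch does not address.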

