spark-user mailing list archives

From elitejyo <>
Subject RDD vs Broadcast
Date Mon, 15 Dec 2014 08:31:28 GMT
We are developing a Spark application in which we load historical data into
RDDs.

Basically, an RDD is an immutable, read-only dataset on which we run
operations. Accordingly, we have loaded the historical data into RDDs and
perform computations such as filtering and mapping on them.

Now there is a use case where a subset of the data in the RDD gets updated
and we have to recompute the values.

So far I have been able to think of the approaches below.

Approach 1 - broadcast the change:
1. I have already filtered the historical RDD down to the relevant scope.
2. Whenever the values are updated, I broadcast the changes and apply a map
phase over the RDD from step 1, doing a lookup into the broadcast for each
record, thereby creating a new RDD.
3. I then rerun all the computations on the new RDD from step 2.
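The lookup step in Approach 1 can be sketched in plain Python (no Spark needed): the dict stands in for the broadcast variable shipped to every executor, and a list of (key, value) pairs stands in for the filtered RDD. All names here are illustrative, not Spark API.

```python
# Sketch of Approach 1: apply broadcast updates with a map phase.
# historical: list of (key, value) pairs, standing in for the filtered RDD.
# updates: dict, standing in for the broadcast variable (assumed small
# enough to ship to every executor).

def apply_broadcast_updates(historical, updates):
    # The map phase: for each record, look the key up in the broadcast
    # and take the updated value if present, else keep the old one.
    return [(k, updates.get(k, v)) for k, v in historical]

historical = [("a", 1), ("b", 2), ("c", 3)]
updates = {"b": 20}
print(apply_broadcast_updates(historical, updates))
# [('a', 1), ('b', 20), ('c', 3)]
```

In Spark this corresponds to `rdd.map` over the historical RDD with a lookup into `broadcastVar.value`; it only works well while the update set stays small enough to broadcast.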

Approach 2 - join with a delta RDD:
1. Maintain the historical data RDDs.
2. Maintain a delta RDD holding updates to the historical data. Since
initially there are no updates, it starts out empty.
3. Whenever the values are updated, create a new delta RDD and discard the
old one.
4. Recompute the values by joining the historical RDDs with the delta RDD.
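The join step above can be sketched in plain Python (again without Spark), where both sides are lists of (key, value) pairs standing in for RDDs and the delta holds only the updated keys. The function names are hypothetical; `left_outer_join` mimics the shape of an RDD left outer join, yielding `(key, (historical_value, delta_value_or_None))`.

```python
# Sketch of Approach 2: recompute by joining historical data with a delta.

def left_outer_join(historical, delta):
    # Mimics a left outer join on key: every historical record survives,
    # paired with the matching delta value or None when the key was
    # not updated.
    delta_map = dict(delta)
    return [(k, (v, delta_map.get(k))) for k, v in historical]

def recompute(historical, delta):
    # Prefer the delta value when the key was updated, otherwise keep
    # the historical value.
    return [(k, new if new is not None else old)
            for k, (old, new) in left_outer_join(historical, delta)]

historical = [("a", 1), ("b", 2), ("c", 3)]
delta = [("c", 30)]
print(recompute(historical, delta))
# [('a', 1), ('b', 2), ('c', 30)]
```

In Spark this corresponds to `historicalRDD.leftOuterJoin(deltaRDD)` followed by a map that resolves each pair; unlike the broadcast approach, the delta can itself be large and partitioned.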

Approach 3 - a streaming delta:
I had also thought of making the delta a streaming RDD, continually
updating it and re-running the computation. But as far as I understand,
Spark Streaming takes its input from sources such as Flume or Kafka,
whereas in my case the updates are generated within the application itself,
based on user interaction. Hence I cannot see an integration point for a
streaming RDD in my context.

Any suggestion on which approach is better, or on any other approach
suitable for this scenario, would be appreciated.

