Thanks for the response!

Well, in retrospect, each partition doesn't need to be restricted to a single key. But I cannot have the values associated with a key span partitions, since all the values for a key need to be processed together to facilitate the cumulative calcs. So provided an individual key has all of its values in a single partition, I'm OK.

Additionally, the values will be written to the database, and from what I have read, doing this at the partition level is the best compromise between 1) writing the calculated values key by key (lots of connects/disconnects) and 2) collecting them all at the end and writing them all at once. (There's a rough sketch of a partition-level write at the bottom of this mail.)

I am using a groupBy against the filtered RDD to get the grouping I want, but apparently this may not be the most efficient way, and it seems that everything always ends up in a single partition under this scenario.

On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher <firstname.lastname@example.org> wrote:

That seems like it could work, although I don't think `partitionByKey` is a thing, at least for RDD. You might be able to merge step #2 and step #3 into one step by using the `reduceByKey` function signature that takes in a Partitioner implementation.

The tricky part might be getting the partitioner to know the number of partitions, which I think it needs to know upfront. The `HashPartitioner`, for example, takes the number as a constructor argument; maybe you could use that with an upper-bound size if you don't mind empty partitions. Otherwise you might have to mess around to extract the exact number of keys if it's not readily available.

Aside: what is the requirement to have each partition only contain the data related to one key?

On Fri, Sep 4, 2015 at 11:06 AM, mmike87 <email@example.com> wrote:

Hello, I am new to Apache Spark and this is my company's first Spark project.
Essentially, we are using Spark to calculate models over mining data. I am
holding all the source data in a persisted RDD that we will refresh
periodically. When a "scenario" is passed to the Spark job (we're using Job
Server), the persisted RDD is filtered to the relevant mines. For example, we
may want all mines in Chile and the 1990-2015 data for each.
Many of the calculations are cumulative; that is, when we apply user-input
"adjustment factors" to a value, we also need the "flexed" value we
calculated for that mine previously.
To ensure that this works, the idea is to:
1) Filter the superset to relevant mines (done)
2) Group the subset by the unique identifier for the mine. So, a group may
be all the rows for mine "A" for 1990-2015
3) I then want to ensure that the RDD is partitioned by the Mine Identifier
It's step 3 that is confusing me. I suspect it's very easy ... do I simply
call partitionByKey? (See the sketch below.)
We're using Java if that makes any difference.
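
To make this concrete, here's a minimal, self-contained sketch of what I understand Richard to be suggesting: `groupByKey` with an explicit `HashPartitioner`, so the grouping (step 2) and the partitioning (step 3) happen in a single shuffle. The (mineId, value) pairs, the class name, and the partition count of 8 are just stand-ins for our real mine records:

    import java.util.Arrays;

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class PartitionByMine {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("partition-by-mine").setMaster("local[2]"));

            // Stand-in for the filtered superset: (mineId, value) rows. In the
            // real job each value would be a full mine-year record.
            JavaPairRDD<String, Double> byMine = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, Double>("mineA", 1.0),
                new Tuple2<String, Double>("mineA", 2.0),
                new Tuple2<String, Double>("mineB", 3.0),
                new Tuple2<String, Double>("mineB", 4.0)));

            // groupByKey with an explicit Partitioner merges steps 2 and 3: one
            // shuffle leaves the data both grouped and hash-partitioned by mine
            // id, so all the values for a given mine land in exactly one
            // partition. The count (8) is an upper bound; surplus partitions
            // simply stay empty.
            JavaPairRDD<String, Iterable<Double>> grouped =
                byMine.groupByKey(new HashPartitioner(8));

            System.out.println(grouped.partitions().size()); // prints 8
            sc.stop();
        }
    }

If a plain reduce per mine were ever enough, the `reduceByKey` overload that takes a `Partitioner` should give the same placement without materializing the full groups.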
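
And here's roughly how I picture the partition-level database write, as an extra method on the same class: one JDBC connection per partition, rather than one per key or one big write from the driver at the end. The `JDBC_URL` constant, the `flexed_values` table, and the running-sum "calc" are made-up placeholders for our real schema and calculations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Iterator;

    import org.apache.spark.api.java.function.VoidFunction;

    // Added to the PartitionByMine class above.
    static final String JDBC_URL = "jdbc:postgresql://db-host/mines"; // made up

    static void writeFlexedValues(JavaPairRDD<String, Iterable<Double>> grouped) {
        grouped.foreachPartition(
            new VoidFunction<Iterator<Tuple2<String, Iterable<Double>>>>() {
                @Override
                public void call(Iterator<Tuple2<String, Iterable<Double>>> mines)
                        throws Exception {
                    // One JDBC connection per partition, opened on the executor:
                    // far fewer connects than per-key writes, and nothing is
                    // collected back to the driver.
                    Connection conn = DriverManager.getConnection(JDBC_URL);
                    try {
                        PreparedStatement stmt = conn.prepareStatement(
                            "INSERT INTO flexed_values (mine_id, value) VALUES (?, ?)");
                        while (mines.hasNext()) {
                            Tuple2<String, Iterable<Double>> mine = mines.next();
                            double running = 0.0; // cumulative "flexed" value
                            // NB: group order isn't guaranteed; real code would
                            // sort each mine's rows by year first.
                            for (double v : mine._2()) {
                                running += v; // placeholder for the real calc
                                stmt.setString(1, mine._1());
                                stmt.setDouble(2, running);
                                stmt.addBatch();
                            }
                        }
                        stmt.executeBatch();
                    } finally {
                        conn.close();
                    }
                }
            });
    }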