spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonal Goyal <>
Subject Re: A Spark Design Problem
Date Fri, 31 Oct 2014 18:56:11 GMT
Does the following help?

JavaPairRDD<bin,key> join with JavaPairRDD<bin,lock>

If you partition both RDDs by the bin id, I think you should be able to get
what you want.

Best Regards,
Nube Technologies <>


On Fri, Oct 31, 2014 at 11:19 PM, <> wrote:

> Hi Steve,
> Are you talking about sequence alignment ?
> —
> FG
> On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis <>
> wrote:
>>  The original problem is in biology but the following captures the CS
>> issues, Assume I  have a large number of locks and a large number of keys.
>> There is a scoring function between keys and locks and a key that  fits a
>> lock will have a high score. There may be many keys fitting one lock and a
>> key may fit no locks well. The object is to find the best fitting lock for
>> each key.
>> Assume that the number of keys and locks is high enough that taking the
>> cartesian product of the two is computationally impractical. Also assume
>> that keys and locks have an attached location which is accurate within an
>> error (say 1 Km). Only keys and locks within 1 Km need be compared.
>> Now assume I can create a JavaRDD<Keys> and a JavaRDD<Locks> . I could
>> divide the locations into 1 Km squared bins and look only within a few
>> bins. Assume that it is practical to take a cartesian product for all
>> elements in a bin but not to keep all elements in memory. I could map my
>> RDDs into PairRDDs where the key is the bin assigned by location
>> I know how to take the cartesian product of two JavaRDDs but not how to
>> take a cartesian product of sets of elements sharing a common key (bin),
>> Any suggestions. Assume that in the worst cases the number of elements in a
>> bin are too large to keep in memory although if a bin were subdivided into,
>> say 100 subbins elements would fit in memory.
>> Any thoughts as to how to attack the problem

View raw message