spark-user mailing list archives

From Aaron <aaron.doss...@target.com>
Subject RE: Efficiently doing an analysis with Cartesian product (pyspark)
Date Wed, 25 Jun 2014 13:10:21 GMT
Thank you, Mayur.  Could you provide some pseudocode for what the direct lookup would look like?
I have struggled to implement that.

I ended up doing a Cartesian product of (key, values) to itself.  Something like this…

mappedToLines = input.map(lambda line: line.split())
items = mappedToLines.map(lambda x: (x[0], x[1])).groupByKey()
itemPairs = items.cartesian(items)
TC = itemPairs.map(lambda x: (x[0][0], x[1][0], <calculate something about the values>))

That actually ended up being incredibly memory efficient, perhaps because the second line
of code creates a few million keys, but each key typically has fewer than 10 values grouped
to it.

I do end up with twice as many entries as I need, though: (x, y, <calculated value>)
has the same calculated value as (y, x, <calculated value>).  Would there be a good
way to eliminate the duplicates?
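
One possible way to drop the mirrored entries, sketched here under the assumption that the keys can be ordered: keep a pair only when its first key is less than or equal to its second. The name is_canonical and the sample data are hypothetical, and itertools.product stands in for cartesian() so the snippet runs without a cluster.

```python
from itertools import product

def is_canonical(pair):
    """Keep ((k1, v1), (k2, v2)) only when k1 <= k2, so (x, y) survives and (y, x) is dropped."""
    (k1, _), (k2, _) = pair
    return k1 <= k2

# Hypothetical grouped data: (key, values), shaped like the output of groupByKey()
items = [(1, [100, 200]), (2, [100, 300]), (3, [50])]

# product() emulates items.cartesian(items) locally
item_pairs = list(product(items, items))
kept = [p for p in item_pairs if is_canonical(p)]
key_pairs = [(a[0], b[0]) for a, b in kept]
print(key_pairs)  # [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
```

On an RDD the same predicate could presumably be passed to filter() before the map(), for example itemPairs.filter(is_canonical). Using < instead of <= would also drop the self-pairs.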

Thank you again!  -Aaron

From: Mayur Rustagi [via Apache Spark User List] [mailto:ml-node+s1001560n8206h42@n3.nabble.com]
Sent: Tuesday, June 24, 2014 5:39 PM
To: Aaron.Dossett
Subject: Re: Efficiently doing an analysis with Cartesian product (pyspark)

How about this..
map it to key,value pairs, then reduceByKey using a max operation.
Then in the RDD you can do a join with your lookup data & reduce (if you only want to look up
2 values then you can use lookup directly as well).
PS: these are operations in the Scala API; I am not aware how far the pyspark API is in those.
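
A rough sketch of what that could look like in Python, using the sample data from the original question. It is a local emulation, not something tested against a cluster: the helper names max_by_key and pair_sums are made up for illustration, and the comments name the RDD calls (reduceByKey, collectAsMap, cartesian) that each step is meant to stand in for. The idea is to collect the per-key maxima into an ordinary dict, which is small, and look values up in that dict rather than passing an RDD into map().

```python
from itertools import product

def max_by_key(pairs):
    """Local stand-in for rdd.reduceByKey(max).collectAsMap()."""
    best = {}
    for key, value in pairs:
        best[key] = max(best.get(key, value), value)
    return best

def pair_sums(lookup):
    """Map every (k1, k2) key pair to the sum of the two per-key maxima."""
    keys = sorted(lookup)
    # product() stands in for keys.cartesian(keys)
    return {(a, b): lookup[a] + lookup[b] for a, b in product(keys, keys)}

data = [(1, 100), (1, 200), (2, 100), (2, 300)]
lookup = max_by_key(data)      # {1: 200, 2: 300}
result = pair_sums(lookup)
print(result[(1, 1)], result[(1, 2)], result[(2, 2)])  # 400 500 600
```

In PySpark the dict would typically be captured by the function passed to map(), or wrapped with sc.broadcast if it is large.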

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi<https://twitter.com/mayur_rustagi>


On Tue, Jun 24, 2014 at 3:33 AM, Aaron <[hidden email]> wrote:
Sorry, I got my sample outputs wrong

(1,1) -> 400
(1,2) -> 500
(2,2) -> 600

On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]" <[hidden email]> wrote:
I am relatively new to Spark and am getting stuck trying to do the following:

- My input is integer key, value pairs where the key is not unique.  I'm interested in information
about all possible distinct key combinations, thus the Cartesian product.
- My first attempt was to create a separate RDD of this Cartesian product and then use map()
to calculate the data.  However, I was trying to pass another RDD to the function that map() was
calling, which I eventually figured out was causing a runtime error, even if the function
I called with map() did nothing.  Here's a simple code example:

-------
def somefunc(x, y, RDD):
  return 0

input = sc.parallelize([(1,100), (1,200), (2, 100), (2,300)])

#Create all pairs of keys, including self-pairs
itemPairs = input.map(lambda x: x[0]).distinct()
itemPairs = itemPairs.cartesian(itemPairs)

print itemPairs.collect()

TC = itemPairs.map(lambda x: (x, somefunc(x[0], x[1], input)))

print TC.collect()
------

I'm assuming this isn't working because it isn't a very Spark-like way to do things, and I
could imagine that passing RDDs into other RDDs' map functions might not make sense.  Could
someone suggest a way to apply transformations and actions to "input" that would produce
a mapping of key pairs to some information related to the values?

For example, I might want (1, 2) to map to the sum of the maximum values found for each
key in the input (500 in my sample data above).  Extending that example, (1,1) would map to
300 and (2,2) to 400.

Please let me know if I should provide more details or a more robust example.

Thank you, Aaron
________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144.html
This email was sent by Aaron Dossett<http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=1353>
(via Nabble)

________________________________
View this message in context: Re: Efficiently doing an analysis with Cartesian product (pyspark)<http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144p8145.html>

Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/>
at Nabble.com.


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144p8206.html

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144p8255.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.