spark-user mailing list archives

From mathewwicks <mathew.wi...@gmail.com>
Subject Do we support excluding the current row in PARTITION BY windowing functions?
Date Mon, 03 Apr 2017 08:52:26 GMT
Here is an example to illustrate my point.

In this toy example, we collect a list of the other products that each
user has bought and append it as a new column. (Note that we also filter
on an arbitrary column, 'good_bad'.)

I would like to know whether window functions support excluding the
CURRENT ROW from the PARTITION BY window.
(E.g. transaction 1 would have `other_purchases = [prod2, prod3]` rather
than `other_purchases = [prod1, prod2, prod3]`.)
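
To make the desired result concrete, here is a plain-Python sketch (no Spark involved) of the output I would expect for the toy data. The helper name `other_purchases` is made up for illustration, and the last trans_id is assumed to be 6 so that every transaction is unique.

```python
from collections import defaultdict

# Toy data: (trans_id, user_id, prod_id, good_bad).
# Assumption: the last trans_id is 6, so each transaction has a unique id.
rows = [
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (6, "user2", "prod5", "good"),
]


def other_purchases(rows):
    """Map each trans_id to the 'good' products the same user bought in OTHER transactions."""
    by_user = defaultdict(list)
    for trans_id, user_id, prod_id, good_bad in rows:
        by_user[user_id].append((trans_id, prod_id, good_bad))

    result = {}
    for trans_id, user_id, _prod_id, _good_bad in rows:
        result[trans_id] = [
            prod
            for other_id, prod, flag in by_user[user_id]
            if other_id != trans_id and flag == "good"  # skip the current row and 'bad' rows
        ]
    return result


expected = other_purchases(rows)
for trans_id in sorted(expected):
    print(trans_id, expected[trans_id])
```

This only pins down the semantics I am after; it says nothing about how Spark would evaluate it.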

------------------- Code Below -------------------

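from pyspark.sql import SparkSession

# Reuse the active session (the pyspark shell already provides `spark`).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([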
    (1, "user1", "prod1", "good"), 
    (2, "user1", "prod2", "good"), 
    (3, "user1", "prod3", "good"), 
    (4, "user2", "prod3", "bad"), 
    (5, "user2", "prod4", "good"), 
    (6, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id",
    "user_id",
    # Collects every 'good' product in the user's partition, current row included.
    "COLLECT_LIST(CASE WHEN good_bad = 'good' THEN prod_id END) "
    "OVER (PARTITION BY user_id) AS other_purchases"
)
df.show()
----------------------------------------------------


