I am not sure why, but the mailing list is saying: "This post has NOT been accepted by the mailing list yet".

On Mon, 3 Apr 2017 at 20:52 mathewwicks [via Apache Spark User List] <[hidden email]> wrote:
Here is an example to illustrate my point.

In this toy example, we are collecting a list of the other products that each user has bought, and appending it as a new column. (Also note that we are filtering on an arbitrary column, 'good_bad'.)

I would like to know if we support NOT including the current row in the window defined by PARTITION BY.
(E.g. transaction 1 would have `other_purchases = [prod2, prod3]` rather than `other_purchases = [prod1, prod2, prod3]`)

------------------- Code Below -------------------

df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (6, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad"))

df = df.selectExpr(
    "*",
    "COLLECT_LIST(CASE WHEN good_bad = 'good' THEN prod_id END) OVER (PARTITION BY user_id) AS other_purchases")

This email was sent by mathewwicks (via Nabble)

View this message in context: Re: Do we support excluding the current row in PARTITION BY windowing functions?
Sent from the Apache Spark User List mailing list archive at Nabble.com.