spark-user mailing list archives

From Yong Zhang <java8...@hotmail.com>
Subject Re: how to create all possible combinations from an array? how to join and explode row array?
Date Sat, 31 Mar 2018 02:24:55 GMT
What's wrong with just using a UDF that does a for loop in Scala? You can change the for loop
logic to produce whatever combinations you want.


scala> spark.version
res4: String = 2.2.1

scala> aggDS.printSchema
root
 |-- name: string (nullable = true)
 |-- colors: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> aggDS.show(false)
+----+----------------+
|name|colors          |
+----+----------------+
|john|[red, blue, red]|
|bill|[blue, red]     |
|sam |[gree]          |
+----+----------------+

scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> val loopUDF = udf { x: Seq[String] => for (a <- x; b <- x) yield (a, b) }
loopUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(ArrayType(StringType,true))))

scala> aggDS.withColumn("newCol", loopUDF($"colors")).show(false)
+----+----------------+---------------------------------------------------------------------------------------------------------+
|name|colors          |newCol                                                                                                   |
+----+----------------+---------------------------------------------------------------------------------------------------------+
|john|[red, blue, red]|[[red,red], [red,blue], [red,red], [blue,red], [blue,blue], [blue,red], [red,red], [red,blue], [red,red]]|
|bill|[blue, red]     |[[blue,blue], [blue,red], [red,blue], [red,red]]                                                         |
|sam |[gree]          |[[gree,gree]]                                                                                            |
+----+----------------+---------------------------------------------------------------------------------------------------------+
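
As an illustration of changing the loop logic, here is a minimal plain-Python sketch of two pairing rules. The function names all_pairs and unordered_distinct_pairs are made up for this example and are not part of Spark. The first mirrors the nested for loop above; the second is one possible filter, assuming duplicates and self-pairs are unwanted.

```python
from itertools import combinations, product


def all_pairs(colors):
    """Every ordered pair, self-pairs included (same as the nested for loop)."""
    return list(product(colors, repeat=2))


def unordered_distinct_pairs(colors):
    """Each unordered pair of distinct values once; duplicates are collapsed."""
    return list(combinations(sorted(set(colors)), 2))


if __name__ == "__main__":
    print(all_pairs(["blue", "red"]))
    print(unordered_distinct_pairs(["red", "blue", "red"]))
```

To use either function from PySpark it would presumably have to be wrapped with pyspark.sql.functions.udf and given an array-of-struct return type; that step is not shown here.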

Yong


________________________________
From: Andy Davidson <Andy@SantaCruzIntegration.com>
Sent: Friday, March 30, 2018 8:58 PM
To: Andy Davidson; user
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?

I was a little sloppy when I created the sample output. It's missing a few pairs.

Assume that for a given row I have [a, b, c]. I want to create something like a cartesian join.

From: Andrew Davidson <Andy@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <user@spark.apache.org>
Subject: how to create all possible combinations from an array? how to join and explode row array?

I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc"))

This produces a column of type array. Now, for each row, I want to create multiple pairs/tuples
from the array so that I can create a contingency table. Any idea how I can transform my
data so that I can call crosstab()? The join transformation operates on the entire dataframe;
I need something at the row array level.


Below is some sample Python that describes what I would like my results to be.

Kind regards

Andy


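import pandas as pd

# Sample data: one name per row, each with an array of colors
c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a": c1, "b": c2})

# sqlContext is assumed to be provided by the PySpark shell or notebook
df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()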

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+


The output I am trying to create is shown below. I could live with a crossJoin (cartesian join)
and add my own filtering if it makes the problem easier.


+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
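
For reference, here is a rough plain-Python sketch of the contingency counts that the row-level cartesian pairs would produce for the sample data above. It runs outside Spark, and the names rows and pair_counts are hypothetical, chosen only for this illustration. It assumes every ordered pair, including self-pairs, is counted; any filtering would go inside the loop.

```python
from collections import Counter
from itertools import product

# Same sample data as the dataframe above, keyed by name
rows = {
    "john": ["red", "blue", "red"],
    "bill": ["blue", "red"],
    "sam": ["green"],
}


def pair_counts(rows):
    """Tally every ordered (x1, x2) pair produced within each row's array."""
    counts = Counter()
    for colors in rows.values():
        counts.update(product(colors, repeat=2))
    return counts


if __name__ == "__main__":
    for (x1, x2), n in sorted(pair_counts(rows).items()):
        print(x1, x2, n)
```

In Spark, the same tally would presumably come from exploding the per-row pairs into x1 and x2 columns and then calling crosstab() on those two columns.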


