spark-user mailing list archives

From Brandon Geise <brandonge...@gmail.com>
Subject Re: how to create all possible combinations from an array? how to join and explode row array?
Date Sat, 31 Mar 2018 01:15:04 GMT
Possibly instead of doing the initial grouping, just do a full outer join on xyzy. This is
in Scala but should be easy to convert to Python.

 

// toDF on an RDD needs the implicit encoders in scope: import spark.implicits._
val data = Array(("john", "red"), ("john", "blue"), ("john", "red"),
  ("bill", "blue"), ("bill", "red"), ("sam", "green"))

val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a", "b")
distData.show()

+----+-----+
|   a|    b|
+----+-----+
|john|  red|
|john| blue|
|john|  red|
|bill| blue|
|bill|  red|
| sam|green|
+----+-----+

 

 

distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), "fullouter").select("tbl1.b", "tbl2.b").distinct.show()

 

+-----+-----+
|    b|    b|
+-----+-----+
| blue|  red|
|  red| blue|
|  red|  red|
| blue| blue|
|green|green|
+-----+-----+
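For readers without a Spark shell handy, the same idea can be sketched in plain Python (my own illustration, not Spark API): group the (name, color) records by name, then cross each group's colors with themselves, mirroring the self-join on "a" above.

```python
# Plain-Python sketch of the self-join-per-key pairing above.
from itertools import product

data = [("john", "red"), ("john", "blue"), ("john", "red"),
        ("bill", "blue"), ("bill", "red"), ("sam", "green")]

# Group colors by name.
groups = {}
for name, color in data:
    groups.setdefault(name, []).append(color)

# Cross each group's colors with themselves; a set gives "distinct".
pairs = set()
for colors in groups.values():
    pairs.update(product(colors, colors))

print(sorted(pairs))
# [('blue', 'blue'), ('blue', 'red'), ('green', 'green'),
#  ('red', 'blue'), ('red', 'red')]
```

This reproduces the five distinct rows in the Scala output above.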

 

 

From: Andy Davidson <Andy@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 8:58 PM
To: Andy Davidson <Andy@SantaCruzIntegration.com>, user <user@spark.apache.org>
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?

 

I was a little sloppy when I created the sample output. It's missing a few pairs.

 

Assume for a given row I have [a, b, c]; I want to create something like the Cartesian product of the array with itself.

 

From: Andrew Davidson <Andy@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <user@spark.apache.org>
Subject: how to create all possible combinations from an array? how to join and explode row array?

 

I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")).

 

This produces a column of type array. Now, for each row, I want to create multiple pairs/tuples
from the array so that I can create a contingency table. Any idea how I can transform my
data so that I can call crosstab()? The join transformations operate on the entire dataframe; I
need something at the row/array level.
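To make the goal concrete, here is a minimal plain-Python sketch (my own illustration, not the Spark crosstab API) of what the contingency table amounts to: once each row's array has been expanded into pairs, the table is just a count of (x1, x2) pairs.

```python
# Hypothetical illustration: count co-occurrence pairs into a
# contingency table using collections.Counter.
from collections import Counter

# Example pairs as would be produced from one row's array.
pairs = [("red", "blue"), ("red", "red"), ("blue", "red")]

table = Counter(pairs)
print(table[("red", "blue")])  # 1
```

Spark's DataFrameStatFunctions.crosstab would do the counting once the data is in two plain columns, which is the transformation being asked about here.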





Below is some sample Python that describes what I would like my results to be.



Kind regards



Andy

 

 

import pandas as pd

c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a": c1, "b": c2})

df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()

 

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+

 

 

The output I am trying to create is below. I could live with a crossJoin (Cartesian join) and add
my own filtering if that makes the problem easier.

 

 

+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
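One way to sketch the per-row pairing in plain Python (assuming the goal is every ordered pair of two distinct array positions, which reproduces the table above for [red, blue, red]):

```python
from itertools import permutations

row = ["red", "blue", "red"]

# permutations(row, 2) yields every ordered pair of two distinct
# *positions*, so duplicate values produce duplicate pairs; a set
# dedupes them.
pairs = set(permutations(row, 2))

print(sorted(pairs))
# [('blue', 'red'), ('red', 'blue'), ('red', 'red')]
```

In Spark this per-row logic could be applied with a UDF over the array column, then exploded back into two plain columns for crosstab().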

 

 

