spark-user mailing list archives

From Yong Zhang <java8...@hotmail.com>
Subject Re: Need help for RDD/DF transformation.
Date Thu, 30 Mar 2017 13:31:34 GMT
Unfortunately, I don't think there is any optimized way to do this. Maybe someone else can
correct me, but in theory, if you cannot change the data, there is no way around a cartesian
product of your two sides.


Think about it: if you want to join on two different types (Array and Int in your case),
Spark can do neither a HashJoin nor a SortMergeJoin. In the relational-database world, you would
have to do a NestedLoop join, which corresponds to a cartesian join in the big-data world. After
you build the cartesian product of both sides, you check whether the Array column contains the
other column.
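To make the nested-loop idea concrete, here is a plain-Scala sketch (no Spark, just collections) of what the engine would have to do. The object name and the in-memory `Seq`s are illustrative only; they mirror the `df1`/`df2` data from the REPL example below.

```scala
// Plain-Scala sketch of the cartesian (nested-loop) join logic.
// Every row of df1 is compared against every row of df2: O(n * m) pairs.
object CartesianContains {
  // (key, value) pairs, mirroring DF1
  val df1 = Seq((1, "a"), (3, "a"), (5, "b"))
  // (keys, other) pairs, mirroring DF2
  val df2 = Seq((Seq(1, 2, 3), "other1"), (Seq(4, 5), "other2"))

  val joined: Seq[(Int, String, Seq[Int], String)] =
    for {
      (key, value)  <- df1
      (keys, other) <- df2
      if keys.contains(key) // a true Array membership test, not a string match
    } yield (key, value, keys, other)

  def main(args: Array[String]): Unit =
    joined.foreach(println)
}
```

In Spark terms this corresponds to a cross join followed by a filter on the array membership predicate.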


I saw a similar question answered on Stack Overflow<http://stackoverflow.com/questions/41595099/spark-join-dataframe-column-with-an-array>,
but that answer is NOT correct for any serious case.

Assume the key never contains any duplicate strings:


scala> sc.version
res1: String = 1.6.3

scala> val df1 = Seq((1, "a"), (3, "a"), (5, "b")).toDF("key", "value")
df1: org.apache.spark.sql.DataFrame = [key: int, value: string]

scala> val df2 = Seq((Array(1,2,3), "other1"),(Array(4,5), "other2")).toDF("keys", "other")
df2: org.apache.spark.sql.DataFrame = [keys: array<int>, other: string]

scala> df1.show
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  3|    a|
|  5|    b|
+---+-----+


scala> df2.show
+---------+------+
|     keys| other|
+---------+------+
|[1, 2, 3]|other1|
|   [4, 5]|other2|
+---------+------+


scala> df1.join(df2, df2("keys").cast("string").contains(df1("key").cast("string"))).show
+---+-----+---------+------+
|key|value|     keys| other|
+---+-----+---------+------+
|  1|    a|[1, 2, 3]|other1|
|  3|    a|[1, 2, 3]|other1|
|  5|    b|   [4, 5]|other2|
+---+-----+---------+------+

This code appears to work in Spark 1.6.x, but in fact it has a serious issue: it assumes
that no key is a substring of another key, which will not hold for any serious data.
For example, "[11, 2, 3]" contains "1" returns true as a string match, which breaks your logic.

It also won't work in Spark 2.x, because the cast("string") behavior for Array changed in Spark 2.x.
The idea behind the whole thing is to convert your Array field to a String, then use the String
contains method to check whether it contains the other field (also cast to String). But that
cannot reproduce the Array.contains(element) semantics in most cases.
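The false positive is easy to demonstrate with plain Scala, no Spark required (the object name is just for illustration):

```scala
// Why casting an array to a string and using String.contains is unsafe.
object StringCastPitfall {
  // Roughly what the cast("string") trick ends up comparing:
  val rendered = "[11, 2, 3]"                // Array(11, 2, 3) rendered as text
  val stringMatch = rendered.contains("1")   // true: "1" is a substring of "11"

  // What the data actually says:
  val arrayMatch = Seq(11, 2, 3).contains(1) // false: 1 is not in the array

  def main(args: Array[String]): Unit =
    println(s"string match: $stringMatch, array match: $arrayMatch")
}
```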

You need to know your data, then see whether you can find an optimized way to avoid the cartesian
product. For example, if you can guarantee that the "key" in DF1 is always, say, the first
element of the Array in some logical order, you could just pick the first element out of the
Array "keys" in DF2 and join on it. Otherwise, I don't see any way to avoid a cartesian join.
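One such data-dependent restructuring: since the thread below confirms that a key never appears in two different lists, you can "explode" each array into one row per element and then do an ordinary equi-join. This is a plain-Scala sketch of that idea (names are illustrative); in Spark it would be an explode() followed by a regular hash join rather than a cartesian product.

```scala
// Explode-then-join sketch, valid only because each key belongs to at most
// one list (as the original poster confirmed later in the thread).
object ExplodeJoin {
  val df1 = Seq((1, "a"), (3, "a"), (5, "b"))
  val df2 = Seq((Seq(1, 2, 3), "other1"), (Seq(4, 5), "other2"))

  // One (element -> other) entry per array member; safe since keys are unique.
  val exploded: Map[Int, String] =
    df2.flatMap { case (keys, other) => keys.map(_ -> other) }.toMap

  // Standard key lookup instead of a nested loop over all pairs.
  val joined: Seq[(Int, String, String)] =
    df1.flatMap { case (key, value) =>
      exploded.get(key).map(other => (key, value, other))
    }

  def main(args: Array[String]): Unit =
    joined.foreach(println)
}
```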

Yong

________________________________
From: Mungeol Heo <mungeol.heo@gmail.com>
Sent: Thursday, March 30, 2017 3:05 AM
To: ayan guha
Cc: Yong Zhang; user@spark.apache.org
Subject: Re: Need help for RDD/DF transformation.

Hello ayan,

The same key will not exist in different lists.
That is, if "1" exists in one list, it will not be present in
another list.

Thank you.

On Thu, Mar 30, 2017 at 3:56 PM, ayan guha <guha.ayan@gmail.com> wrote:
> Is it possible for one key in 2 groups in rdd2?
>
> [1,2,3]
> [1,4,5]
>
> ?
>
> On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo <mungeol.heo@gmail.com> wrote:
>>
>> Hello Yong,
>>
>> First of all, thank you for your attention.
>> Note that the elements of the same list which have values in RDD/DF 1
>> will always share the same value.
>> Therefore, the "1" and "3", which come from RDD/DF 1, will always have
>> the same value, which is "a".
>>
>> The goal here is to assign that same value to the elements of the list
>> which do not exist in RDD/DF 1,
>> so that all the elements in the same list have the same value.
>>
>> Or, the final RDD/DF also can be like this,
>>
>> [1, 2, 3], a
>> [4, 5], b
>>
>> Thank you again.
>>
>> - Mungeol
>>
>>
>> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang <java8964@hotmail.com> wrote:
>> > What is the desired result for
>> >
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, c
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> >
>> > Yong
>> >
>> > ________________________________
>> > From: Mungeol Heo <mungeol.heo@gmail.com>
>> > Sent: Wednesday, March 29, 2017 5:37 AM
>> > To: user@spark.apache.org
>> > Subject: Need help for RDD/DF transformation.
>> >
>> > Hello,
>> >
>> > Suppose I have two RDDs or data frames as shown below.
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, a
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>> >
>> > 1, a
>> > 2, a
>> > 3, a
>> > 4, b
>> > 5, b
>> >
>> > Is there an efficient way to do this?
>> > Any help will be great.
>> >
>> > Thank you.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>>
>>
> --
> Best Regards,
> Ayan Guha
