spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <>
Subject RE: nested collection object query
Date Tue, 29 Sep 2015 13:27:18 GMT
You have 2 options:
Option 1:
Use lateral view explode, as you did below. But if you want to remove the duplicate, then
use distinct after that.
For example:
col1, col2, ArrayOf(Struct)
After explode:
col1, col2, employee0col1, col2, employee1col1, col2, employee0
Then select distinct col1, col2 from ... where'employee0'
Option 2: Implement your own UDF, to do the logic you want to do. In fact, in the Hive, there
is already one called array_contains(), which check if the array contain the data you want.
But in  your case, your data in the array is a struct, and you only want to compare name of
the struct, instead of whole struct. You need to override the equals() logic of array_contains()
in the Hive, so you have to implement that by custom UDF.
See the hive function of array_contains here:
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 23:02:41 -0700

Well I figure out a way to use explode. But it returns two rows if there is two match in nested
array objects.
select id from department LATERAL VIEW explode(employee) dummy_table as emp where
= 'employee0'
I was looking for an operator that loops through the array and return true if it matches the
condition and returns the parent object.
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 22:26:46 -0700

Thanks for you response Yong! Array syntax works fine. But I am not sure how to use explode.
Should I use as follows?
select id from department where explode(employee).name = 'employee0
This query gives me java.lang.UnsupportedOperationException . I am using HiveContext.
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 20:42:11 -0400

Your employee in fact is an array of struct, not just struct.
If you are using HiveSQLContext, then you can refer it like following:
select id from member where employee[0].name = 'employee0'
The employee[0] is pointing to the 1st element of the array. 
If you want to query all the elements in the array, then you have to use "explode" in the
See Hive document for this:

> Date: Mon, 28 Sep 2015 16:37:23 -0700
> From:
> To:
> Subject: nested collection object query
> Hi Friends,
> What is the right syntax to query on collection of nested object? I have a
> following schema and SQL. But it does not return anything. Is the syntax
> correct?
> root
>  |-- id: string (nullable = false)
>  |-- employee: array (nullable = false)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- id: string (nullable = false)
>  |    |    |-- name: string (nullable = false)
>  |    |    |-- speciality: string (nullable = false)
> select id from member where = 'employee0'
> Uploaded a test if some one want to try it out.
> <>
> --
> View this message in context:
> Sent from the Apache Spark User List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
View raw message