spark-user mailing list archives

From "Cheng, Hao" <>
Subject RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file
Date Sat, 17 Jan 2015 09:42:55 GMT
Wow, glad to know that it works well. Sorry, the Jira is a different issue and does not apply
to this case.

From: Bagmeet Behera []
Sent: Saturday, January 17, 2015 12:47 AM
To: Cheng, Hao
Subject: Re: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet

Hi Cheng, Hao
   An update: I installed the latest binaries of Spark 1.2.0 (prebuilt for Hadoop 2.4 and
later) and tried your suggestion, and it *works* perfectly!
   Therefore I would encourage you to post your reply on the archive for the advantage of

Thanks and best wishes,
BB (Bagmeet)

On Fri, Jan 16, 2015 at 11:20 AM, Bagmeet Behera <<>>
Hi Cheng, Hao
 The awesome thing is: the way you suggest works perfectly on Spark 1.1.0. I am testing
this on an old test installation with Spark 1.1.0 (installed from
with scala 2.10.4.

 Just FYI: I used the old installation because I could not create a HiveContext on the newer
installation of Spark 1.2.0 (Scala 2.10.4) from Cloudera CDH release 5.3.0, which gave a strange
error that looked like an incompatibility between the Hive and Spark libraries. I can create
a post for this (if I find an appropriate user group, perhaps on the Cloudera side), but could
this also be a result of the bug you mention?

 BTW your reply is not in the archives. I guess this is also because of the bug in the current
version you mentioned?

 Many thanks for the reply.

On Fri, Jan 16, 2015 at 3:24 AM, Cheng, Hao <<>>
Hi, BB
   Ideally you can do the query like: select key, value.percent from mytable_data lateral
view explode(audiences) f as key, value limit 3;
   But there is a bug in HiveContext:
   I am working on it now and hope to have a patch soon.

Cheng Hao

-----Original Message-----
From: BB [<>]
Sent: Friday, January 16, 2015 12:52 AM
Subject: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

Hi all,
  Any help on the following is very much appreciated.
  On a schemaRDD read from a parquet file (data within file uses AVRO model) using the HiveContext:
     I can't figure out how to use 'select' or a 'where' clause to filter rows on a field that
has a Map AVRO data type. I want to filter on a given ('key' : 'value') pair. How could
I do this?

* the printSchema of the loaded schemaRDD is like so:

------ output snippet -----
    |-- created: long (nullable = false)
    |-- audiences: map (nullable = true)
    |    |-- key: string
    |    |-- value: struct (valueContainsNull = false)
    |    |    |-- percent: float (nullable = false)
    |    |    |-- cluster: integer (nullable = false)

* I don't get a result when I try to select on a specific value of 'audiences' like so:

      "SELECT created, audiences FROM mytable_data LATERAL VIEW
explode(audiences) adtab AS adcol WHERE audiences['key']=='tg_loh' LIMIT 10"

 sequence of commands on the spark-shell (a different query and output) is:

------ code snippet -----
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val parquetFile2 =
scala> parquetFile2.registerTempTable("mytable_data")
scala> hiveContext.cacheTable("mytable_data")

scala> hiveContext.sql("SELECT audiences['key'], audiences['value'] FROM
mytable_data LATERAL VIEW explode(audiences) adu AS audien LIMIT

------ output ---------

gives a list of nulls. I can see that there is data when I just do the following (output is

------ code snippet -----
scala> hiveContext.sql("SELECT audiences FROM mytable_data LATERAL VIEW
explode(audiences) tablealias AS colalias LIMIT

---- output --------------
[Map(tg_loh -> [0.0,1,Map()], tg_co -> [0.0,1,Map(tg_co_petrol -> 0.0)], tg_wall
-> [0.0,1,Map(tg_wall_poi -> 0.0)],  ...

Q1) What am I doing wrong?
Q2) How can I use 'where' in the query to filter on specific values?

What works:
   Queries that filter and select on fields with simple AVRO data types, such
as long or string, work fine.


 I hope the explanation makes sense. Thanks.
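
For readers following Q1 and Q2 above, here is a minimal sketch in plain Scala collections (no Spark involved) of what the two query shapes appear to do. It rests on the usual HiveQL semantics: audiences['key'] looks up the map entry whose key is the literal string 'key', while explode on a map produces one row per entry with two columns, here aliased key and value. The case classes Audience and Record and all sample values are hypothetical stand-ins for the printed schema, not data from the thread.

```scala
// Hypothetical stand-ins for the printed schema; all values are made up.
case class Audience(percent: Float, cluster: Int)
case class Record(created: Long, audiences: Map[String, Audience])

val records = Seq(
  Record(1L, Map("tg_loh" -> Audience(0.5f, 1), "tg_co" -> Audience(0.25f, 2))),
  Record(2L, Map("tg_wall" -> Audience(0.75f, 3)))
)

// Models audiences['key']: a lookup of the literal string "key",
// which no row contains, so every result is empty (null in SQL).
val literalLookups: Seq[Option[Audience]] = records.map(_.audiences.get("key"))

// Models LATERAL VIEW explode(audiences) f AS key, value:
// one output row per map entry, carrying the parent row's columns along.
val exploded: Seq[(Long, String, Audience)] =
  for {
    r <- records
    (k, v) <- r.audiences.toSeq
  } yield (r.created, k, v)

// Models WHERE key = 'tg_loh' followed by SELECT created, key, value.percent.
val filtered: Seq[(Long, String, Float)] =
  exploded.collect { case (created, "tg_loh", v) => (created, "tg_loh", v.percent) }

filtered.foreach(println)
```

If the same reasoning carries over to HiveContext, adding WHERE key = 'tg_loh' after the LATERAL VIEW clause of the query suggested in the reply would be the corresponding filter. That clause is an assumption here and was not tested in this thread.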
