spark-user mailing list archives

From BB <bagme...@gmail.com>
Subject using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file
Date Thu, 15 Jan 2015 16:52:04 GMT
Hi all,
  Any help on the following is much appreciated.
=================================
Problem:
  On a SchemaRDD read from a Parquet file (the data in the file uses an Avro
model) via the HiveContext:
     I can't figure out how to use 'select' or a 'where' clause to filter
rows on a field that has the Avro Map data type. I want to filter
using a given ('key' : 'value') pair. How can I do this?
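To make the intent concrete, here is the same filter expressed over plain Scala collections; the row shape and names (AudienceValue, Row) are just illustrative stand-ins mirroring the schema printed below, not anything from the actual data:

```scala
// Illustrative only: a row modeled after the schema below, where
// 'audiences' maps an audience key to a (percent, cluster) struct.
case class AudienceValue(percent: Float, cluster: Int)
case class Row(created: Long, audiences: Map[String, AudienceValue])

val rows = Seq(
  Row(1L, Map("tg_loh" -> AudienceValue(0.0f, 1))),
  Row(2L, Map("tg_co"  -> AudienceValue(0.5f, 2)))
)

// The filter I want: keep only rows whose audiences map contains a given key.
val matched = rows.filter(_.audiences.contains("tg_loh"))
```

This is exactly the per-row "does the map contain this key" test I am trying to express in SQL.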

Details:
* the printSchema of the loaded schemaRDD is like so:

------ output snippet -----
    |-- created: long (nullable = false)
    |-- audiences: map (nullable = true)
    |    |-- key: string
    |    |-- value: struct (valueContainsNull = false)
    |    |    |-- percent: float (nullable = false)
    |    |    |-- cluster: integer (nullable = false)
----------------------------- 

* I don't get a result when I try to select on a specific value of the
'audiences' map, like so:
     
      "SELECT created, audiences FROM mytable_data LATERAL VIEW explode(audiences) adtab AS adcol WHERE audiences['key']=='tg_loh' LIMIT 10"

 The sequence of commands in the spark-shell (with a different query and its output) is:

------ code snippet -----
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val parquetFile2 = hiveContext.parquetFile("/home/myuser/myparquetfile")
scala> parquetFile2.registerTempTable("mytable_data")
scala> hiveContext.cacheTable("mytable_data")

scala> hiveContext.sql("SELECT audiences['key'], audiences['value'] FROM mytable_data LATERAL VIEW explode(audiences) adu AS audien LIMIT 3").collect().foreach(println)

------ output ---------
[null,null]
[null,null]
[null,null]
------------------------

This gives a list of nulls. I can see that there is data when I just do the
following (output truncated):

------ code snippet -----
scala> hiveContext.sql("SELECT audiences FROM mytable_data LATERAL VIEW explode(audiences) tablealias AS colalias LIMIT 1").collect().foreach(println)

---- output --------------
[Map(tg_loh -> [0.0,1,Map()], tg_co -> [0.0,1,Map(tg_co_petrol -> 0.0)],
tg_wall -> [0.0,1,Map(tg_wall_poi -> 0.0)],  ...
------------------------

Q1) What am I doing wrong?
Q2) How can I use 'where' in the query to filter on specific values?

What works:
   Queries that filter and select on fields with simple Avro data types,
such as long or string, work fine.
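For example, a query of this shape returns rows as expected in my session (the column name matches the schema above; the threshold value is just illustrative):

```scala
scala> hiveContext.sql("SELECT created FROM mytable_data WHERE created > 0 LIMIT 3").collect().foreach(println)
```

It is only the map-typed 'audiences' field where selecting and filtering break down.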

===========================

 I hope the explanation makes sense. Thanks.
Best,
BB




