spark-user mailing list archives

From boclair <bocl...@gmail.com>
Subject jsonRdd and MapType
Date Fri, 07 Nov 2014 20:41:20 GMT
I'm loading JSON into Spark to create a SchemaRDD via
sqlContext.jsonRDD(..). I'd like some of the JSON fields to be represented
as a MapType rather than a nested StructType, because the keys will be very
sparse.

For example:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> val jsonRdd = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
> val schemaRdd = sqlContext.jsonRDD(jsonRdd)
> schemaRdd.printSchema
root
 |-- attributes: struct (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- location: string (nullable = true)
 |-- key: string (nullable = true)
> schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234], [[null,nyc],4321])


However, this isn't what I want, so I created my own StructType to pass to
the jsonRDD call:

> import org.apache.spark.sql._
> val st = StructType(Seq(
    StructField("key", StringType, false),
    StructField("attributes", MapType(StringType, StringType, false))))
> val jsonRddSt = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
> val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
> schemaRddSt.printSchema
root
 |-- key: string (nullable = false)
 |-- attributes: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
> schemaRddSt.collect
***  Failure  ***
scala.MatchError: MapType(StringType,StringType,false) (of class org.apache.spark.sql.catalyst.types.MapType)
	at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
...

The schema of the SchemaRDD is correct, but it seems the JSON cannot be
coerced to a MapType. At the line shown in the stack trace
(JsonRDD.scala:397), I can see that there is no case for MapType. Is there
something I'm missing? Is this a bug, or a deliberate decision not to
support MapType with JSON?
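
For illustration, here is a minimal plain-Scala sketch of how a pattern match
with no case for a given type produces this kind of scala.MatchError. It is
not Spark's JsonRDD code: DataType, StringType, StructType, MapType and
enforceType below are simplified, hypothetical stand-ins defined only for
this example.

```scala
// Hypothetical, simplified stand-ins for Spark SQL data types (NOT the real classes).
abstract class DataType
case object StringType extends DataType
final case class StructType(fieldNames: Seq[String]) extends DataType
final case class MapType(keyType: DataType, valueType: DataType,
                         valueContainsNull: Boolean) extends DataType

object MatchErrorSketch {
  // Hypothetical coercion function: it handles strings and structs only.
  // There is deliberately no `case MapType(...)`, so passing a MapType
  // falls through every case and throws scala.MatchError at runtime.
  def enforceType(value: Any, desired: DataType): Any = desired match {
    case StringType    => value.toString
    case StructType(_) => value
  }

  def main(args: Array[String]): Unit = {
    val mapType = MapType(StringType, StringType, valueContainsNull = false)
    try {
      enforceType(Map("gender" -> "m"), mapType)
    } catch {
      // The message names the unmatched value and its class.
      case e: MatchError => println(s"MatchError: ${e.getMessage}")
    }
  }
}
```

If the real match is missing a MapType branch in the same way, that would
explain why printSchema succeeds (the schema is only metadata) while collect
fails (the rows are actually converted).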

Thanks,
Brian



