spark-issues mailing list archives

From "songxin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15547) Encoder validation is too strict for inner nested structs
Date Tue, 21 Feb 2017 11:26:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875823#comment-15875823 ]

songxin commented on SPARK-15547:
---------------------------------

Yes, you are absolutely right!

> Encoder validation is too strict for inner nested structs
> ---------------------------------------------------------
>
>                 Key: SPARK-15547
>                 URL: https://issues.apache.org/jira/browse/SPARK-15547
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
>
> The following Spark shell snippet reproduces this issue:
> {code}
> case class ClassData(a: String, b: Long)
> case class NestedStruct(f: ClassData)
> val data = Seq(
>   "{'f': {'b': 1, 'a': 'foo', 'c': 'extra'}}",
>   "{'f': {'b': 2, 'a': 'bar', 'c': 'extra'}}"
> )
> val df = spark.read.json(sc.parallelize(data))
> val ds = df.as[NestedStruct]
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Try to map struct<a:string,b:bigint,c:string> to Tuple2, but failed as the number of fields does not line up.
>  - Input schema: struct<f:struct<a:string,b:bigint,c:string>>
>  - Target schema: struct<f:struct<a:string,b:bigint>>;
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:267)
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$validate$3.apply(ExpressionEncoder.scala:311)
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$validate$3.apply(ExpressionEncoder.scala:307)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:307)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
>   at org.apache.spark.sql.Dataset.as(Dataset.scala:360)
>   ... 39 elided
> {noformat}
> We can work around this exception by removing the inner nested field {{c}}:
> {code}
> val data = Seq(
>   "{'f': {'b': 1, 'a': 'foo'}}",
>   "{'f': {'b': 2, 'a': 'bar'}}"
> )
> {code}
> A Dataset is just a "view" of its underlying logical plan. The encoder and the logical plan of a Dataset may differ in the number and/or order of their fields. This is OK as long as all fields referenced by the encoder exist in the logical plan, because encoder de/serializer expressions are resolved by name rather than by ordinal.
> However, as illustrated above, the number of fields in an inner nested struct must match the number of fields in the corresponding case class/Java bean, which is too strict a restriction.
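
For readers on an affected version who cannot drop {{c}} from the source data, one possible alternative is to rebuild the inner struct with only the fields the case class declares before calling as[NestedStruct]. The sketch below is not part of the original issue. It assumes a Spark 2.x spark-shell session, where `spark` and `sc` are predefined, and the name `trimmed` is introduced here purely for illustration. Since the inner field count then matches, the validation described above should no longer fail.

```scala
// Sketch for spark-shell on Spark 2.x (assumption: `spark` and `sc` are predefined).
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._

case class ClassData(a: String, b: Long)
case class NestedStruct(f: ClassData)

// Same input as the reproduction: the inner struct carries an extra field `c`.
val data = Seq(
  "{'f': {'b': 1, 'a': 'foo', 'c': 'extra'}}",
  "{'f': {'b': 2, 'a': 'bar', 'c': 'extra'}}"
)
val df = spark.read.json(sc.parallelize(data))

// Rebuild the inner struct with only the fields that ClassData declares,
// so the inner struct has exactly two fields before the encoder is validated.
// `trimmed` is a hypothetical name used only in this sketch.
val trimmed = df.select(
  struct(col("f.a").as("a"), col("f.b").as("b")).as("f")
)

val ds = trimmed.as[NestedStruct]
```

The explicit aliases keep the inner field names as `a` and `b` regardless of how the Spark version in use names struct members, which matters because the encoder resolves fields by name.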



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


