spark-issues mailing list archives

From "Fangshi Li (JIRA)" <>
Subject [jira] [Closed] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
Date Wed, 05 Sep 2018 03:22:00 GMT


Fangshi Li closed SPARK-24256.

> ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
> -------------------------------------------------------------------------------------------
>                 Key: SPARK-24256
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case
classes, tuples, and Java bean classes. Spark's Dataset natively supports these types,
but we find Dataset is not flexible enough for other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although
we can use AvroEncoder to define a Dataset whose type is an Avro GenericRecord or SpecificRecord,
such an Avro-typed Dataset has several limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and Avro types
cannot be fields of that tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's
reduceGroups, since the result is also a tuple.
>  # We cannot combine an Avro SpecificRecord with additional primitive fields
in a case class, which we find is a very common use case.
> The underlying limitation is that ExpressionEncoder does not support ser/de of a Scala case class
or tuple whose subfields are of another user-defined type that has its own Encoder.
> To address this issue, we propose a trait as a contract (between ExpressionEncoder and
any other user-defined Encoder) to enable case class/tuple/java bean to support user-defined
types.
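
The idea of a contract between a composite encoder and a user-defined encoder can be pictured with a minimal sketch in plain Scala. This is not the trait from the patch, which this message does not show, and it does not use Spark at all. FieldEncoder, UserRecord and EncoderDemo are hypothetical names chosen for illustration. The sketch only shows how a tuple encoder might delegate each field to whichever encoder is registered for that field's type.

```scala
// Hypothetical sketch in plain Scala; FieldEncoder, UserRecord and
// EncoderDemo are illustrative names, not Spark or spark-avro APIs.

// Contract shared by built-in and user-defined encoders.
trait FieldEncoder[T] {
  def serialize(value: T): Seq[Any]
  def deserialize(row: Seq[Any]): T
}

object FieldEncoder {
  // Built-in encoder for a primitive type.
  implicit val intEncoder: FieldEncoder[Int] = new FieldEncoder[Int] {
    def serialize(value: Int): Seq[Any] = Seq(value)
    def deserialize(row: Seq[Any]): Int = row.head.asInstanceOf[Int]
  }

  // Tuple encoder that delegates each field to the encoder resolved for
  // that field's type, so a field may be a user-defined type.
  implicit def tuple2Encoder[A, B](implicit
      ea: FieldEncoder[A],
      eb: FieldEncoder[B]
  ): FieldEncoder[(A, B)] = new FieldEncoder[(A, B)] {
    def serialize(value: (A, B)): Seq[Any] =
      Seq(ea.serialize(value._1), eb.serialize(value._2))
    def deserialize(row: Seq[Any]): (A, B) =
      (ea.deserialize(row(0).asInstanceOf[Seq[Any]]),
       eb.deserialize(row(1).asInstanceOf[Seq[Any]]))
  }
}

// Stand-in for a user-defined type (for example an Avro record) that is
// neither a case class nor a tuple and brings its own encoder.
final class UserRecord(val name: String, val age: Int)

object UserRecord {
  implicit val encoder: FieldEncoder[UserRecord] = new FieldEncoder[UserRecord] {
    def serialize(value: UserRecord): Seq[Any] = Seq(value.name, value.age)
    def deserialize(row: Seq[Any]): UserRecord =
      new UserRecord(row(0).asInstanceOf[String], row(1).asInstanceOf[Int])
  }
}

object EncoderDemo {
  // Round-trips a value through whichever encoder the compiler resolves.
  def roundTrip[T](value: T)(implicit enc: FieldEncoder[T]): T =
    enc.deserialize(enc.serialize(value))
}
```

With these definitions, EncoderDemo.roundTrip((new UserRecord("a", 30), 7)) resolves the tuple encoder, which in turn picks up the encoder from UserRecord's companion object. Without a shared contract of this kind, the tuple encoder would have no way to handle the UserRecord field, which is the situation the description above attributes to ExpressionEncoder.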
> With this proposed patch and a minor modification to AvroEncoder, we remove the above-mentioned
limitations with the cluster-default conf
= com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally, and it has been in use for a few quarters.
We would like to upstream it, as we think it is a useful feature that makes Dataset more flexible
for user types.

This message was sent by Atlassian JIRA
