spark-issues mailing list archives

From "Sam hendley (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24202) Separate SQLContext dependency from SparkSession.implicits
Date Tue, 31 Dec 2019 18:56:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006214#comment-17006214
] 

Sam hendley commented on SPARK-24202:
-------------------------------------

I agree that this would be a very valuable change. Was there a reason it was closed without
comment?

> Separate SQLContext dependency from SparkSession.implicits
> ----------------------------------------------------------
>
>                 Key: SPARK-24202
>                 URL: https://issues.apache.org/jira/browse/SPARK-24202
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Gerard Maas
>            Priority: Major
>              Labels: bulk-closed
>
> The current implementation of the implicits in SparkSession passes the currently active
> SQLContext to the SQLImplicits class. This implies that any usage of these (extremely
> helpful) implicits requires the prior creation of a SparkSession instance.
> Usage is typically done as follows:
>  
> {code:java}
> val sparkSession = SparkSession.builder()
>   // ... builder configuration ...
>   .getOrCreate()
> import sparkSession.implicits._
> {code}
>  
> This is OK in user code, but it burdens the creation of library code that uses Spark,
> where static imports for _Encoder_ support are required.
> A simple example would be:
>  
> {code:java}
> abstract class SparkTransformation[In: Encoder, Out: Encoder] {
>   def transform(ds: Dataset[In]): Dataset[Out]
> }
> {code}
>  
> Attempting to compile such code results in the following error:
> {code:java}
> Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String,
> etc) and Product types (case classes) are supported by importing spark.implicits._  Support
> for serializing other types will be added in future releases.{code}
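> Until such a split exists, the usual workaround is to thread a _SparkSession_ into the
> calling code and import its implicits at the point of use. A rough sketch (the wiring
> below is illustrative only, not part of any Spark API):
> {code:java}
> import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
>
> // The library class stays abstract over Encoders...
> abstract class SparkTransformation[In: Encoder, Out: Encoder] {
>   def transform(ds: Dataset[In]): Dataset[Out]
> }
>
> // ...but instantiating it only compiles once a session's implicits are in scope.
> val spark = SparkSession.builder().getOrCreate()
> import spark.implicits._ // supplies Encoder[Int] and Encoder[String]
>
> val render = new SparkTransformation[Int, String] {
>   def transform(ds: Dataset[Int]): Dataset[String] = ds.map(_.toString)
> }
> {code}
> This forces every library entry point to be session-aware even when no RDD or local
> collection conversion is involved.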
> The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities
> that transform an _RDD_ or a local collection into a _Dataset_.
> These are 2 of the 46 implicit conversions offered by this class.
> The request is to separate the two implicit methods that depend on the _SQLContext_
> instance into a separate class:
> {code:java}
> // SQLImplicits, lines 214-229
> /**
>  * Creates a [[Dataset]] from an RDD.
>  *
>  * @since 1.6.0
>  */
> implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
>   DatasetHolder(_sqlContext.createDataset(rdd))
> }
>
> /**
>  * Creates a [[Dataset]] from a local Seq.
>  * @since 1.6.0
>  */
> implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
>   DatasetHolder(_sqlContext.createDataset(s))
> }{code}
> By separating the static methods from the two methods that depend on _sqlContext_
> into separate classes, we could provide static imports for all the other functionality
> and only require the instance-bound implicits for the RDD and collection support
> (which is an uncommon use case these days).
> As this would potentially break the current interface, it might be a candidate for
> Spark 3.0, although there's nothing stopping us from creating a separate hierarchy for
> the static encoders already.
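> A possible shape for the split (purely a sketch; _StaticSQLImplicits_ and
> _SessionBoundImplicits_ are illustrative names, not an existing Spark API):
> {code:java}
> // Importable with no session: the 44 context-free conversions and encoders.
> object StaticSQLImplicits { /* hypothetical home for the static implicits */ }
>
> // Instance-bound: only the two conversions that actually need a SQLContext.
> class SessionBoundImplicits(_sqlContext: SQLContext) {
>   implicit def rddToDatasetHolder[T: Encoder](rdd: RDD[T]): DatasetHolder[T] =
>     DatasetHolder(_sqlContext.createDataset(rdd))
>
>   implicit def localSeqToDatasetHolder[T: Encoder](s: Seq[T]): DatasetHolder[T] =
>     DatasetHolder(_sqlContext.createDataset(s))
> }
> {code}
> Library code could then use `import StaticSQLImplicits._` for _Encoder_ derivation
> without ever constructing a session.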



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

