spark-issues mailing list archives

From "David Lo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
Date Tue, 16 Jul 2019 15:15:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886192#comment-16886192 ]

David Lo edited comment on SPARK-19477 at 7/16/19 3:14 PM:
-----------------------------------------------------------

-[~cloud_fan] Could you please elaborate on your suggested workaround? `ds.map(equality)` isn't compiling for me. Like the original poster, I am interested in dropping columns that are not defined in a case class when doing a DataFrame to Dataset[case class] conversion.-

 

Edit: Never mind; I saw Christophe's comment in the ticket history about using `.map(identity)`, and that seems to work.
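
For anyone landing here, the workaround can be sketched as a spark-shell session (a minimal sketch against Spark 2.x, reusing the case class F from the report below; the idea is that `.map(identity)` forces each row through F's encoder, which drops columns the case class does not define):

{code}
scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a", "b", "c", "x")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> // df.as[F] alone keeps c4 in the underlying plan (the behavior reported
scala> // in this ticket); mapping through identity re-encodes each row as an F
scala> val ds = df.as[F].map(identity)
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 1 more field]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
+---+---+---+
{code}

The trade-off is that `.map(identity)` adds a serialize/deserialize step per row; an alternative with the same effect is selecting only the case-class columns before calling `.as[F]`.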


was (Author: davlo):
[~cloud_fan] Could you please elaborate on your suggested workaround? `ds.map(equality)`
isn't compiling for me. Like the original poster, I am interested in dropping columns that
are not defined in a case class when doing a DataFrame to Dataset[case class] conversion.

 

> [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19477
>                 URL: https://issues.apache.org/jira/browse/SPARK-19477
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Don Drake
>            Priority: Major
>              Labels: bulk-closed
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> +---+---+---+
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

