spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Don Drake <dondr...@gmail.com>
Subject Spark 2 - Creating datasets from dataframes with extra columns
Date Thu, 02 Feb 2017 20:46:04 GMT
In 1.6, when you created a Dataset from a Dataframe that had extra columns,
the columns not in the case class were dropped from the Dataset.

For example in 1.6, the column c4 is gone:

scala> case class F(f1: String, f2: String, f3:String)

defined class F


scala> import sqlContext.implicits._

import sqlContext.implicits._


scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i",
"j","z")).toDF("f1", "f2", "f3", "c4")

df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string,
c4: string]


scala> val ds = df.as[F]

ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]


scala> ds.show

+---+---+---+

| f1| f2| f3|

+---+---+---+

|  a|  b|  c|

|  d|  e|  f|

|  h|  i|  j|


This seems to have changed in Spark 2.0 and also 2.1:

Spark 2.1.0:

scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i",
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more
fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more
fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

Is there a way to get a Dataset that conforms to the case class in Spark
2.1.0?  Basically, I'm attempting to use the case class to define an output
schema, and these extra columns are getting in the way.

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake <http://www.MailLaunder.com/>
800-733-2143

Mime
View raw message