spark-user mailing list archives

From <em...@yeikel.com>
Subject Serialization or internal functions?
Date Sat, 04 Apr 2020 18:07:11 GMT
Dear Community, 

 

Recently, I had to solve the following problem: "for every entry of a
Dataset[String], concat a constant value". To solve it, I used built-in
functions:

 

val data = Seq("A","b","c").toDS

 

scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit(" "),lit("concat"))).select("valueconcat").explain()

== Physical Plan ==

LocalTableScan [valueconcat#161]

 

As an alternative, a much simpler version of the program uses map, but
it adds a serialization step (DeserializeToObject / SerializeFromObject)
that does not appear in the plan for the version above:

 

scala> data.map(e=> s"$e concat").explain

== Physical Plan ==

*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#92]

+- *(1) MapElements <function1>, obj#91: java.lang.String

   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String

      +- LocalTableScan [value#12]
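
To reproduce both plans outside of spark-shell, here is a minimal sketch. It assumes a local SparkSession; the object name ConcatPlans, the app name, and the local[*] master are placeholders chosen for illustration and are not part of my actual job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

object ConcatPlans {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session. In spark-shell, `spark` and
    // its implicits are already in scope, so this setup is not needed there.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat-plans")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("A", "b", "c").toDS()

    // Version 1: built-in functions only, as in the first snippet.
    val viaFunctions = data
      .withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat")))
      .select("valueconcat")
    viaFunctions.explain()

    // Version 2: a typed map with a Scala lambda, as in the second snippet.
    val viaMap = data.map(e => s"$e concat")
    viaMap.explain()

    // Both versions should produce the same strings; sort before comparing
    // so the check does not depend on row order.
    val fromFunctions = viaFunctions.as[String].collect().sorted.toSeq
    val fromMap = viaMap.collect().sorted.toSeq
    assert(fromFunctions == fromMap)

    spark.stop()
  }
}
```

The exact expression IDs (such as value#12) will differ from run to run, so only the shape of each plan is comparable.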

 

Is preferring the built-in functions to avoid that step over-optimization, or is it the right way to go?

 

As a follow-up, is there a better API to get the one and only column
available in a Dataset[String] when using built-in functions?
"col(data.columns.head)" works, but it is not ideal.

 

Thanks!

