spark-user mailing list archives

From <>
Subject Serialization or internal functions?
Date Sat, 04 Apr 2020 18:07:11 GMT
Dear Community, 


Recently I had to solve the following problem: "for every entry of a
Dataset[String], concatenate a constant value". To solve it, I used built-in
functions:


val data = Seq("A","b","c").toDS


scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit("

== Physical Plan ==

LocalTableScan [valueconcat#161]


As an alternative, a much simpler version of the program uses map, but
it adds a serialization step that does not appear in the plan for the
version above:


scala> data.map(e => s"$e concat").explain

== Physical Plan ==

*(1) SerializeFromObject [staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0,
java.lang.String, true], true, false) AS value#92]

+- *(1) MapElements <function1>, obj#91: java.lang.String

   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String

      +- LocalTableScan [value#12]
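
To reproduce the comparison outside the shell, a minimal self-contained sketch could look like the following. The local SparkSession setup, the object name ConcatPlans, the trailing select, and the literal " concat" passed to lit (borrowed from the map version) are assumptions and are not taken from the session above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

object ConcatPlans {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat-plans")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("A", "b", "c").toDS()

    // Built-in functions: the concat is an expression Spark can optimize.
    // Assumptions: " concat" as the constant, and a select to keep only
    // the new column.
    data
      .withColumn("valueconcat", concat(col(data.columns.head), lit(" concat")))
      .select("valueconcat")
      .explain()

    // Typed map: the lambda operates on JVM Strings, so each row is
    // deserialized to an object and the result is serialized back.
    data.map(e => s"$e concat").explain()

    spark.stop()
  }
}
```

Printing both plans side by side makes it easier to check whether the DeserializeToObject and SerializeFromObject steps show up only in the map version on a given Spark release.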


Is preferring the built-in functions here over-optimization, or is it the
right way to go?


As a follow-up, is there a better API to get the one and only column
available in a Dataset[String] when using built-in functions?
"col(data.columns.head)" works, but it is not ideal.

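One possible direction, sketched here rather than offered as a definitive answer: the map plan above lists the column as value#12, which suggests the single column of a Dataset[String] can be referenced by the name "value". The sketch assumes that default name holds; the object name SingleColumnAccess and the " concat" literal are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

object SingleColumnAccess {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-column-access")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("A", "b", "c").toDS()

    // Assumption: the column of a Dataset[String] is named "value".
    val byName = col("value")   // unresolved reference by name
    val viaDataset = data("value") // reference tied to this Dataset
    val viaInterpolator = $"value" // needs spark.implicits._

    data.select(concat(byName, lit(" concat")).as("valueconcat")).show()
    data.select(concat(viaDataset, lit(" concat")).as("valueconcat")).show()
    data.select(concat(viaInterpolator, lit(" concat")).as("valueconcat")).show()

    spark.stop()
  }
}
```

Hard-coding the name is shorter than col(data.columns.head), but it relies on the default naming, so data.columns.head remains the safer choice if the column may have been renamed.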

