spark-user mailing list archives

From Som Lima <somplastic...@gmail.com>
Subject Re: Serialization or internal functions?
Date Tue, 07 Apr 2020 18:52:36 GMT
While the SparkSession is running, go to localhost:4040.

Select Stages from the menu.

Select the job you are interested in.


You can select additional metrics there, including the DAG visualisation.
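For reference, the UI at localhost:4040 is only reachable while the driver is alive. A minimal sketch of a local session that runs one job (so Stages and the DAG visualisation have something to show) and then stays up for a minute; the app name and sleep duration are arbitrary:

```scala
import org.apache.spark.sql.SparkSession

object UiDemo {
  def main(args: Array[String]): Unit = {
    // Local session; the driver serves the web UI on localhost:4040.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ui-demo")
      .getOrCreate()

    // Run one job so a stage appears in the Stages tab.
    spark.range(1000000L).selectExpr("sum(id)").show()

    // Keep the driver alive so the UI stays reachable.
    Thread.sleep(60000L)
    spark.stop()
  }
}
```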





On Tue, 7 Apr 2020, 17:14 yeikel valdes, <email@yeikel.com> wrote:

> Thanks for your input Soma, but I am actually looking to understand the
> differences, not only the performance.
>
> ---- On Sun, 05 Apr 2020 02:21:07 -0400 * somplasticllc@gmail.com
> <somplasticllc@gmail.com> * wrote ----
>
> If you want to measure optimisation in terms of time taken, then here is
> an idea :)
>
>
> public class MyClass {
>     public static void main(String[] args)
>             throws InterruptedException {
>         long start = System.currentTimeMillis();
>
>         // replace with your add-column code,
>         // on enough data to measure
>         Thread.sleep(5000);
>
>         long end = System.currentTimeMillis();
>         long timeTaken = end - start;
>
>         System.out.println("Time taken " + timeTaken + " ms");
>     }
> }
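A small variation of the sketch above: System.nanoTime is monotonic (unaffected by wall-clock adjustments), so it is generally a better fit for elapsed-time measurement than System.currentTimeMillis. The class name and sleep duration here are placeholders:

```java
// Sketch: measure elapsed time with the monotonic System.nanoTime clock.
public class ElapsedTime {
    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();

        // replace with the code under test
        Thread.sleep(500);

        // nanoTime returns nanoseconds; convert to milliseconds.
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Time taken: " + elapsedMs + " ms");
    }
}
```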
>
> On Sat, 4 Apr 2020, 19:07 , <email@yeikel.com> wrote:
>
> Dear Community,
>
>
>
> Recently, I had to solve the following problem: "for every entry of a
> Dataset[String], concat a constant value". To solve it, I used
> built-in functions:
>
>
>
> val data = Seq("A","b","c").toDS
>
>
>
> scala> data.withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat"))).select("valueconcat").explain()
>
> == Physical Plan ==
>
> LocalTableScan [valueconcat#161]
>
>
>
> As an alternative, a much simpler version of the program is to use map,
> but it adds a serialization step that does not seem to be present in the
> version above:
>
>
>
> scala> data.map(e=> s"$e concat").explain
>
> == Physical Plan ==
>
> *(1) SerializeFromObject [staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0,
> java.lang.String, true], true, false) AS value#92]
>
> +- *(1) MapElements <function1>, obj#91: java.lang.String
>
>    +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String
>
>       +- LocalTableScan [value#12]
>
>
>
> Is this over-optimization or is this the right way to go?
>
>
>
> As a follow-up, is there a better API to get the one and only column
> available in a Dataset[String] when using built-in functions?
> "col(data.columns.head)" works but it is not ideal.
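For what it's worth, a Dataset[String] built with toDS keeps its data in a single column whose default name is "value", so the column can be referenced by name directly. A sketch, assuming a SparkSession named `spark` is in scope (as in spark-shell); the default column name is worth double-checking on your Spark version:

```scala
import org.apache.spark.sql.functions.{col, concat, lit}
import spark.implicits._

val data = Seq("A", "b", "c").toDS

// The single column of a Dataset[String] is named "value" by default,
// so col("value") can replace col(data.columns.head).
data.withColumn("valueconcat", concat(col("value"), lit(" concat")))
  .select("valueconcat")
  .show()
```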
>
>
>
> Thanks!
>
>
>
