Something like this, maybe?


import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val df: DataFrame = ???
val spark = df.sparkSession
val rddOfInternalRows = df.queryExecution.toRdd.mapPartitions(iter => {
  log.info("Test")
  iter
})
val attributes = df.schema.map(f => 
   AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
)
val logicalPlan = LogicalRDD(attributes, rddOfInternalRows)(spark)
val rowEncoder = RowEncoder(df.schema)
val resultingDataFrame = new Dataset[Row](spark, logicalPlan, rowEncoder)
resultingDataFrame

On Mon, Aug 14, 2017 at 2:15 PM, Lukas Bradley <lukasbradley@gmail.com> wrote:
We have had issues with gathering status on long running jobs.  We have attempted to draw parallels between the Spark UI/Monitoring API and our code base.  Due to the separation between code and the execution plan, even having a guess as to where we are in the process is difficult.  The Job/Stage/Task information is too abstracted from our code to be easily digested by non Spark engineers on our team.

Is there a "hook" to which I can attach a piece of code that is triggered when a point in the plan is reached?  This could be when a SQL command completes, or when a new DataSet is created, anything really...  

It seems Dataset.checkpoint() offers an excellent snapshot position during execution, but I'm concerned I'm short-circuiting the optimal execution of the full plan.  I really want these trigger functions to be completely independent of the actual processing itself.  I'm not looking to extract information from a Dataset, RDD, or anything else.  I essentially want to write independent output for status.  

If this doesn't exist, is there any desire on the dev team for me to investigate this feature?

Thank you for any and all help.