spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pankaj Wahane <>
Subject Re: Duplicate rows in windowing functions
Date Wed, 02 Nov 2016 08:43:16 GMT
Hi All,

I have a telemetry dataframe that is created using spark csv. I am using spark 2.0.1

def createViewFromCSV(viewName: String, pathToData: String, schema: StructType): DataFrame
= {
  val df =
    .option("header", "true")
    .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")

val telemetry = createViewFromCSV("telemetry",
  dataFolder + "telemetry.csv",
    StructField("datetime", TimestampType, true),
    StructField("machineID", IntegerType, true),
    StructField("volt", DoubleType, true),
    StructField("rotate", DoubleType, true),
    StructField("pressure", DoubleType, true),
    StructField("vibration", DoubleType, true))))

when I display data,truncate = false)

It shows correct data.

I want to compute moving average for each row

val wSpec = Window.partitionBy("machineID").orderBy("datetime")

  .withColumn("avg_volt", avg(telemetry("volt")).over(wSpec))
  .show(100, truncate = false)

In output I am getting duplicate records – please notice the date after 6th hour. Please
help me with this and let me know the correct way to do this.


Pankaj Wahane
View raw message