spark-user mailing list archives

From: Pankaj Wahane <pankajwah...@live.com>
Subject: Re: Duplicate rows in windowing functions
Date: Wed, 02 Nov 2016 08:43:16 GMT
Hi All,

I have a telemetry DataFrame that is created using Spark's CSV reader. I am using Spark 2.0.1.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Read a CSV file with the given schema, cache it, and register it as a temp view.
def createViewFromCSV(viewName: String, pathToData: String, schema: StructType): DataFrame = {
  val df = spark.read
    .option("header", "true")
    .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
    .schema(schema)
    .csv(pathToData)
  df.cache()
  df.createOrReplaceTempView(viewName)
  df
}

val telemetry = createViewFromCSV("telemetry",
  dataFolder + "telemetry.csv",
  StructType(Array(
    StructField("datetime", TimestampType, true),
    StructField("machineID", IntegerType, true),
    StructField("volt", DoubleType, true),
    StructField("rotate", DoubleType, true),
    StructField("pressure", DoubleType, true),
    StructField("vibration", DoubleType, true))))


When I display the data,

telemetry.show(20, truncate = false)


it shows the correct data.

I want to compute a moving average for each row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Window over each machine's rows, ordered by time.
val wSpec = Window.partitionBy("machineID").orderBy("datetime")

telemetry
  .withColumn("avg_volt", avg(telemetry("volt")).over(wSpec))
  .show(100, truncate = false)
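
For a trailing moving average over a fixed number of rows (rather than the running average that the default frame of an ordered window produces), the frame can be set with rowsBetween. A minimal sketch, assuming a 3-row trailing window; the avg_volt_3 column name is only for illustration:

// Sketch only: current row plus the two preceding rows per machine.
val wSpec3 = Window
  .partitionBy("machineID")
  .orderBy("datetime")
  .rowsBetween(-2, 0)

telemetry
  .withColumn("avg_volt_3", avg(telemetry("volt")).over(wSpec3))
  .show(100, truncate = false)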


In the output I am getting duplicate records; please notice the dates after the 6th hour in the screenshot below. Please help me with this and let me know the correct way to do it.

[Screenshot attached: query output showing the duplicate rows]
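
A quick way to confirm the duplicates from the view itself (a sketch, using the telemetry view defined above):

import org.apache.spark.sql.functions.col

// Count rows per (machineID, datetime); anything above 1 is a duplicate key.
telemetry
  .groupBy("machineID", "datetime")
  .count()
  .filter(col("count") > 1)
  .show(20, truncate = false)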

Thanks,
Pankaj Wahane