spark-user mailing list archives

From Chetan Khatri <chetan.opensou...@gmail.com>
Subject dropDuplicate on timestamp based column unexpected output
Date Thu, 04 Apr 2019 04:51:53 GMT
Hello Spark users,

I am calling dropDuplicates on a DataFrame read from a large Parquet file
on HDFS, deduplicating on a timestamp-based column. Every time I run it, a
different set of rows is dropped among the rows that share the same timestamp.

What I tried, and what worked:

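import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Keep only the newest row (by update_time) for each invoice_id.
val wSpec = Window.partitionBy($"invoice_id").orderBy($"update_time".desc)
val irqDistinctDF = irqFilteredDF.withColumn("rn", row_number().over(wSpec))
  .where($"rn" === 1).drop("rn", "update_time")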

But this is very slow.
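
One alternative that may be worth comparing against the window approach is a
plain aggregation: group by invoice_id and take max() over a struct whose
first field is update_time, so the newest row wins. This is only a sketch under
assumptions, not a measured fix. The object name LatestPerInvoice, the helper
latestPerInvoice, the "status" column, and the sample rows are all hypothetical.
It also assumes every remaining column has an orderable type (no map columns),
since max() has to compare the whole struct.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

object LatestPerInvoice {

  // Keeps, for each invoice_id, the row with the greatest update_time.
  // Structs compare field by field, so update_time goes first; rows with
  // equal update_time are then ordered by the remaining columns.
  def latestPerInvoice(df: DataFrame): DataFrame = {
    val others = df.columns
      .filterNot(c => c == "invoice_id" || c == "update_time")
      .map(col)
    val packed = struct((col("update_time") +: others): _*)

    df.groupBy(col("invoice_id"))
      .agg(max(packed).as("latest"))
      .select(col("invoice_id"), col("latest.*")) // unpack the winning row
      .drop("update_time")                        // mirrors the original drop
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-invoice")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the real input is the Parquet file on HDFS.
    val df = Seq(
      (1, Timestamp.valueOf("2019-04-01 10:00:00"), "open"),
      (1, Timestamp.valueOf("2019-04-02 10:00:00"), "paid"),
      (2, Timestamp.valueOf("2019-04-01 09:00:00"), "open")
    ).toDF("invoice_id", "update_time", "status")

    latestPerInvoice(df).show()
    spark.stop()
  }
}
```

Because ties on update_time are broken by the remaining columns, the result
should be the same on every run. row_number() ordered by update_time alone
gives no such guarantee when two rows in a partition share a timestamp.
Whether the aggregation is actually faster on a given dataset would have to be
measured.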

Can someone please shed some light on this?

Thanks
