Just realized you only want to keep the first element per group. You can do this without sorting at all, by doing something similar to a min/max operation with a custom aggregator/UDAF, or with reduceGroups on a Dataset. This is also more efficient than sorting the whole DataFrame first.
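The reduce-based approach suggested above avoids the global sort: for each key, keep whichever record has the greatest timestamp. In Spark this would be done with something like df.as[Txn].groupByKey(_.mobileno).reduceGroups(...); since that needs a running SparkSession, the sketch below shows the same semantics with plain Scala collections. The Txn case class and its field names are hypothetical stand-ins for the columns in the question.

```scala
// Keep-latest-per-group without sorting: one reduce pass per group that
// retains the record with the greatest timestamp. This mirrors what
// Dataset.groupByKey(...).reduceGroups(...) would do in Spark.
case class Txn(mobileno: String, customername: String, transactionDate: Long)

def latestPerGroup(txns: Seq[Txn]): Map[String, Txn] =
  txns.groupBy(_.mobileno).map { case (key, group) =>
    // The reduce keeps whichever of the two records carries the later
    // timestamp, so no ordering guarantee from an earlier sort is needed.
    key -> group.reduce((a, b) => if (a.transactionDate >= b.transactionDate) a else b)
  }
```

Because the reduce is associative and commutative, it is safe regardless of how Spark partitions or reorders rows, which is exactly the property first()-after-orderBy lacks.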
Hi All,

I want to do a DataFrame operation to find the rows having the latest timestamp in each group, using the operation below:

    df.orderBy(desc("transaction_date"))
      .groupBy("mobileno")
      .agg(first("customername").as("customername"),
           first("service_type").as("service_type"),
           first("cust_addr").as("cust_abbr"))
      .select("service_type", "mobileno", "cust_addr")

Spark version: 1.6.x

My question is: will Spark guarantee the order during the groupBy if the DataFrame was previously ordered with orderBy, in Spark 1.6.x?

I referred to this blog post: https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ which claims it will work, except in Spark 1.5.1 and 1.5.2.

Could you elaborate a bit on how Spark handles this internally? Also, is it more efficient than using a Window function?

Thanks in advance,
Rabin Banerjee
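For comparison, the Window-function alternative the question mentions would be row_number().over(Window.partitionBy("mobileno").orderBy(desc("transaction_date"))) followed by a filter on row number 1. The sketch below mirrors those semantics with plain Scala collections (partition by key, order descending, keep rank 1), since a real Spark example needs a SparkSession; the Rec case class and field names are hypothetical.

```scala
// Emulates the "latest row per group" Window pattern:
//   row_number() over (partition by mobileno order by transaction_date desc)
//   ... filtered to row_number == 1.
case class Rec(mobileno: String, customername: String, transactionDate: Long)

def latestViaWindow(rows: Seq[Rec]): Seq[Rec] =
  rows.groupBy(_.mobileno)                                   // partitionBy("mobileno")
    .values
    .map(part => part.sortBy(r => -r.transactionDate).head)  // orderBy desc, keep rank 1
    .toSeq
```

Note that unlike first()-after-orderBy-groupBy, the Window version sorts within each partition of the window, so its per-group ordering is well defined; the reduce-based approach avoids even that per-partition sort.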