We are currently using Spark 2.2.0 and facing a problem sorting data across multiple partitions.
We have tried below approaches:
- Spark SQL approach:
- var query = "select * from data distribute by " + userid + " sort by " + userid + ", " + time
This query returns correct results in Hive but not in Spark SQL.
- var newDf = data.repartition(col(userid)).orderBy(userid, time)
- var newDf = data.repartition(col(userid)).sortWithinPartitions(userid,time)
But none of the above approaches gives correctly sorted results.
Please suggest what we could do to resolve this.
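To make concrete what we expect, here is a plain-Scala sketch (no Spark; the Record class and sample rows are invented for illustration) of the ordering we are after: group all rows for a userid together, then order the rows within each group by (userid, time), which is what we understand "distribute by userid sort by userid, time" should produce.

```scala
// Invented sample schema and data, for illustration only.
case class Record(userid: String, time: Long, value: String)

val data = Seq(
  Record("u2", 30L, "c"),
  Record("u1", 20L, "b"),
  Record("u1", 10L, "a"),
  Record("u2", 40L, "d")
)

// "distribute by userid": all rows for a user land in the same partition
val partitions: Map[String, Seq[Record]] = data.groupBy(_.userid)

// "sort by userid, time": order rows within each partition only
// (no total ordering across partitions is expected)
val sorted: Map[String, Seq[Record]] =
  partitions.map { case (k, rows) => k -> rows.sortBy(r => (r.userid, r.time)) }

sorted.foreach { case (k, rows) =>
  println(s"$k -> ${rows.map(_.value).mkString(",")}")
}
```

This is the per-user ordering we need Spark to preserve when the data is written out.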
Thanks & Regards,