spark-dev mailing list archives

From Vishnu Kumar <mr.visku...@gmail.com>
Subject Re: Spark SQL sort by and collect by in multiple partitions
Date Thu, 03 Sep 2015 06:24:04 GMT
Hi,

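Yes, this is the intended behavior. "ORDER BY" guarantees a total order
across the whole output, while "SORT BY" only guarantees the order within
each partition. collect() returns the partitions one after another, so with
"SORT BY" the collected result is sorted only partition by partition.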
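
To make the difference concrete, here is a rough sketch in plain Python. It
is not Spark code and does not use the Spark API; it only models the two
guarantees on ordinary lists. The function names (sort_by, order_by, collect)
and the starting partition layout are assumptions chosen to match the
1..12 example from the question, and splitting the totally ordered data into
equal contiguous chunks is a simplification of how a real engine would range
partition it.

```python
from itertools import chain


def sort_by(partitions, reverse=False):
    """Model of SORT BY: sort each partition on its own, no data movement."""
    return [sorted(p, reverse=reverse) for p in partitions]


def order_by(partitions, reverse=False):
    """Model of ORDER BY: impose a total order, then split the result into
    contiguous ranges so that partition i holds values that all come
    before those in partition i + 1."""
    values = sorted(chain.from_iterable(partitions), reverse=reverse)
    n = len(partitions)
    size = -(-len(values) // n)  # ceiling division
    return [values[i * size:(i + 1) * size] for i in range(n)]


def collect(partitions):
    """Model of collect(): concatenate partitions in partition order."""
    return list(chain.from_iterable(partitions))


# Hypothetical layout of 1..12 over 4 partitions, matching the question.
partitions = [[4, 8, 12], [3, 7, 11], [2, 6, 10], [1, 5, 9]]

sorted_within = sort_by(partitions, reverse=True)
totally_ordered = order_by(partitions, reverse=True)

print(collect(sorted_within))    # [12, 8, 4, 11, 7, 3, 10, 6, 2, 9, 5, 1]
print(collect(totally_ordered))  # [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

In this model sort_by never moves a value to another partition, which is why
the collected output is only sorted in runs of three.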


Vishnu

On Thu, Sep 3, 2015 at 10:49 AM, Niranda Perera <niranda.perera@gmail.com>
wrote:

> Hi all,
>
> I have been using SORT BY and ORDER BY in Spark SQL and I observed the
> following.
>
> When I use SORT BY and collect the results, they come back sorted
> partition by partition.
> Example:
> if we have 1, 2, ... , 12 in 4 partitions and I want to sort them in
> descending order,
> partition 0 (p0) would have 12, 8, 4
> p1 = 11, 7, 3
> p2 = 10, 6, 2
> p3 = 9, 5, 1
>
> so collect() would return 12, 8, 4, 11, 7, 3, 10, 6, 2, 9, 5, 1
>
> BUT when I use ORDER BY and collect the results:
> p0 = 12, 11, 10
> p1 =  9, 8, 7
> .....
> so collect() would return 12, 11, .., 1 which is the desirable result.
>
> is this the intended behavior of SORT BY and ORDER BY or is there
> something I'm missing?
>
> cheers
>
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> https://pythagoreanscript.wordpress.com/
>
