spark-dev mailing list archives

From "Николай Ижиков" <nizhikov....@gmail.com>
Subject Spark Data Frame. PreSorded partitions
Date Mon, 04 Dec 2017 06:50:36 GMT
Cross-posting from @user.

Hello, guys!

I am working on an implementation of a custom DataSource for the Spark DataFrame API and have a question:

If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside each partition
in my data source.

Is there a built-in option to tell Spark that the data in each partition is already sorted?

It seems that Spark could benefit from partitions that are already sorted,
for example by merging them with a distributed merge sort instead of sorting all the data from scratch.
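
As an illustration of the merge step only, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API; the function name `merge_sorted_partitions` and the sample partitions are hypothetical, and it assumes every partition is already sorted in ascending order by the sort column.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_sorted_partitions(partitions: List[Iterable[int]]) -> Iterator[int]:
    """K-way merge of partitions that are each already sorted ascending.

    Assumption: each input partition is sorted. If that does not hold,
    the output is not guaranteed to be sorted.
    """
    # heapq.merge keeps only the current head of each partition in a heap
    # and yields values lazily, so no partition is re-sorted or fully
    # materialized by the merge itself.
    return heapq.merge(*partitions)


# Hypothetical data: three partitions, each sorted by the data source.
partitions = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
result = list(merge_sorted_partitions(partitions))
print(result)
```

With k partitions and n rows in total, a merge like this costs roughly O(n log k) comparisons, compared with O(n log n) for sorting everything again.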

Does this make sense to you?


On 28.11.2017 18:42, Michael Artz wrote:
> I'm not sure other than retrieving from a hive table that is already sorted.  This sounds
> cool though, would be interested to know this as well
> 
> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
> 
>     Hello, guys!
> 
>     I work on implementation of custom DataSource for Spark Data Frame API and have a
>     question:
>
>     If I have a `SELECT * FROM table1 ORDER BY some_column` query I can sort data inside
>     a partition in my data source.
> 
>     Do I have a built-in option to tell spark that data from each partition already sorted?
> 
>     It seems that Spark can benefit from usage of already sorted partitions.
>     By using of distributed merge sort algorithm, for example.
> 
>     Does it make sense for you?
> 
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

