spark-dev mailing list archives

From "Николай Ижиков" <nizhikov....@gmail.com>
Subject Re: Spark Data Frame. PreSorted partitions
Date Mon, 04 Dec 2017 14:12:24 GMT
Hello, guys.

Thank you for answers!

 > I think pushing down a sort .... could make a big difference.
 > You can, however, propose that it be included in the Data Source API v2.

Jörn, are you talking about this JIRA issue? - https://issues.apache.org/jira/browse/SPARK-15689
Is there any additional documentation I should read before making a proposal?



04.12.2017 14:05, Holden Karau writes:
> I think pushing down a sort (or really more in the case where the data is already naturally
> returned in sorted order on some column) could make a big difference. Probably the simplest
> argument for a lot of time being spent sorting (in some use cases) is the fact it's still
> one of the standard benchmarks.
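
A minimal sketch of the behaviour described above, in a Spark 2.x spark-shell (path and
column name are hypothetical): even if the Parquet files happen to be pre-sorted by
some_column, Spark has no way to know that, so ORDER BY still plans a full Sort on top of
a range-partitioning Exchange.

    // Hypothetical files, assumed to be written pre-sorted by some_column.
    // `spark` is the SparkSession predefined in spark-shell.
    val df = spark.read.parquet("/data/table1")

    // The physical plan still contains a full sort plus a shuffle,
    // along the lines of:
    //   *Sort [some_column ASC NULLS FIRST], true, 0
    //   +- Exchange rangepartitioning(some_column ASC NULLS FIRST, 200)
    //      +- *FileScan parquet [some_column, ...]
    df.orderBy("some_column").explain()
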
> 
> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke <jornfranke@gmail.com> wrote:
> 
>     I do not think that the data source API exposes such a thing. You can, however,
>     propose that it be included in the Data Source API v2.
> 
>     However, there are some caveats, because sorted can mean two different things
>     (weak vs. strict order).
> 
>     Then, is a lot of time really lost because of sorting? The best thing is to not
>     read data that is not needed at all (see min/max indexes in ORC/Parquet, or bloom
>     filters in ORC). What is not read does not need to be sorted. See also predicate
>     pushdown.
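
A hedged sketch of the predicate pushdown mentioned above, again in a spark-shell with a
hypothetical Parquet dataset: the filter is pushed into the scan (visible as PushedFilters
in the plan), and row groups whose min/max statistics cannot satisfy the predicate are
skipped without being read, so they never need sorting either.

    import spark.implicits._

    // Path and column name are hypothetical.
    val filtered = spark.read
      .parquet("/data/table1")
      .filter($"some_column" > 100)

    // explain() should show the predicate inside the scan node, e.g.:
    //   *FileScan parquet [...] PushedFilters:
    //     [IsNotNull(some_column), GreaterThan(some_column,100)]
    filtered.explain()
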
> 
>      > On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
>      >
>      > Cross-posting from @user.
>      >
>      > Hello, guys!
>      >
>      > I work on an implementation of a custom DataSource for the Spark Data Frame API
>      > and have a question:
>      >
>      > If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort data
>      > inside a partition in my data source.
>      >
>      > Do I have a built-in option to tell Spark that the data in each partition is
>      > already sorted?
>      >
>      > It seems that Spark could benefit from already sorted partitions.
>      > By using a distributed merge sort algorithm, for example.
>      >
>      > Does this make sense to you?
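
A sketch of the gap being asked about (names hypothetical, spark-shell assumed): the
public API can produce per-partition order with Dataset.sortWithinPartitions, but there
is no hook for a data source to declare that its partitions already arrive sorted, so a
global ORDER BY still plans a full sort instead of merging the pre-sorted partitions.

    // Hypothetical source; assume each partition is already sorted.
    val df = spark.read.parquet("/data/table1")

    // sortWithinPartitions gives per-partition order: roughly what the
    // custom data source described here could already guarantee.
    val perPartition = df.sortWithinPartitions("some_column")

    // That ordering is not propagated as a global guarantee; the plan
    // below still contains Sort + Exchange rangepartitioning rather
    // than a merge of the already-sorted partitions.
    perPartition.orderBy("some_column").explain()
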
>      >
>      >
>      > 28.11.2017 18:42, Michael Artz writes:
>      >> I'm not sure, other than retrieving from a Hive table that is already sorted.
>      >> This sounds cool though; I would be interested to know this as well.
>      >> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
>      >>    Hello, guys!
>      >>    I work on an implementation of a custom DataSource for the Spark Data Frame
>      >>    API and have a question:
>      >>    If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort
>      >>    data inside a partition in my data source.
>      >>    Do I have a built-in option to tell Spark that the data in each partition is
>      >>    already sorted?
>      >>    It seems that Spark could benefit from already sorted partitions.
>      >>    By using a distributed merge sort algorithm, for example.
>      >>    Does this make sense to you?
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

