spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kohki Nishio <>
Subject Ordering pushdown for Spark Datasources
Date Sun, 04 Apr 2021 22:53:58 GMT

I'm trying to use Spark SQL as a log analytics solution. As you might
guess, for most use-cases, data is ordered by timestamp and the amount of
data is large.

If I want to show the first 100 entries (ordered by timestamp) for a given
condition, Spark Executor has to scan the whole entries to select the
top 100 by timestamp.

I understand this behavior, however, some of the data sources such as JDBC
or Lucene can support ordering and in this case, the target data is large
(a couple of millions). I believe it is possible to pushdown orderings to
the data sources and make the executors return early.

Here's my ask, I know Spark doesn't do such a thing... but I'm looking for
any pointers, references which might be relevant to this, or .. any random
idea would be appreciated. So far I found, some folks are working on
aggregation pushdown (SPARK-22390), but I don't see any current activity
for ordering pushdown.


Kohki Nishio

View raw message