spark-user mailing list archives

From Kohki Nishio <tarop...@gmail.com>
Subject Re: Ordering pushdown for Spark Datasources
Date Tue, 06 Apr 2021 06:17:48 GMT
The log data is stored in Lucene, and I have a custom data source to access
it. For example, if the condition is log-level = INFO, that brings in a couple
of million records per partition, and hundreds of partitions are involved in
a query. Spark has to go through all of those entries just to show the first
100, and that is the problem. But if Spark were aware of the datasource's
ordering support, it would only need to fetch 100 entries per partition...
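
To make the claim concrete: if every partition can already return its rows
sorted by timestamp, the global top 100 is guaranteed to lie inside the union
of each partition's local top 100, so fetching only 100 rows per partition is
sufficient. A minimal plain-Python sketch (not Spark code; toy timestamps)
illustrating that invariant:

```python
import random

# Hypothetical example: four "partitions" of log entries,
# each already sorted by timestamp (as a Lucene index could provide).
partitions = [sorted(random.sample(range(1_000_000), 5000)) for _ in range(4)]

N = 100

# Global top-N (earliest timestamps) computed by scanning everything:
full_scan = sorted(t for part in partitions for t in part)[:N]

# Taking only the first N rows from each sorted partition gives the same
# result, because no row beyond a partition's local top-N can make the cut:
pruned = sorted(t for part in partitions for t in part[:N])[:N]

assert full_scan == pruned
```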

I'm wondering if Spark could then do a merge-sort across the sorted partition
streams to make this type of query faster.
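
The merge-sort idea can be sketched with Python's standard library (this is
plain Python over toy iterators, not Spark code): because the merge is lazy,
asking for the first N results pulls only as many rows from each partition as
are actually needed.

```python
import heapq
from itertools import islice

# Hypothetical per-partition iterators, each already sorted by timestamp
# (as an ordering-aware datasource could provide them).
partitions = [
    iter([1, 4, 9, 12]),
    iter([2, 3, 10, 11]),
    iter([5, 6, 7, 8]),
]

# heapq.merge lazily merge-sorts the streams; islice stops the merge as soon
# as the first N rows have been produced, so the tails are never consumed.
first_n = list(islice(heapq.merge(*partitions), 5))
# first_n == [1, 2, 3, 4, 5]
```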

Thanks
-Kohki

On Mon, Apr 5, 2021 at 1:02 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi,
>
> A couple of clarifications:
>
>
>    1. How is the log data stored, say on HDFS?
>    2. You stated you want to show the first 100 entries for a given
>    condition. Is that condition itself a predicate?
>
> There are articles for predicate pushdown in Spark. For example check
>
> Using Spark predicate push down in Spark SQL queries | DSE 6.0 Dev guide
> (datastax.com)
> <https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html#:~:text=A%20predicate%20push%20down%20filters,WHERE%20clauses%20to%20the%20database.>
>
> Although "large" is a relative term, a couple of million rows is not that
> large. You can also try setting most of the following in spark-sql:
>
> spark-sql> set spark.sql.adaptive.enabled = true;
> spark.sql.adaptive.enabled      true
> Time taken: 0.011 seconds, Fetched 1 row(s)
> spark-sql> set hive.optimize.ppd = true;
> hive.optimize.ppd       true
> Time taken: 0.011 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.cbo.enabled = true;
> spark.sql.cbo.enabled   true
> Time taken: 0.01 seconds, Fetched 1 row(s)
>
> Spark SQL is influenced by Hive SQL, so you can also leverage the pushdown
> support in Hive SQL.
>
> Check this link as well
>
> Spark SQL Performance Tuning by Configurations — SparkByExamples
> <https://sparkbyexamples.com/spark/spark-sql-performance-tuning-configurations/>
>
> HTH
>
>
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 4 Apr 2021 at 23:55, Kohki Nishio <taroplus@gmail.com> wrote:
>
>> Hello,
>>
>> I'm trying to use Spark SQL as a log analytics solution. As you might
>> guess, for most use-cases, data is ordered by timestamp and the amount of
>> data is large.
>>
>> If I want to show the first 100 entries (ordered by timestamp) for a
>> given condition, the Spark executor has to scan all the entries to select
>> the top 100 by timestamp.
>>
>> I understand this behavior; however, some data sources, such as JDBC or
>> Lucene, can support ordering, and in this case the target data is large (a
>> couple of million rows). I believe it is possible to push down orderings
>> to the data sources and have the executors return early.
>>
>> Here's my ask: I know Spark doesn't do such a thing, but I'm looking for
>> any pointers or references which might be relevant to this, and any random
>> idea would be appreciated. So far I have found that some folks are working
>> on aggregation pushdown (SPARK-22390), but I don't see any current
>> activity on ordering pushdown.
>>
>> Thanks
>>
>>
>> --
>> Kohki Nishio
>>
>

-- 
Kohki Nishio
