spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matteo Cossu <elco...@gmail.com>
Subject Re: Help explaining explain() after DataFrame join reordering
Date Tue, 05 Jun 2018 08:38:20 GMT
Hello,

as explained here
<https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html>,
the join order can be changed by the optimizer. The difference introduced
in Spark 2.2 is that the reordering is based on statistics instead of
heuristics, that can appear "random" and for some cases decrease the
performances.
If you want to control more the join order you can define your own Rule, an
example here.
<http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/>

Best,

Matteo


On 1 June 2018 at 18:31, Mohamed Nadjib MAMI <mohamed.nadjib.mami@gmail.com>
wrote:

> Dear Sparkers,
>
> I'm loading into DataFrames data from 5 sources (using official
> connectors): Parquet, MongoDB, Cassandra, MySQL and CSV. I'm then joining
> those DataFrames in two different orders.
> - mongo * cassandra * jdbc * parquet * csv (random order).
> - parquet * csv * cassandra * jdbc * mongodb (optimized order).
>
> The first follows a random order, whereas the second I'm deciding based on
> some optimization techniques (can provide details for the interested
> readers or if needed here).
>
> After the evaluation on increasing sizes of data, the optimization
> techniques I developed didn't improve the performance very noticeably. I
> inspected the Logical/Physical plan of the final joined DataFrame (using
> `explain(true)`). The 1st order was respected, whereas the 2nd order, it
> turned out, wasn't respected, and MongoDB was queried first.
>
> However, that what it seemed to me, I'm not quite confident reading the
> Plans (returned using explain(true)). Could someone help explaining the
> `explain(true)` output? (pasted in this gist
> <https://gist.github.com/mnmami/387c24de5dca86c3b8efe170965e9dcf>). Is
> there a way we could enforce the given order?
>
> I'm using Spark 2.1, so I think it doesn't include the new cost-based
> optimizations (introduced in Spark 2.2).
>
> *Regards, Grüße, **Cordialement,** Recuerdos, Saluti, προσρήσεις, 问候,
> تحياتي.*
> *Mohamed Nadjib Mami*
> *Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University*
> *About me! <http://mohamednadjibmami.com>*
> *LinkedIn <http://fr.linkedin.com/in/mohamednadjibmami/>*
>

Mime
View raw message