spark-user mailing list archives

From Antoine Bonnin <antoine.bon...@c-ways.com>
Subject Optimize sort merge join
Date Sat, 27 Jan 2018 14:17:08 GMT
Hi all,

I'm relatively new to Spark, and I'm struggling to optimize a sort-merge
join on data read from Parquet.

My work consists of computing statistics on purchases for a retail company.
For example, I have to calculate the mean purchase over a period, for a
segment of products and a segment of clients.

This information is spread across different tables, so I have to join them:

   1. a client table : ID_CLIENT, CLIENT_SEG
   2. a ticket table : ID_CLIENT, ID_TICKET, DATE
   3. a detailed ticket table : ID_CLIENT, ID_TICKET, ID_PRODUCT,
   PRODUCT_SEG

To improve speed, I tried saving the Parquet files after a hash repartition
on the join keys, but reloading those Parquet files still requires a lot of
shuffling for the sort-merge join.
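
To illustrate the idea behind this attempt, here is a minimal pure-Python
sketch (not Spark code) of a sort-merge join over co-partitioned data. The
names `NUM_BUCKETS`, `bucket_of`, `partition_by_key`, `merge_join` and
`bucketed_join`, as well as the sample rows, are hypothetical and only serve
the illustration. It shows that when both tables are split with the same
hash function and the same number of buckets, and sorted inside each bucket,
matching keys always end up in the same bucket, so the join can run bucket
by bucket without moving data again. As far as I understand, a plain Parquet
write to a path does not record this layout, which would explain the shuffle
on reload; Spark's `DataFrameWriter.bucketBy` combined with `sortBy` and
`saveAsTable` is the feature meant to persist it.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; must be identical for both tables


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Deterministic bucket for an integer key (stand-in for a hash partitioner)."""
    return key % num_buckets


def partition_by_key(rows, num_buckets=NUM_BUCKETS):
    """Split rows (tuples whose first field is the key) into sorted buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}


def merge_join(left, right):
    """Inner sort-merge join of two lists of tuples sorted on their first field."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append(left[i] + right[k][1:])
                i += 1
            j = j_end
    return out


def bucketed_join(left_rows, right_rows):
    """Join bucket by bucket; no row ever needs to leave its bucket."""
    left_b = partition_by_key(left_rows)
    right_b = partition_by_key(right_rows)
    joined = []
    for b in sorted(left_b.keys() & right_b.keys()):
        joined.extend(merge_join(left_b[b], right_b[b]))
    return joined


# Hypothetical sample data: (ID_CLIENT, CLIENT_SEG) and (ID_CLIENT, ID_TICKET, DATE)
clients = [(1, "A"), (2, "B"), (5, "A")]
tickets = [(5, 100, "2018-01-03"), (1, 101, "2018-01-05"),
           (1, 102, "2018-01-09"), (7, 103, "2018-01-10")]
result = bucketed_join(clients, tickets)
```

The key design point is that the partitioning step happens once per table;
once the bucketed, sorted layout is known to the engine, every later join on
the same key can reuse it.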

How can I shuffle the data once and for all to speed up these queries?

Thanks,


*Antoine Bonnin*

Data scientist

C-Ways
*The smart way to your clients*

Mail: antoine.bonnin@c-ways.com
Tel.: 06 65 37 99 60
Web: www.c-ways.com <http://www.c-ways.com>
Twitter: @cways_fr <https://twitter.com/cways_fr>
