spark-user mailing list archives

From Daniel Mateus Pires <dmate...@gmail.com>
Subject Creating Spark buckets that Presto / Athena / Hive can leverage
Date Sat, 15 Jun 2019 12:29:35 GMT
Hi there!

I am trying to optimize joins on data created by Spark, so I'd like to
bucket the data to avoid shuffling.

Every day I write to immutable partitions: the data is first written to a
local HDFS and then copied to S3. Is there a combination of bucketBy
options and DDL that I can use so that Presto/Athena JOINs leverage the
special layout of the data?

e.g.
CREATE EXTERNAL TABLE ...(on Presto/Athena)
df.write.bucketBy(...).partitionBy(...)... (in Spark)
then copy this data to S3 with s3-dist-cp
then MSCK REPAIR TABLE (on Presto/Athena)
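
A rough sketch of the steps above, to make the question concrete. The table name, column names, bucket count, and S3 location are all hypothetical, and `df` is assumed to be a PySpark DataFrame. As far as I know, bucketBy only works together with saveAsTable rather than a plain path-based save(). This only shows how the Spark write and the DDL could be made to declare the same layout; it does not settle whether Presto/Athena will actually treat the Spark-written files as valid buckets, which is the open question.

```python
def write_bucketed(df, table, partition_cols, bucket_cols, num_buckets,
                   mode="append"):
    """Spark side: write df as a partitioned, bucketed table.

    df is assumed to be a pyspark.sql.DataFrame; nothing is imported here
    so the helper can be defined without a Spark installation.
    """
    (df.write
        .partitionBy(*partition_cols)
        .bucketBy(num_buckets, *bucket_cols)
        .sortBy(*bucket_cols)
        .mode(mode)
        .saveAsTable(table))


def external_table_ddl(table, columns, partition_cols, bucket_cols,
                       num_buckets, location):
    """Presto/Athena side: Hive-style DDL declaring the same layout.

    columns and partition_cols are lists of (name, type) pairs.
    Parquet is an assumption; the original message names no file format.
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    buckets = ", ".join(bucket_cols)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"CLUSTERED BY ({buckets}) INTO {num_buckets} BUCKETS\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def repair_statement(table):
    """Statement to run after s3-dist-cp so new partitions are registered."""
    return f"MSCK REPAIR TABLE {table}"


if __name__ == "__main__":
    # Hypothetical table: events bucketed by user_id, partitioned by dt.
    print(external_table_ddl(
        "events",
        [("user_id", "bigint"), ("payload", "string")],
        [("dt", "string")],
        ["user_id"],
        16,
        "s3://example-bucket/events/",
    ))
    print(repair_statement("events"))
```

The point of keeping the bucket columns and bucket count as shared arguments is that the Spark write and the DDL cannot silently disagree about them.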

Daniel
