spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Creating Spark buckets that Presto / Athena / Hive can leverage
Date Mon, 17 Jun 2019 05:27:45 GMT
Hi Daniel,

Not quite sure about this, but does the Glue Data Catalog support bucketing yet?
You might want to find that out first.


On Sat, Jun 15, 2019 at 1:30 PM Daniel Mateus Pires <> wrote:

> Hi there!
> I am trying to optimize joins on data created by Spark, so I'd like to
> bucket the data to avoid shuffling.
> I write to immutable partitions every day by writing data to a local
> HDFS and then copying it to S3. Is there a combination of bucketBy
> options and DDL that I can use so that Presto/Athena JOINs leverage the
> special layout of the data?
> e.g.
> CREATE EXTERNAL TABLE ...(on Presto/Athena)
> df.write.bucketBy(...).partitionBy(...). (in spark)
> then copy this data to S3 with s3-dist-cp
> then MSCK REPAIR TABLE (on Presto/Athena)
> Daniel
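
For reference, the workflow quoted above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a confirmed recipe: the helper names `athena_bucketed_ddl` and `write_bucketed`, the table `events`, its columns, and the `s3://example-bucket/...` location are all hypothetical. Two caveats apply. First, `bucketBy` only works together with `saveAsTable`; a plain `save()` or `parquet()` call rejects it. Second, as far as I know Spark hashes and names bucket files differently from Hive, so declaring `CLUSTERED BY` on Spark-written files may not give Presto/Athena a usable layout and should be verified before relying on it.

```python
def athena_bucketed_ddl(table, columns, partition_cols, bucket_cols,
                        num_buckets, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement with a CLUSTERED BY clause.

    columns and partition_cols are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    buckets = ", ".join(bucket_cols)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"CLUSTERED BY ({buckets}) INTO {num_buckets} BUCKETS\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def write_bucketed(df, table, path, partition_cols, bucket_cols,
                   num_buckets, mode="append"):
    """Write a Spark DataFrame as a partitioned, bucketed Parquet table.

    df is assumed to be a pyspark.sql.DataFrame. bucketBy requires
    saveAsTable, so the output path is passed as the "path" option.
    """
    (
        df.write
        .partitionBy(*partition_cols)
        .bucketBy(num_buckets, *bucket_cols)
        .sortBy(*bucket_cols)
        .format("parquet")
        .option("path", path)
        .mode(mode)
        .saveAsTable(table)
    )


# Hypothetical table definition, for illustration only.
print(athena_bucketed_ddl(
    table="events",
    columns=[("user_id", "bigint"), ("payload", "string")],
    partition_cols=[("dt", "string")],
    bucket_cols=["user_id"],
    num_buckets=32,
    location="s3://example-bucket/events/",
))
```

The bucket count and bucket columns in the DDL have to match what was passed to `bucketBy`; after copying the files to S3, `MSCK REPAIR TABLE` would register the new partitions as in the quoted steps.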
