spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Creating Spark buckets that Presto / Athena / Hive can leverage
Date Mon, 17 Jun 2019 05:27:45 GMT
Hi Daniel,

I'm not quite sure about this, but does the Glue Data Catalog support bucketing yet?
You might want to find that out first.


Regards,
Gourav

On Sat, Jun 15, 2019 at 1:30 PM Daniel Mateus Pires <dmateusp@gmail.com>
wrote:

> Hi there!
>
> I am trying to optimize joins on data created by Spark, so I'd like to
> bucket the data to avoid shuffling.
>
> I write to immutable partitions every day: the data is first written to a
> local HDFS and then copied to S3. Is there a combination of bucketBy
> options and DDL that I can use so that Presto/Athena JOINs leverage the
> special layout of the data?
>
> e.g.
> CREATE EXTERNAL TABLE ...(on Presto/Athena)
> df.write.bucketBy(...).partitionBy(...). (in Spark)
> then copy this data to S3 with s3-dist-cp
> then MSCK REPAIR TABLE (on Presto/Athena)
>
> Daniel
>
>
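
For readers unfamiliar with why bucketing can avoid a shuffle, here is a minimal
pure-Python sketch of the idea, not of Spark's or Presto's actual implementation.
It assumes both tables are written with the same bucket count and the same hash
function, so matching keys always land in the same bucket number and each bucket
pair can be joined on its own. The names NUM_BUCKETS, bucket_id, write_bucketed
and bucketed_join are hypothetical, and CRC32 is only a deterministic stand-in
for whatever hash a given engine really uses. Whether two engines agree on that
hash is exactly what would need checking before relying on this layout.

```python
from collections import defaultdict
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count; both tables must use the same value


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in hash (CRC32), chosen only because it is deterministic.
    # It is NOT the hash used by Spark or by Hive/Presto/Athena.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket id, mimicking one file per bucket."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets=NUM_BUCKETS):
    """Join bucket i of `left` only against bucket i of `right`."""
    out = []
    for i in range(num_buckets):
        # Build a small lookup table for this bucket of the right side only.
        lookup = defaultdict(list)
        for key, value in right.get(i, []):
            lookup[key].append(value)
        # Probe it with the matching bucket of the left side.
        for key, value in left.get(i, []):
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out


orders = write_bucketed([("u1", "order-1"), ("u2", "order-2"), ("u1", "order-3")])
users = write_bucketed([("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")])
joined = sorted(bucketed_join(orders, users))
print(joined)
```

No rows ever move between bucket numbers during the join, which is the property
a query engine exploits. If the writer and the reader disagreed on bucket_id,
the same loop would silently miss matches, so the layout is only useful when the
engine reading the table computes buckets the same way as the engine that wrote it.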
