spark-user mailing list archives

From Arwin Tio <arwin....@hotmail.com>
Subject Parquet 'bucketBy' creates a ton of files
Date Thu, 04 Jul 2019 07:22:13 GMT
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

The problem is that my Spark job runs with about 500 partitions/tasks (I'm not sure of the
right terminology), and each task writes its own file for every bucket, so I end up with files that look like:

```
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet

part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet

part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
```

That's 500x500=250000 bucketed parquet files! It takes forever for the `FileOutputCommitter`
to commit that to S3.
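To spell out the arithmetic behind that count: in the worst case every write task holds rows for every bucket and writes one file per bucket, so the total is tasks × buckets. The sketch below is only an illustration of that multiplication; the class and method names are hypothetical and are not part of Spark.

```java
// Hypothetical helper (not a Spark API): estimates output file counts for a bucketed write.
final class BucketFileCount {
    private BucketFileCount() {}

    // Worst case: each write task has rows for every bucket and emits one file per bucket.
    static long maxFiles(int writeTasks, int numBuckets) {
        return (long) writeTasks * numBuckets;
    }

    // Hive-style target: exactly one file per bucket, regardless of task count.
    static long filesIfOnePerBucket(int numBuckets) {
        return numBuckets;
    }

    public static void main(String[] args) {
        // 500 write tasks and 500 buckets, as in the job described above.
        System.out.println(maxFiles(500, 500));
        System.out.println(filesIfOnePerBucket(500));
    }
}
```

With 500 tasks and 500 buckets the worst case is 250000 files, against 500 if each bucket were written as a single file.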

Is there a way to generate **one file per bucket**, like in Hive? Or is there a better way
to deal with this problem? As of now it seems like I have to choose between lowering the parallelism
of my cluster (fewer writers) and lowering the parallelism of my parquet files (fewer
buckets), and the latter would also lower the parallelism of my downstream jobs.

Thanks
