spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: schemaRDD.saveAsParquetFile creates large number of small parquet files ...
Date Thu, 29 Jan 2015 18:52:45 GMT
You can use coalesce or repartition to control the number of files output by
any Spark operation.
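
For example, something along these lines (a minimal sketch assuming
spark-shell, where sc is predefined; the case class, paths, and partition
count here are made up for illustration):

import org.apache.spark.sql.SQLContext

// Hypothetical schema for the CSV rows.
case class Record(id: Int, name: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD

val records = sc.textFile("hdfs:///data/big.csv")
  .map(_.split(","))
  .map(fields => Record(fields(0).trim.toInt, fields(1).trim))

// coalesce(n) merges the existing partitions without a shuffle;
// use repartition(n) instead if you also want to rebalance the data.
records.coalesce(16).saveAsParquetFile("hdfs:///data/out.parquet")

One parquet part file is written per partition, so coalescing to n
partitions before saving yields n part files.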

On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel <manojsameltech@gmail.com>
wrote:

> Spark 1.2 on Hadoop 2.3
>
> Read one big CSV file, create a SchemaRDD on it, and saveAsParquetFile.
>
> It creates a large number of small (~1 MB) parquet part-x- files.
>
> Any way to control this so that a smaller number of larger files is created?
>
> Thanks,
>
