spark-user mailing list archives

From nsalian <nsal...@cloudera.com>
Subject Re: spark parquet too many small files ?
Date Sat, 02 Jul 2016 00:35:23 GMT
Hi Sri,

Thanks for the question.
You can start by doing this at the initial read stage:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Using a JSON example here; coalesce(20) reduces the DataFrame to 20 partitions.
val customerList = sqlContext.read.json(args(0)).coalesce(20)

where the argument is the path to the file(s). This reduces the number of
partitions, and therefore the number of output files. You can repartition the
data further downstream; the goal is to end up with fewer files when you
finally save to Parquet (e.g. with write.parquet).
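For completeness, a minimal end-to-end sketch of the above (assumes an existing SparkContext `sc`; the input/output paths taken from `args` are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val customerList = sqlContext.read
  .json(args(0))   // path to the JSON file(s)
  .coalesce(20)    // merge down to at most 20 partitions -> at most 20 output files

// coalesce avoids a full shuffle when reducing the partition count;
// use repartition(n) instead if you need to increase the count or
// rebalance skewed partitions (it shuffles, but evens out partition sizes).
customerList.write.parquet(args(1))
```

Each partition is written as one Parquet part-file, so controlling the partition count before the write is what controls the file count.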

Hope that helps.



-----
Neelesh S. Salian
Cloudera
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

