spark-user mailing list archives

From Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com>
Subject Re: How do we control output part files created by Spark job?
Date Mon, 06 Jul 2015 19:22:59 GMT
Try the coalesce function to limit the number of part files.
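
A minimal sketch of how that suggestion might be applied, assuming you can estimate the total output size and have a per-file size in mind. The helper `targetPartitions` and the sizes used below are hypothetical and not part of Spark; they only illustrate picking the number to pass to `coalesce`. The Spark call itself is shown in a comment because it needs Spark on the classpath and an existing DataFrame.

```java
class PartFileCount {
    // Hypothetical helper (not a Spark API): how many part files to aim for,
    // given an estimated total output size and a desired size per file.
    static int targetPartitions(long totalBytes, long targetBytesPerFile) {
        if (totalBytes < 0 || targetBytesPerFile <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        // Ceiling division without risking overflow from adding the two values.
        long n = totalBytes / targetBytesPerFile
                + (totalBytes % targetBytesPerFile == 0 ? 0 : 1);
        // Always write at least one file; clamp to the int range coalesce accepts.
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long totalBytes = 10L * 1024 * 1024 * 1024; // assumed: about 10 GiB of output
        long perFile = 256L * 1024 * 1024;          // assumed: about 256 MiB per part file
        int n = targetPartitions(totalBytes, perFile);
        System.out.println(n); // prints 40

        // With Spark 1.4, the count would be applied just before saving, e.g.:
        //   dataFrame.coalesce(n).write().format("orc").save("/path/in/hdfs");
        // or, for an RDD:
        //   finalJavaRDD.coalesce(n)
    }
}
```

Note that `coalesce` merges existing partitions without a full shuffle, so it can only lower the partition count; `repartition` shuffles the data and can also raise it.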
On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.kacha@gmail.com> wrote:

> Hi, I have a couple of Spark jobs which process thousands of files every
> day. File sizes may vary from MBs to GBs. After finishing the job I usually
> save using the following code:
>
> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of Spark 1.4
>
> The Spark job creates plenty of small part files in the final output
> directory. As far as I understand, Spark creates a part file for each
> partition/task; please correct me if I am wrong. How do we control the number
> of part files Spark creates? Finally, I would like to create a Hive table
> using these Parquet/ORC directories, and I heard Hive is slow when we have a
> large number of small files. Please guide me, I am new to Spark. Thanks in
> advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
