spark-user mailing list archives

From German SM <germanschia...@gmail.com>
Subject Re: Spark Small file issue
Date Tue, 23 Jun 2020 22:05:43 GMT
Hi,

When reducing the number of partitions, it is better to use coalesce because
it does not need to shuffle the data.

dataframe.coalesce(1)
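
As a minimal sketch (not tested against your job; `df` and the partition
column "dt" are assumptions for illustration), the write could look like
this:

```scala
// repartition(1) forces a full shuffle of every row into a single
// partition before writing:
// df.repartition(1).write.partitionBy("dt").orc("/tmp/temp_table")

// coalesce(1) merges the existing partitions without a shuffle, which is
// usually cheaper when you are only reducing the partition count:
df.coalesce(1)
  .write
  .mode("overwrite")
  .partitionBy("dt")   // hypothetical partition column
  .orc("/tmp/temp_table")
```

One caveat: coalesce(1) also reduces the parallelism of the upstream
stages. If the write is still slow, repartitioning by the partition column
(e.g. df.repartition($"dt")) tends to produce one file per partition folder
while keeping parallelism across folders.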

On Tue, Jun 23, 2020, 23:54 Hichki <harish.vs142@gmail.com> wrote:

> Hello Team,
>
>
>
> I am new to the Spark environment. I have converted a Hive query to Spark
> Scala. Now I am loading data and doing performance testing. Below are
> details on loading 3 weeks of data. The cluster-level small-file average
> size is set to 128 MB.
>
>
>
> 1. The new temp table I am loading data into is ORC formatted, as the
> current Hive table is stored as ORC.
>
> 2. Each Hive table partition folder is 200 MB.
>
> 3. I am using repartition(1) in the Spark code so that it creates one
> 200 MB part file in each partition folder (to avoid the small file issue).
> With this, the job completes in 23 to 26 minutes.
>
> 4. If I don't use repartition(), the job completes in 12 to 13 minutes,
> but this approach creates 800 part files (each smaller than 128 MB) in
> each partition folder.
>
>
>
> I am not sure how to reduce processing time while also avoiding small
> files. Could anyone please help me with this situation?
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
