spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Reynold Xin" <r...@databricks.com>
Subject Re: Avoiding unnnecessary sort in FileFormatWriter/DynamicPartitionDataWriter
Date Fri, 04 Sep 2020 17:33:21 GMT
The issue is memory overhead. Writing files create a lot of buffer (especially in columnar
formats like Parquet/ORC). Even a few file handlers and buffers per task can OOM the entire
process easily.

On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ < joaquin.guantergonzalbez@telefonica.com
> wrote:

> 
> 
> 
> Hello,
> 
> 
> 
> 
> 
> 
> 
> I have observed that if a DataFrame is saved with partitioning columns in
> Parquet, then a sort is performed in FileFormatWriter (see https:/ / github.
> com/ apache/ spark/ blob/ v3. 0. 0/ sql/ core/ src/ main/ scala/ org/ apache/
> spark/ sql/ execution/ datasources/ FileFormatWriter. scala#L152 (
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152
> ) ) because DynamicPartitionDataWriter only supports having a single file
> open at a time (see https:/ / github. com/ apache/ spark/ blob/ v3. 0. 0/ sql/
> core/ src/ main/ scala/ org/ apache/ spark/ sql/ execution/ datasources/ FileFormatDataWriter.
> scala#L170-L171 (
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L170-L171
> ) ). I think it would be possible to avoid this sort (which is a major
> bottleneck for some of my scenarios) if DynamicPartitionDataWriter could
> have multiple files open at the same time, and writing each piece of data
> to its corresponding file.
> 
> 
> 
> 
> 
> 
> 
> Would that change be a welcome PR for the project or is there any major
> problem that I am not considering that would prevent removing this sort?
> 
> 
> 
> 
> 
> 
> 
> Thanks,
> 
> 
> 
> Ximo.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Some more detail about the problem, in case I didn’t explain myself
> correctly: suppose we have a dataframe which we want to partition by
> column A:
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *Column A*
> 
> 
> 
> *Column B*
> 
> 
> 
> 4
> 
> 
> 
> A
> 
> 
> 
> 1
> 
> 
> 
> B
> 
> 
> 
> 2
> 
> 
> 
> C
> 
> 
> 
> 
> 
> 
> 
> 
> 
> The current behavior will first sort the dataframe:
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *Column A*
> 
> 
> 
> *Column B*
> 
> 
> 
> 1
> 
> 
> 
> B
> 
> 
> 
> 2
> 
> 
> 
> C
> 
> 
> 
> 4
> 
> 
> 
> A
> 
> 
> 
> 
> 
> 
> 
> 
> 
> So that DynamicPartitionDataWriter can have a single file open, since all
> the data for a single partition will be adjacent and can be iterated over
> sequentially. In order to process the first row,
> DynamicPartitionDataWriter will open a file in
> /columnA=1/part-r-00000-<uuid>.parquet and write the data. When processing
> the second row it will see it belongs to a different partition, closet he
> first file and open a new file in /columna=2/part-r-00000-<uuid>.parquet
> and so on.
> 
> 
> 
> 
> 
> 
> 
> My proposed change would involve changing DynamicPartitionDataWriter to
> have as many open files as partitions, and close them all once all data
> has been processed.
> 
> 
> 
> 
> Este mensaje y sus adjuntos se dirigen exclusivamente a su destinatario,
> puede contener información privilegiada o confidencial y es para uso
> exclusivo de la persona o entidad de destino. Si no es usted. el
> destinatario indicado, queda notificado de que la lectura, utilización,
> divulgación y/o copia sin autorización puede estar prohibida en virtud de
> la legislación vigente. Si ha recibido este mensaje por error, le rogamos
> que nos lo comunique inmediatamente por esta misma vía y proceda a su
> destrucción.
> 
> The information contained in this transmission is privileged and
> confidential information intended only for the use of the individual or
> entity named above. If the reader of this message is not the intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of this communication is strictly prohibited. If you have received
> this transmission in error, do not read it. Please immediately reply to
> the sender that you have received this communication in error and then
> delete it.
> 
> Esta mensagem e seus anexos se dirigem exclusivamente ao seu destinatário,
> pode conter informação privilegiada ou confidencial e é para uso exclusivo
> da pessoa ou entidade de destino. Se não é vossa senhoria o destinatário
> indicado, fica notificado de que a leitura, utilização, divulgação e/ou
> cópia sem autorização pode estar proibida em virtude da legislação
> vigente. Se recebeu esta mensagem por erro, rogamos-lhe que nos o
> comunique imediatamente por esta mesma via e proceda a sua destruição
>
Mime
View raw message